
NSF Org: IIS Division of Information & Intelligent Systems
Recipient:
Initial Amendment Date: August 25, 2014
Latest Amendment Date: June 17, 2016
Award Number: 1422767
Award Instrument: Continuing Grant
Program Manager: Maria Zemankova (IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering)
Start Date: September 1, 2014
End Date: August 31, 2018 (Estimated)
Total Intended Award Amount: $450,000.00
Total Awarded Amount to Date: $450,000.00
Funds Obligated to Date: FY 2016 = $81,304.00
History of Investigator:
Recipient Sponsored Research Office: 5000 FORBES AVE, PITTSBURGH, PA, US 15213-3890, (412) 268-8746
Sponsor Congressional District:
Primary Place of Performance: 5000 Forbes Ave, Pittsburgh, PA, US 15213-3890
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
This project develops new data-driven techniques for egocentric (first-person) video stream analysis that exploit the structure and redundancy in streams captured over days, months, and even years, to significantly reduce the size of these datasets without losing the most useful visual information. Simultaneously, the research team is developing parallel programming frameworks that simplify expression and acceleration of these video analysis algorithms at scale. While the focus of this research is the design of core algorithms and systems, success stands to enable the development of new classes of applications (in domains such as navigation, personal assistance, health/behavior monitoring) that use the extensive visual history of a camera to intelligently interpret continuous visual data sources and immediately respond to the observed input. A further output of this research is the collection and organization of a large egocentric video database from the life of a single individual.
The core idea of this research is to identify and exploit redundancy in everyday life. While it is not tractable to maintain an easily analyzable representation of all video ever seen by a camera, it is likely possible to identify the most important visual information and provide future applications fast access to it. The challenge is to determine what visual data is the most important. This work explores the use of video stream predictability as a notion of importance. Specifically, the vast visual history of the camera (e.g., life experiences captured by a head-mounted camera) is used to make predictions about what the camera will see next, and the accuracy of these predictions dictates what data is retained. (Highly predictable occurrences are judged to be less valuable to retain in the database.) In addition, this research is characterizing the structure of always-on egocentric video streams (What is the "working set" of a person's day? How much novel information is collected from day to day?), leveraging this structure to inform the design of new algorithms for video corpus analysis (data compression, accelerated retrieval), and exploring the design of specialized programming abstractions for authoring visual data understanding applications at scale.
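As an illustration of this prediction-driven retention idea, the following is a minimal, hypothetical Python sketch (not the project's actual algorithm): an online linear predictor guesses the next frame's feature vector from the previous one, and only frames the predictor fails to anticipate are retained. The function and parameter names (retain_surprising_frames, error_threshold) are invented for illustration.

import numpy as np

def retain_surprising_frames(features, error_threshold=0.5, lr=0.01):
    # features: iterable of 1-D per-frame feature vectors (e.g., embeddings)
    retained = []
    W = None                                       # online linear next-frame predictor
    prev = None
    for t, f in enumerate(features):
        f = np.asarray(f, dtype=np.float64)
        if W is None:
            W = np.eye(f.size)                     # start as a copy-last-frame model
        if prev is None:
            retained.append(t)                     # always keep the first frame
        else:
            pred = W @ prev
            error = np.linalg.norm(pred - f) / (np.linalg.norm(f) + 1e-8)
            if error > error_threshold:
                retained.append(t)                 # poorly predicted => judged important
            W -= lr * np.outer(pred - f, prev)     # online gradient step on the predictor
        prev = f
    return retained

In this sketch, highly predictable frames are simply dropped; a real system could instead store them at lower fidelity or index them differently.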
URL: http://graphics.cs.cmu.edu/projects/egocentricPrediction
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The project was based on the idea that continuous, large-scale video capture presents unique opportunities to develop new video understanding algorithms and to better understand the systems requirements of applications that process always-on video. Our work on the project was organized into three main themes:
(1) Long-Running Video Dataset Capture
The project resulted in the creation of two large video datasets for the computer vision community. These datasets differ from most existing video datasets in that they consist of a small number of long-running video streams rather than a large number of short video clips. The first is the KrishnaCam dataset, a 70-hour outdoor egocentric dataset spanning the life of a single computer vision graduate student (he literally dedicated his life to science!). This dataset captures a diverse range of outdoor scenes in the greater Pittsburgh area, ranging from urban cityscapes to city parks, and spans multiple seasons and times of day. The second is the Long Video Streams Dataset (LVS), a dataset of 30-minute clips from 30 HD video streams. LVS features a diverse array of challenges: fixed-viewpoint cameras, constantly moving and zooming television cameras, and hand-held and egocentric video.
(2) New Algorithms for Increasing the Efficiency of Image/Video Understanding
We developed a number of techniques aimed at improving the efficiency of DNN inference on images and video (accuracy per unit cost). The first was a DNN-design methodology we called HydraNets, which enables state-of-the-art image classification architectures to be transformed into dynamic architectures that exploit conditional execution for efficient inference. HydraNets are wide networks containing distinct components, each specialized to compute features for visually similar classes, but they retain efficiency by dynamically selecting only a small number of components to evaluate for any one input image. In other words, the main idea behind HydraNets is that while a large amount of DNN capacity is necessary to accurately classify all the images a network might encounter, only a small fraction of that capacity is needed to correctly classify any one image.
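To make the conditional-execution idea concrete, here is a minimal, hypothetical PyTorch sketch (not the published HydraNet architecture): a shared stem computes features, a gate scores a set of specialized branches, and only the top-k branches are evaluated for each input. The class and parameter names (ToyHydraNet, num_branches, k) are assumptions for illustration.

import torch
import torch.nn as nn

class ToyHydraNet(nn.Module):
    def __init__(self, num_branches=8, k=2, feat_dim=64, num_classes=100):
        super().__init__()
        self.k = k
        self.stem = nn.Sequential(                          # shared computation
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten())                                   # -> (N, feat_dim)
        self.gate = nn.Linear(feat_dim, num_branches)       # scores each branch
        self.branches = nn.ModuleList(                      # specialized components
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
            for _ in range(num_branches))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        h = self.stem(x)                                    # shared features
        scores = self.gate(h)                               # (N, num_branches)
        topk = scores.topk(self.k, dim=1).indices           # branches to execute
        out = torch.zeros_like(h)
        for b, branch in enumerate(self.branches):
            mask = (topk == b).any(dim=1)                   # inputs routed to branch b
            if mask.any():
                out[mask] = out[mask] + branch(h[mask])     # run only selected branches
        return self.classifier(out / self.k)                # classify averaged features

For example, logits = ToyHydraNet()(torch.randn(4, 3, 32, 32)) evaluates only 2 of the 8 branches per input, which is where the efficiency comes from.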
Building on the idea of dynamic DNN execution, we developed online model distillation, an approach for training efficient models that are specialized to the content of specific video streams. The central motivation for this work was that most video cameras observe a very small fraction of the visual world, and that fraction, in many cases, evolves over time. For example, stationary cameras observe scenes that change with time of day or weather conditions, TV cameras pan and zoom, smartphone videos are hand-held, and egocentric cameras on vehicles or robots move through dynamic scenes.
Therefore, rather than pre-training specialized models on camera-specific datasets curated in advance, we train models online on a live video stream as new frames arrive. Specifically, we employed the well-known technique of model distillation, training a lightweight "student" model to reproduce the predictions of a larger, reliable, high-capacity "teacher", but did so in an online fashion, intermittently running the teacher on the live stream to provide targets for student learning. We found that simple DNN models can be accurate, provided they are continuously adapted to the specific contents of a video stream as new frames arrive (that is, models can learn to cheat: segmenting people sitting on a park lawn might be as easy as looking for shades of green!). We applied this methodology to the task of realizing high-accuracy, low-cost semantic segmentation models that continuously adapt to the contents of a video stream, and created models that were 20-30X more efficient than state-of-the-art DNN architectures with little loss in accuracy on the target video stream. Although our initial efforts focused on segmentation efficiency, we believe continuous online training will prove valuable in many future settings for improving inference efficiency or even accuracy.
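The following is a minimal, hypothetical PyTorch sketch of the online distillation loop described above (not the project's exact training recipe): the lightweight student runs on every frame, while the expensive teacher is queried only every few frames to provide soft targets for an online gradient step. The function and parameter names (online_distillation, teacher_every) are assumptions for illustration.

import torch
import torch.nn.functional as F

def online_distillation(stream, student, teacher, optimizer,
                        teacher_every=16, device="cuda"):
    # stream: iterable of (3, H, W) frame tensors
    # student/teacher: models mapping a (1, 3, H, W) batch to (1, C, H, W) logits
    teacher.eval()
    for t, frame in enumerate(stream):
        frame = frame.unsqueeze(0).to(device)
        student_logits = student(frame)            # cheap model runs on every frame

        if t % teacher_every == 0:
            # Intermittently run the expensive teacher to get a soft target.
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(frame), dim=1)
            # Distillation loss: the student mimics the teacher's soft predictions.
            loss = F.kl_div(F.log_softmax(student_logits, dim=1),
                            teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                       # student adapts to the stream's current content

        yield student_logits.argmax(dim=1)         # per-pixel predictions for this frame

Because training targets come from the teacher rather than human labels, the student can keep adapting indefinitely as the camera's surroundings change.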
(3) Systems Infrastructure for Analyzing Large Video Datasets
Of course, analyzing large video datasets incurs significant compute and storage costs, and it requires significant parallel computing expertise to develop applications that scale to large video collections or large machines. To make it both more productive and more efficient to develop large-scale video processing applications, we developed Scanner, a platform for video processing at cloud scale. Scanner was developed in part under this project, as well as under IIS-1539069, and is now an open-source project available to the public at http://scanner.run. We are aware of Scanner being used at a number of academic institutions, as well as in industry.
In addition to these three research thrusts, early work on this project catalyzed broader interest in systems support for video processing at the PI's home institution, shaping the technical direction of new research programs such as the CMU-hosted Intel Science and Technology Center for Visual Cloud Computing.
Last Modified: 05/13/2019
Modified by: Kayvon Fatahalian