
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: | |
Initial Amendment Date: | April 13, 2018 |
Latest Amendment Date: | April 13, 2018 |
Award Number: | 1755785 |
Award Instrument: | Standard Grant |
Program Manager: | Jie Yang, jyang@nsf.gov, (703) 292-4768, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | June 1, 2018 |
End Date: | May 31, 2021 (Estimated) |
Total Intended Award Amount: | $172,903.00 |
Total Awarded Amount to Date: | $172,903.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 300 TURNER ST NW, BLACKSBURG, VA, US 24060-3359, (540) 231-5281 |
Sponsor Congressional District: | |
Primary Place of Performance: | 1185 Perry Street, Blacksburg, VA, US 24061-0101 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Robust Intelligence |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Recent success in visual recognition relies on training deep neural networks (DNNs) on large-scale annotated image classification datasets in a fully supervised fashion. The learned representations encoded in the parameters of DNNs have shown remarkable transferability to a wide range of tasks. However, the dependency on supervised learning substantially limits scalability to new problem domains because manual labeling is often expensive and in some cases requires expertise. In contrast, a massive amount of free unlabeled images and videos is readily available on the Internet. This project develops algorithms that capitalize on large amounts of unlabeled videos for representation learning and adaptation. The developed methods significantly alleviate the high cost and scarcity of manual annotations for constructing large-scale datasets. The project involves both graduate and undergraduate students in the research. The research materials are also integrated into curriculum development in courses on deep learning for machine perception. Results will be disseminated through scientific publications, open-source software, and dataset releases.
This research tackles two key problems in representation learning. In the first research aim, the project simultaneously leverages spatial and temporal contexts in videos to learn generalizable representations. The research takes advantage of rich supervisory signals for representation learning from appearance variations and temporal coherence in videos. Compared to the supervised counterpart (which requires millions of manually labeled images), learning from unlabeled videos is inexpensive and is not limited in scope. The project also seeks to adapt the learned representation to handle appearance variations in new domains with minimal manual supervision. The effectiveness of representation adaptation is validated in the context of instance-level video object segmentation.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The goal of this project is to tackle two key problems in representation learning. First, the project simultaneously leverages spatial and temporal contexts in videos to learn generalizable representations for computer vision tasks. Second, the project seeks to adapt the learned representations to handle appearance variations in new domains with minimal manual supervision.
** Intellectual merit:
Through the project, we studied representation learning and adaptation from videos and disseminated our findings via publications at top-tier computer vision and machine learning conferences. Our key research findings include the following.
We first demonstrate that we can learn monocular depth estimation and optical flow estimation networks from unlabeled videos by leveraging temporal contexts. The core idea is to use cross-task geometric consistency to train the models jointly in a self-supervised manner. The resulting models achieve state-of-the-art performance without using any manually labeled training data. [1]
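To make the idea above concrete, below is a minimal sketch (in PyTorch) of how a cross-task consistency term between depth/pose predictions and optical flow predictions might be formed: predicted depth and relative camera pose induce a "rigid" flow field, and the flow network's output is encouraged to agree with it in valid regions. The function names, the construction of the validity mask, and the exact loss form are illustrative assumptions, not the precise formulation used in [1].

```python
import torch

def rigid_flow_from_depth(depth, pose, K, K_inv):
    """Project pixels of frame t into frame t+1 using predicted depth and a
    predicted relative camera pose, returning the induced "rigid" flow.
    depth: (B, 1, H, W); pose: (B, 3, 4) as [R | t]; K, K_inv: (B, 3, 3)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth.device)  # (B, 3, H*W)
    cam = (K_inv @ pix) * depth.view(B, 1, -1)        # back-project to 3D
    cam = pose[:, :, :3] @ cam + pose[:, :, 3:]       # apply relative pose
    proj = K @ cam                                    # re-project to frame t+1
    proj = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    return (proj - pix[:, :2]).view(B, 2, H, W)

def cross_task_consistency_loss(rigid_flow, pred_flow, valid_mask):
    """Penalize disagreement between the depth/pose-induced flow and the flow
    network's prediction, only inside valid (static, non-occluded) regions;
    how valid_mask is built is an assumption here."""
    return (valid_mask * (rigid_flow - pred_flow).abs()).mean()
```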
We then study the problem of representation learning for video activity recognition. Training a model on existing video datasets inevitably captures and leverages unwanted scene bias, so the learned representation may not generalize well to new action classes or different tasks. We propose to mitigate scene bias during video representation learning. Our proposed method shows consistent improvement over a baseline model trained without debiasing. [2]
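One common ingredient for this kind of debiasing is an auxiliary scene classifier trained through a gradient-reversal layer, so that the backbone is pushed to discard scene cues. The sketch below illustrates that ingredient only; the module and class names are hypothetical, and the full method in [2] involves more than this single adversarial term.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so the backbone is trained to *remove* scene information."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DebiasedActionModel(nn.Module):
    """Hypothetical wrapper: an action classifier plus an adversarial scene
    classifier attached through gradient reversal."""
    def __init__(self, backbone, feat_dim, num_actions, num_scenes, lam=1.0):
        super().__init__()
        self.backbone = backbone                  # e.g., a 3D CNN video encoder
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.scene_head = nn.Linear(feat_dim, num_scenes)
        self.lam = lam

    def forward(self, clips):
        feat = self.backbone(clips)
        action_logits = self.action_head(feat)
        # The scene head learns to predict the scene, while the reversed
        # gradient pushes the backbone toward scene-invariant features.
        scene_logits = self.scene_head(GradReverse.apply(feat, self.lam))
        return action_logits, scene_logits
```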
For representation adaptation, we show that adapting a carefully pretrained model with a standard multiclass loss can achieve highly competitive results compared with complicated meta-learning algorithms. Our findings have had a high impact on the field of meta-learning ([3] cited over 1,052 times as of August 2022).
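A minimal sketch of this baseline-style adaptation is shown below: a backbone pretrained with a standard multiclass cross-entropy loss is frozen, and a new linear classifier is fit on the few labeled support examples of the novel classes. The helper signature and hyperparameters are illustrative assumptions rather than the exact recipe in [3].

```python
import torch
import torch.nn as nn

def few_shot_adapt(backbone, support_x, support_y, n_way, feat_dim,
                   steps=100, lr=0.01):
    """Freeze a backbone pretrained with standard cross-entropy on base
    classes, then fit a fresh linear classifier on the support set of the
    novel classes. Query examples can be classified with clf(backbone(x))."""
    backbone.eval()
    with torch.no_grad():
        feats = backbone(support_x)          # (n_way * k_shot, feat_dim)
    clf = nn.Linear(feat_dim, n_way)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(clf(feats), support_y)
        loss.backward()
        opt.step()
    return clf
```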
We also propose a representation adaptation method across domains for activity recognition in videos. Specifically, we design a self-supervised loss using temporal contexts and an attention mechanism to filter out uninformative video clips during domain adaptation. [4]
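The sketch below illustrates the two ingredients mentioned above under assumed module names: a self-supervised head that predicts whether a pair of clips is in the correct temporal order, and attention weights that down-weight uninformative clips when pooling clip features for classification. It is a simplified illustration, not the exact architecture of [4].

```python
import torch
import torch.nn as nn

class ClipOrderAndAttention(nn.Module):
    """Hypothetical video model combining attention-weighted clip pooling for
    action classification with a self-supervised clip-order prediction head."""
    def __init__(self, clip_encoder, feat_dim, num_classes):
        super().__init__()
        self.clip_encoder = clip_encoder              # per-clip feature extractor
        self.order_head = nn.Linear(2 * feat_dim, 2)  # correct vs. shuffled order
        self.attn = nn.Linear(feat_dim, 1)
        self.cls_head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        # clips: (B, N, C, T, H, W) -> per-clip features (B, N, feat_dim)
        B, N = clips.shape[:2]
        feats = self.clip_encoder(clips.flatten(0, 1)).view(B, N, -1)
        # Attention-weighted pooling over clips for the action classifier,
        # so uninformative clips contribute less to the video-level feature.
        weights = torch.softmax(self.attn(feats), dim=1)      # (B, N, 1)
        video_feat = (weights * feats).sum(dim=1)
        action_logits = self.cls_head(video_feat)
        # Self-supervised clip-order logits for the first two clips per video.
        pair = torch.cat([feats[:, 0], feats[:, 1]], dim=-1)
        order_logits = self.order_head(pair)
        return action_logits, order_logits
```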
[1] Zou, Yuliang and Luo, Zelun and Huang, Jia-Bin (2018). DF-Net: Unsupervised Joint Learning of Depth and Flow Using Cross-Task Consistency. ECCV 2018.
[2] Choi, Jinwoo and Gao, Chen and Messou, Joseph and Huang, Jia-Bin (2019). Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. NeurIPS 2019.
[3] Chen, Wei-Yu and Liu, Yen-Cheng and Kira, Zsolt and Wang, Yu-Chiang Frank and Huang, Jia-Bin (2019). A Closer Look at Few-shot Classification. ICLR 2019.
[4] Choi, Jinwoo and Sharma, Gaurav and Schulter, Samuel and Huang, Jia-Bin (2020). Shuffle and Attend: Video Domain Adaptation. ECCV 2020.
The project funding provided graduate research assistantships and conference travel support for two PhD students at Virginia Tech. Both have now graduated with their PhDs.
** Broader impacts:
For representation learning, our work on few-shot learning (published at ICLR 2019) advances the field by providing a carefully designed benchmark, a strong baseline that requires no complex meta-learning algorithm, and a new problem setting of cross-domain few-shot recognition.
To date (08/08/2022), according to Google Scholar, the paper has been cited 1,052 times. There are several notable follow-up works based on our research, including:
Our own follow-up: Tseng, Hung-Yu and Lee, Hsin-Ying and Huang, Jia-Bin and Yang, Ming-Hsuan (2020). Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation. ICLR 2020.
From MIT: Tian, Yonglong and Wang, Yue and Krishnan, Dilip and Tenenbaum, Joshua B. and Isola, Phillip (2020). Rethinking Few-Shot Image Classification: A Good Embedding Is All You Need? ECCV 2020.
From UCSD/UC Berkeley: Chen, Yinbo and Liu, Zhuang and Xu, Huijuan and Darrell, Trevor and Wang, Xiaolong (2021). Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. ICCV 2021.
Last Modified: 08/09/2022
Modified by: Jia-Bin Huang