Award Abstract # 1755785
CRII: RI: Representation Learning and Adaptation using Unlabeled Videos

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: VIRGINIA POLYTECHNIC INSTITUTE & STATE UNIVERSITY
Initial Amendment Date: April 13, 2018
Latest Amendment Date: April 13, 2018
Award Number: 1755785
Award Instrument: Standard Grant
Program Manager: Jie Yang
jyang@nsf.gov
 (703)292-4768
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: June 1, 2018
End Date: May 31, 2021 (Estimated)
Total Intended Award Amount: $172,903.00
Total Awarded Amount to Date: $172,903.00
Funds Obligated to Date: FY 2018 = $172,903.00
History of Investigator:
  • Jia-Bin Huang (Principal Investigator)
    jbhuang0604@gmail.com
Recipient Sponsored Research Office: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
(540)231-5281
Sponsor Congressional District: 09
Primary Place of Performance: Virginia Polytechnic Institute and State University
1185 Perry Street
Blacksburg
VA  US  24061-0101
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): QDE5UHE5XD16
Parent UEI: X6KEFGLHSJX7
NSF Program(s): Robust Intelligence
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7495, 8228
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Recent success in visual recognition relies on training deep neural networks (DNNs) on large-scale annotated image classification datasets in a fully supervised fashion. The learned representations encoded in the parameters of DNNs have shown remarkable transferability to a wide range of tasks. However, the dependency on supervised learning substantially limits scalability to new problem domains because manual labeling is often expensive and in some cases requires expertise. In contrast, massive amounts of free, unlabeled images and videos are readily available on the Internet. This project develops algorithms that capitalize on large amounts of unlabeled videos for representation learning and adaptation. The developed methods significantly alleviate the high cost and scarcity of manual annotations for constructing large-scale datasets. The project involves both graduate and undergraduate students in the research. The research materials are also integrated into curriculum development in courses on deep learning for machine perception. Results will be disseminated through scientific publications, open-source software, and dataset releases.

This research tackles two key problems in representation learning. In the first research aim, the project simultaneously leverages spatial and temporal contexts in videos to learn generalizable representations. The research takes advantage of the rich supervisory signals for representation learning provided by appearance variations and temporal coherence in videos. Compared to the supervised counterpart (which requires millions of manually labeled images), learning from unlabeled videos is inexpensive and is not limited in scope. The project also seeks to adapt the learned representations to handle appearance variations in new domains with minimal manual supervision. The effectiveness of representation adaptation is validated in the context of instance-level video object segmentation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Chen, Wei-Yu and Liu, Yen-Cheng and Kira, Zsolt and Wang, Yu-Chiang Frank and Huang, Jia-Bin. "A Closer Look at Few-shot Classification." International Conference on Learning Representations (ICLR), 2019.
Choi, Jinwoo and Gao, Chen and Messou, Joseph and Huang, Jia-Bin. "Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition." Advances in Neural Information Processing Systems (NeurIPS), 2019.
Choi, Jinwoo and Sharma, Gaurav and Chandraker, Manmohan and Huang, Jia-Bin. "Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones." IEEE Winter Conference on Applications of Computer Vision (WACV), 2020. https://doi.org/10.1109/WACV45572.2020.9093511
Hu, Yuan-Ting and Huang, Jia-Bin and Schwing, Alexander. "VideoMatch: Matching Based Video Object Segmentation." European Conference on Computer Vision (ECCV), 2018. https://doi.org/10.1007/978-3-030-01237-3_4
Tseng, Hung-Yu and Lee, Hsin-Ying and Huang, Jia-Bin and Yang, Ming-Hsuan. "Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation." International Conference on Learning Representations (ICLR), 2020.
Zou, Yuliang and Ji, Pan and Tran, Quoc-Huy and Huang, Jia-Bin and Chandraker, Manmohan. "Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling." European Conference on Computer Vision (ECCV), 2020. https://doi.org/10.1007/978-3-030-58568-6_42
Zou, Yuliang and Luo, Zelun and Huang, Jia-Bin. "DF-Net: Unsupervised Joint Learning of Depth and Flow Using Cross-Task Consistency." European Conference on Computer Vision (ECCV), 2018. https://doi.org/10.1007/978-3-030-01228-1_3

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of this project is to tackle two key problems in representation learning. First, the project simultaneously leverages spatial and temporal contexts in videos to learn generalizable representations for computer vision tasks. Second, the project seeks to adapt the learned representations to handle appearance variations in new domains with minimal manual supervision.

** Intellectual merit

Through this project, we studied representation learning and adaptation from videos and disseminated our findings via publications at top-tier computer vision and machine learning conferences. Our key research findings are summarized below.

We first demonstrated that monocular depth estimation and optical flow estimation networks can be learned from unlabeled videos by leveraging temporal contexts. The core idea is to use cross-task geometric consistency to train the two models jointly in a self-supervised manner. The resulting models achieve state-of-the-art performance without using any manually labeled training data. [1]
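
A minimal sketch of this idea follows (this is not the released DF-Net implementation; the function names, tensor shapes, and the simple L1 penalty are our assumptions). It synthesizes the rigid flow implied by the predicted depth and camera motion and penalizes its disagreement with the flow network's prediction on pixels assumed to be static and non-occluded.

import torch

def rigid_flow_from_depth(depth, pose, K):
    """Synthesize the optical flow induced purely by camera motion.
    depth: (B, 1, H, W) predicted depth
    pose:  (B, 3, 4) relative camera pose [R | t] from frame t to t+1
    K:     (B, 3, 3) camera intrinsics
    Returns rigid flow of shape (B, 2, H, W)."""
    B, _, H, W = depth.shape
    dev, dt = depth.device, depth.dtype
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                            torch.arange(W, device=dev, dtype=dt),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)   # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                # (B, 3, H*W)
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)    # back-project to 3D
    cam = pose[:, :, :3] @ cam + pose[:, :, 3:]               # apply [R | t]
    proj = K @ cam                                            # re-project to pixels
    proj = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)          # perspective divide
    return (proj - pix[:, :2]).view(B, 2, H, W)

def cross_task_consistency_loss(flow_pred, depth, pose, K, valid_mask):
    """Penalize disagreement between the flow network's prediction and the
    depth/ego-motion-induced rigid flow inside a mask (valid_mask, shape
    (B, 1, H, W) in {0, 1}) of pixels assumed static and non-occluded."""
    rigid = rigid_flow_from_depth(depth, pose, K)
    return (valid_mask * (flow_pred - rigid).abs()).mean()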

We then studied the problem of representation learning for video activity recognition. Training a model on existing video datasets inevitably captures and leverages unwanted scene bias, so the learned representation may not generalize well to new action classes or different tasks. We proposed to mitigate scene bias during video representation learning. Our method shows consistent improvements over baseline models trained without debiasing. [2]

For representation adaptation, we showed that adapting a carefully pretrained model with a standard multiclass loss can achieve highly competitive results compared with complicated meta-learning algorithms. Our findings have had a high impact on the field of meta-learning ([3] has been cited 1,052 times as of August 2022).
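
As a hedged illustration of this recipe (hypothetical names, not the released code from [3]): the backbone is assumed to be pretrained with a standard multiclass cross-entropy loss on the base classes; for a new few-shot task it is frozen, and only a new linear classifier is fit on the handful of labeled support examples.

import torch
import torch.nn as nn

def adapt_to_novel_classes(backbone, support_x, support_y, n_way,
                           steps=100, lr=0.01):
    """Few-shot adaptation of a conventionally pretrained model.
    backbone:  feature extractor pretrained with multiclass cross-entropy
    support_x: (n_way * k_shot, C, H, W) labeled support images
    support_y: (n_way * k_shot,) integer labels in [0, n_way)"""
    backbone.eval()                                  # keep the backbone frozen
    with torch.no_grad():
        feats = backbone(support_x)                  # (N, feat_dim)
    classifier = nn.Linear(feats.shape[1], n_way)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                  # same multiclass loss as pretraining
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(classifier(feats), support_y).backward()
        opt.step()
    return classifier    # score queries with backbone(x) followed by classifier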

We also proposed a representation adaptation method across domains for activity recognition in videos. Specifically, we designed a self-supervised loss that uses temporal contexts and an attention mechanism that filters out uninformative video clips during domain adaptation. [4]
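
The sketch below is a hedged illustration of this design (module names, tensor shapes, and the exact auxiliary task are our assumptions rather than the implementation from [4]): an auxiliary head is trained to distinguish temporally ordered clip sequences from shuffled ones, while learned attention weights down-weight uninformative clips when pooling clip features for action classification.

import torch
import torch.nn as nn

class ClipOrderAndAttention(nn.Module):
    """Auxiliary clip-order prediction plus attention-weighted clip pooling."""
    def __init__(self, feat_dim, n_clips, n_classes):
        super().__init__()
        self.order_head = nn.Linear(feat_dim * n_clips, 2)  # ordered vs. shuffled
        self.attn = nn.Linear(feat_dim, 1)                   # per-clip attention score
        self.cls = nn.Linear(feat_dim, n_classes)            # action classifier

    def forward(self, clip_feats):
        # clip_feats: (B, n_clips, feat_dim) from a shared clip encoder
        B, T, D = clip_feats.shape
        # Self-supervised signal: shuffle clips in time (for simplicity this
        # sketch does not exclude the identity permutation) and train a small
        # head to tell ordered sequences from shuffled ones.
        shuffled = clip_feats[:, torch.randperm(T, device=clip_feats.device)]
        order_logits = self.order_head(torch.cat(
            [clip_feats.reshape(B, -1), shuffled.reshape(B, -1)], dim=0))
        order_labels = torch.cat(
            [torch.ones(B, dtype=torch.long, device=clip_feats.device),
             torch.zeros(B, dtype=torch.long, device=clip_feats.device)])
        # Attention-weighted pooling down-weights uninformative clips before
        # the video-level action classifier.
        w = torch.softmax(self.attn(clip_feats), dim=1)       # (B, T, 1)
        video_feat = (w * clip_feats).sum(dim=1)              # (B, D)
        return self.cls(video_feat), order_logits, order_labels

During adaptation, one natural use of these outputs is to apply the cross-entropy order-prediction loss on both source and unlabeled target videos, while the action classification loss uses labels from the source domain only.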

[1] Zou, Yuliang and Luo, Zelun and Huang, Jia-Bin (2018). DF-Net: Unsupervised Joint Learning of Depth and Flow Using Cross-Task Consistency. ECCV 2018

[2] Choi, Jinwoo and Gao, Chen and Messou, Joseph and Huang, Jia-Bin (2019). Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. NeurIPS 2019

[3] Chen, Wei-Yu and Liu, Yen-Cheng and Kira, Zsolt and Wang, Yu-Chiang Frank and Huang, Jia-Bin (2019). A Closer Look at Few-shot Classification. ICLR 2019

[4] Choi, Jinwoo and Sharma, Gaurav and Schulter, Samuel and Huang, Jia-Bin (2020). Shuffle and Attend: Video Domain Adaptation. ECCV 2020

The project funding provided graduate research assistantships and conference travel support for two PhD students at Virginia Tech. Both have since graduated with their PhDs.

** Broader impacts

For representation learning, our work on few-shot learning (published at ICLR 2019) advances the field by providing a carefully designed benchmark, a strong baseline that does not require complex meta-learning algorithms, and a new cross-domain few-shot recognition problem setting.

To date (08/08/2022), according to Google Scholar, the paper has been cited 1,052 times. There are several notable follow-up works based on our research, including:

our own follow-up research: Tseng, Hung-Yu and Lee, Hsin-Ying and Huang, Jia-Bin and Yang, Ming-Hsuan (2020). Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation. ICLR 2020

and other groups, e.g.,

from MIT: Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, Phillip Isola. Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? ECCV 2020.

from UCSD/UC Berkeley: Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, Xiaolong Wang. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. ICCV 2021.

Our research on video domain adaptation is integrated into the "Video Recognition" module of the ECE 5554 Computer Vision course and the "Domain Adaptation" module of the ECE 5524 Machine Learning course at Virginia Tech. These courses, taught in Fall 2018 and Fall 2019, have high enrollment (around 70 graduate students and 30 undergraduate students).

Last Modified: 08/09/2022
Modified by: Jia-Bin Huang
