
Award Abstract # 2150012
CAREER: Weakly-Supervised Visual Scene Understanding: Combining Images and Videos, and Going Beyond Semantic Tags

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF WISCONSIN SYSTEM
Initial Amendment Date: September 23, 2021
Latest Amendment Date: May 24, 2022
Award Number: 2150012
Award Instrument: Continuing Grant
Program Manager: Jie Yang (jyang@nsf.gov, (703) 292-4768)
IIS, Division of Information & Intelligent Systems
CSE, Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2021
End Date: March 31, 2024 (Estimated)
Total Intended Award Amount: $500,499.00
Total Awarded Amount to Date: $332,659.00
Funds Obligated to Date: FY 2019 = $15,563.00
FY 2020 = $111,193.00
FY 2021 = $167,524.00
FY 2022 = $38,379.00
History of Investigator:
  • Yong Jae Lee (Principal Investigator)
    yongjaelee@cs.wisc.edu
Recipient Sponsored Research Office: University of Wisconsin-Madison
21 N PARK ST STE 6301
MADISON
WI  US  53715-1218
(608)262-3822
Sponsor Congressional District: 02
Primary Place of Performance: University of Wisconsin-Madison
21 North Park Street Suite 640
Madison
WI  US  53715-1218
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): LCLSJAGTNZQ7
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
01002223DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045, 7495
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The internet provides an endless supply of images and videos, replete with weakly-annotated meta-data such as text tags, GPS coordinates, timestamps, or social media sentiments. This huge resource of visual data provides an opportunity to create scalable and powerful recognition algorithms that do not depend on expensive human annotations. The research component of this project develops novel visual scene understanding algorithms that can effectively learn from such weakly-annotated visual data. The main novelty is to learn from images and videos jointly. The developed algorithms could have broad impact in numerous fields, including AI, security, and the agricultural sciences. In addition to scientific impact, the project includes complementary educational and outreach activities. Specifically, it provides mentorship to high school, undergraduate, and graduate students; teaches new undergraduate and graduate computer vision courses that have been lacking at UC Davis; and organizes an international workshop on weakly-supervised visual scene understanding.

This project develops novel algorithms to advance weakly-supervised visual scene understanding in two complementary ways: (1) learning jointly from both images and videos to take advantage of their complementarity, and (2) learning from weak supervisory signals that go beyond standard semantic tags, such as timestamps, captions, and relative comparisons. Specifically, it investigates novel approaches to advance tasks such as fully automatic video object segmentation, weakly-supervised object detection, unsupervised learning of object categories, and mining of localized patterns in image/video data that are correlated with the weak supervisory signal. Throughout, the project explores ways to understand and mitigate noise in the weak labels and to overcome the domain differences between images and videos.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing 1-10 of 26 publications)
Cai, Mu and Liu, Haotian and Mustikovela, Siva and Meyer, Gregory and Chai, Yuning and Park, Dennis and Lee, Yong Jae. "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts." 2024.
Huang, Zeyi and Wang, Haohan and Huang, Dong and Lee, Yong Jae and Xing, Eric P. "The Two Dimensions of Worst-case Training and Their Integrated Effect for Out-of-domain Generalization." Conference on Computer Vision and Pattern Recognition (CVPR), 2022. https://doi.org/10.1109/cvpr52688.2022.00941
Huang, Zeyi and Zhou, Andy and Lin, Zijian and Cai, Mu and Wang, Haohan and Lee, Yong Jae. "A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance." 2023. https://doi.org/10.1109/ICCV51070.2023.01073
Li, Chunyuan and Liu, Haotian and Li, Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Hu, Houdong and Liu, Zicheng and Lee, Yong Jae and Gao, Jianfeng. "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models." Neural Information Processing Systems (NeurIPS), 2022.
Liu, H and Li, C and Wu, Q and Lee, Yong Jae. "Visual Instruction Tuning." 2023.
Liu, Haotian and Cai, Mu and Lee, Yong Jae. "Masked Discrimination for Self-Supervised Learning on Point Clouds." European Conference on Computer Vision (ECCV), 2022. https://doi.org/10.1007/978-3-031-20086-1_38
Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae. "Improved Baselines with Visual Instruction Tuning." 2024.
Liu, Haotian and Rivera-Soto, Rafael and Xiao, Fanyi and Lee, Yong Jae. "YolactEdge: Real-time Instance Segmentation on the Edge." IEEE International Conference on Robotics and Automation (ICRA), 2021.
Liu, Haotian and Son, Kilho and Yang, Jianwei and Liu, Ce and Gao, Jianfeng and Lee, Yong Jae and Li, Chunyuan. "Learning Customized Visual Models with Retrieval-Augmented Knowledge." 2023. https://doi.org/10.1109/CVPR52729.2023.01454
Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae. "GLIGEN: Open-Set Grounded Text-to-Image Generation." 2023. https://doi.org/10.1109/CVPR52729.2023.02156
Li, Yuheng and Li, Yijun and Lu, Jingwan and Shechtman, Eli and Lee, Yong Jae and Singh, Krishna Kumar. "Collaging Class-specific GANs for Semantic Image Synthesis." International Conference on Computer Vision (ICCV), 2021. https://doi.org/10.1109/iccv48922.2021.01415

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of this project was to develop novel weakly-supervised computer vision algorithms. In particular, it investigated two main thrusts: (1) Enhancing the input data by learning jointly from weakly-labeled images and videos, to combine complementary advantages of both domains: diverse and high-quality information from images, and motion and temporal information from videos; and (2) Enhancing the weak supervisory signal by going beyond semantic tags (i.e., object/scene/action/attribute labels) to learn from weak annotations such as captions, timestamps, GPS coordinates, and relative comparisons.

In terms of intellectual merit, there were broadly three key areas of technical contribution. The first is the development of novel weakly-supervised algorithms for visual understanding. The second is the development of novel algorithms for controllable image generation. The third is the development of novel multimodal vision-language assistants. The work produced 38 peer-reviewed papers in top-tier computer vision and machine learning conferences, along with new publicly available codebases for the algorithms, which are linked from https://pages.cs.wisc.edu/~yongjaelee/. The research results were also regularly presented by the PI at international meetings and university seminars.

In terms of broader impact, the main project outcomes were graduate student mentorship and training, outreach activities to promote wider participation of underrepresented students in CS and STEM education, and broad scientific impact of the algorithms. In particular, the project helped train MS and PhD students in conducting and presenting research on the project's topics. Several MS and PhD students completed their degrees and accepted new PhD, postdoc, and industry research positions. The project's outreach component contributed to efforts that widen underrepresented student participation in STEM.

 


Last Modified: 07/15/2024
Modified by: Yong Jae Lee


