Award Abstract # 2143197
RI: Small: Visual How: Task Understanding and Description in the Real World

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: REGENTS OF THE UNIVERSITY OF MINNESOTA
Initial Amendment Date: June 15, 2022
Latest Amendment Date: June 15, 2022
Award Number: 2143197
Award Instrument: Standard Grant
Program Manager: Jie Yang
jyang@nsf.gov
(703) 292-4768
IIS: Division of Information & Intelligent Systems
CSE: Directorate for Computer and Information Science and Engineering
Start Date: June 15, 2022
End Date: May 31, 2025 (Estimated)
Total Intended Award Amount: $262,237.00
Total Awarded Amount to Date: $262,237.00
Funds Obligated to Date: FY 2022 = $262,237.00
History of Investigator:
  • Qi Zhao (Principal Investigator)
    qzhao@umn.edu
Recipient Sponsored Research Office: University of Minnesota-Twin Cities
2221 UNIVERSITY AVE SE STE 100
MINNEAPOLIS
MN  US  55414-3074
(612)624-5599
Sponsor Congressional District: 05
Primary Place of Performance: University of Minnesota-Twin Cities
4-192, 200 Union Street SE
Minneapolis
MN  US  55455-0169
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): KABJZBBJ4B54
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01002223DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7495, 7923
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Problem solving is a capability that humans have developed through evolution and experience. Compared with human intelligence, which can solve general and complex problems, current AI systems perform well only in narrow and structured tasks. With the overarching goal of bridging this gap, this project develops AI systems that can understand general real-world tasks (e.g., How to set up a tent? How to teach kids to garden? How to travel in London?) and generate solutions with step-by-step language and visual guidance. It will allow real-world tasks to be solved even in general and complex circumstances, resulting in more human-like AI. Ultimately, the project takes a step toward artificial general intelligence. The project will provide a publicly available dataset, a framework of computational models, and a mobile application prototype. Furthermore, it will support integrated research and education, with a focus on increasing minority participation through K-12 outreach, mentoring of underrepresented and undergraduate students, and curriculum development.

This project proposes the VisualHow problem, which represents a rich spectrum of real-world tasks. The generality and complexity of the problem call for the ability to understand the visual and textual contents of a task, reason with knowledge relevant to the task, and generate step-by-step multimodal descriptions of how the task can be completed. The project pursues these goals through three tasks. First, it will construct a new dataset of diverse, real-world tasks and solutions, with rich annotations of key semantics and task structures that guide multimodal attention and structural reasoning. Second, it will develop a novel framework in which a series of models is derived for explainable VisualHow learning, to understand visual-textual contents and generate the steps needed to complete real-world tasks. Third, it will develop novel methods to generalize the models with knowledge and validate them on mobile platforms that assist people in real-world applications. Achieving these goals will not only lead to new vision-language tasks and computational methods for real-world problem solving, but also spur innovations in the development of explainable and generalizable AI models and systems.
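To make the dataset description above concrete, the sketch below shows one possible way a VisualHow example could be represented in code. This is an illustrative assumption rather than the project's actual schema: the class names (Step, VisualHowTask), the field names, and the sample tent-pitching content are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Step:
    """One solution step: an image paired with a textual instruction,
    plus annotated key semantics in each modality (hypothetical fields)."""
    image_path: str                                   # illustrative image for this step
    instruction: str                                  # natural-language description of the step
    key_regions: List[str] = field(default_factory=list)  # annotated key semantics in the image
    key_words: List[str] = field(default_factory=list)    # annotated key semantics in the text

@dataclass
class VisualHowTask:
    """A real-world task: a goal plus its solution steps and their dependency structure."""
    goal: str                                         # e.g., "How to set up a tent?"
    steps: List[Step]
    dependencies: List[Tuple[int, int]] = field(default_factory=list)  # (i, j): step i precedes step j

# Example instance (content is made up for illustration only).
task = VisualHowTask(
    goal="How to set up a tent?",
    steps=[
        Step("images/tent_step1.jpg",
             "Lay out the tent footprint on flat ground.",
             key_regions=["footprint", "ground"],
             key_words=["lay out", "footprint"]),
        Step("images/tent_step2.jpg",
             "Assemble the poles and thread them through the sleeves.",
             key_regions=["poles", "sleeves"],
             key_words=["assemble", "thread"]),
    ],
    dependencies=[(0, 1)],
)
print(f"{task.goal} -> {len(task.steps)} steps, {len(task.dependencies)} ordering constraints")
```

Recording step dependencies explicitly, rather than assuming a strict linear sequence, is one way such a dataset could capture the task structures that the abstract says will guide structural reasoning.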

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Chen, X. and Jiang, M. and Zhao, Q. "Beyond Average: Individualized Visual Scanpath Prediction", 2024.
Chen, Shi and Jiang, Ming and Zhao, Qi "What Do Deep Saliency Models Learn about Visual Attention?", 2023.
Chen, S. and Zhao, Q. "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", IEEE Conference on Computer Vision and Pattern Recognition, 2023.
Chen, Xianyu and Yang, Jinhui and Chen, Shi and Wang, Louis and Jiang, Ming and Zhao, Qi "Every Problem, Every Step, All in Focus: Learning to Solve Vision-Language Problems With Integrated Attention", IEEE Transactions on Pattern Analysis and Machine Intelligence, v.46, 2024. https://doi.org/10.1109/TPAMI.2024.3357631
Luo, Yan and Wong, Yongkang and Kankanhalli, Mohan and Zhao, Qi "Learning to Predict Gradients for Semi-Supervised Continual Learning", IEEE Transactions on Neural Networks and Learning Systems, 2024. https://doi.org/10.1109/TNNLS.2024.3361375
Yang, J. and Chen, X. and Jiang, M. and Chen, S. and Wang, L. and Zhao, Q. "VisualHow: Multimodal Problem Solving", IEEE Conference on Computer Vision and Pattern Recognition, 2022. https://doi.org/10.1109/CVPR52688.2022.01518
Zhang, Yifeng and Chen, Shi and Zhao, Qi "Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge", 2023. https://doi.org/10.1109/ICCV51070.2023.00243
