
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: | |
Initial Amendment Date: | June 25, 2018 |
Latest Amendment Date: | June 5, 2023 |
Award Number: | 1816039 |
Award Instrument: | Standard Grant |
Program Manager: | Jie Yang, jyang@nsf.gov, (703) 292-4768, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2018 |
End Date: | July 31, 2024 (Estimated) |
Total Intended Award Amount: | $499,999.00 |
Total Awarded Amount to Date: | $515,999.00 |
Funds Obligated to Date: | FY 2019 = $16,000.00 |
History of Investigator: | |
Recipient Sponsored Research Office: | 660 S MILL AVENUE STE 204, TEMPE, AZ, US 85281-3670, (480) 965-5479 |
Sponsor Congressional District: | |
Primary Place of Performance: | 699 S. Mill Avenue, Tempe, AZ, US 85281-3673 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Robust Intelligence |
Primary Program Source: | 01001920DB NSF RESEARCH & RELATED ACTIVITIES |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Understanding visual and textual inputs is an important aspect of Artificial Intelligence systems. Often such inputs are presented together to instruct and explain. For example, an intelligent robot might learn about its tasks and environment by observing both language and gesture, and an intelligent system addressing scientific questions must interpret figures and diagrams along with text. While there has been a great deal of research on visual understanding and on textual understanding in isolation, very little research addresses them jointly. This project is developing a framework for answering hard questions about combined visual and textual inputs and for providing supporting explanations. By developing a system that integrates visual and linguistic information for this task, the project could provide the basis for automated tutoring systems in K-12 education and interpretable interfaces for workers operating intelligent machines.
The project will employ an integrated approach of deep model-based visual recognition, natural language processing, and knowledge representation and reasoning to develop a question answering engine and its components. It will create a challenge corpus containing visual and textual inputs together with questions about those inputs posed in natural language, and it will provide a baseline for semantic image and text parsing and for reasoning-based question answering systems. It will develop semantic parsing of non-continuous text items, such as figures, diagrams, and graphs, and extend semantic parsing to various formats of natural language text and questions. It will also develop methods to acquire knowledge and reason with it to answer questions and explain the answers. Together, these contributions will advance Artificial General Intelligence and allow future service robots and personal mobile applications to understand combined visual and textual inputs. The findings from this project will advance the development of knowledge-driven, reasoning-based question answering by filling the current gap in how to efficiently conduct explainable probabilistic reasoning over deep models, which helps to overcome the fragility of trained visual and textual understanding models. The project will also uncover the intrinsic connections between deep model-based vision and language understanding algorithms and probabilistic knowledge representation and reasoning by exploring a joint solution for answering hard questions. In general, this project may result in advances in multiple sub-fields of Artificial Intelligence, namely computer vision, natural language processing, and question answering, and may impact others such as robotics.
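To illustrate the modular design described above (visual and textual parsers producing symbolic facts, followed by a knowledge-based reasoning step that returns an answer with a supporting explanation), the following is a minimal, hypothetical Python sketch. The parsers are stubbed with hand-written facts, and every predicate, rule, class name, and file name is an illustrative placeholder, not part of the project's actual system.

# Minimal sketch of a modular visuo-linguistic QA pipeline.
# Image and text parsers are stubbed; all facts and rules are placeholders.

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    predicate: str
    args: tuple


def parse_image(image_path: str) -> set[Fact]:
    # Stand-in for a deep-model-based visual parser that would map a
    # figure or diagram to symbolic facts about the scene.
    return {Fact("shows", ("diagram1", "lever")),
            Fact("label", ("lever", "force_arrow_at_end"))}


def parse_text(passage: str) -> set[Fact]:
    # Stand-in for a semantic parser that would map the passage to facts.
    return {Fact("states", ("passage1", "longer_arm_means_less_force"))}


BACKGROUND_RULES = [
    # (required facts, derived fact) -- a toy stand-in for a knowledge base.
    (
        {Fact("label", ("lever", "force_arrow_at_end")),
         Fact("states", ("passage1", "longer_arm_means_less_force"))},
        Fact("answer", ("q1", "less force is needed at the end of the lever")),
    ),
]


def answer(question_id: str, facts: set[Fact]) -> tuple[str, list[Fact]]:
    # Apply each rule whose premises hold; return the answer together with
    # the supporting facts as a simple explanation.
    for premises, conclusion in BACKGROUND_RULES:
        if conclusion.args[0] == question_id and premises <= facts:
            return conclusion.args[1], sorted(premises, key=str)
    return "unknown", []


if __name__ == "__main__":
    facts = parse_image("diagram1.png") | parse_text("passage about levers")
    ans, support = answer("q1", facts)
    print("Answer:", ans)
    print("Supported by:", [f.predicate for f in support])

The point of the sketch is only the division of labor: neither parser alone derives the answer, and the reasoning step can report which visual and textual facts it relied on.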
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project, which started in 2018, was about the joint understanding of text and images. Until then, most datasets treated the vision and language modalities separately. Inspired by the "comprehension" part of the international PISA test administered to high school students, whose evaluation of comprehension includes understanding text and images together, the PIs proposed and developed the VLQA (visuo-linguistic question answering) dataset and published it in Findings of EMNLP 2020. A key aspect of the data items in this dataset was that, to answer a question, one had to understand both an image and the associated text; understanding just one or the other was not sufficient. With progress in vision-language models, this aspect has finally caught on and is now referred to as multi-modal understanding or multi-modal question answering, and many other datasets have been developed by others, some referring to our initial work.
After the initial VLQA dataset, the PIs recognized the importance of the interaction between objects in an image, the actions that may happen, and the impact of those actions on the objects, and created the CLEVR_HYP dataset, which uses images from CLEVR together with text describing hypothetical action sequences that may take place, and which requires joint understanding of both. This work, first published at NAACL 2021, can be considered a precursor of "world models," which are now becoming a popular research trend. Thus, the results of this project play an important role in two major future research trends with implications for AI, Computer Science, and the broader research community: multi-modal understanding and world models.
This project involved, at different times, more than six PhD students (two of them women) and four undergraduate students (two of them women). One of the PhD students is now a tenure-track faculty member. The project enriched graduate courses in NLP and Vision at ASU, and it led to other projects through which the PIs were able to build a GPU server infrastructure that is playing a big role in the overall research of the two PIs.
Last Modified: 12/15/2024
Modified by: Chitta R Baral