Award Abstract # 1816039
RI: Small: A Cognitive Framework for Technical, Hard and Explainable Question Answering (THE-QA) with respect to Combined Textual and Visual Inputs

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: ARIZONA STATE UNIVERSITY
Initial Amendment Date: June 25, 2018
Latest Amendment Date: June 5, 2023
Award Number: 1816039
Award Instrument: Standard Grant
Program Manager: Jie Yang
jyang@nsf.gov
 (703)292-4768
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2018
End Date: July 31, 2024 (Estimated)
Total Intended Award Amount: $499,999.00
Total Awarded Amount to Date: $515,999.00
Funds Obligated to Date: FY 2018 = $499,999.00
FY 2019 = $16,000.00
History of Investigator:
  • Chitta Baral (Principal Investigator)
    chitta@asu.edu
  • Yezhou Yang (Co-Principal Investigator)
Recipient Sponsored Research Office: Arizona State University
660 S MILL AVENUE STE 204
TEMPE
AZ  US  85281-3670
(480)965-5479
Sponsor Congressional District: 04
Primary Place of Performance: Arizona State University
699 S. Mill Avenue
Tempe
AZ  US  85281-3673
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): NTLHJXM55KZ6
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVITIES
01001920DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7923, 7495, 9251
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Understanding visual and textual inputs is an important aspect of Artificial Intelligence systems. Often such inputs are presented together to instruct and explain. For example, an intelligent robot might learn about its tasks and environment by observing both language and gesture, and an intelligent system addressing scientific questions must interpret figures and diagrams along with text. While there has been substantial research on visual understanding and textual understanding in isolation, very little research addresses them jointly. This project is developing a framework for answering hard questions about combined visual and textual inputs, and for providing supporting explanations. By developing a system that integrates visual and linguistic information for this task, the project could provide the basis for automated tutoring systems in K-12 education and interpretable interfaces for workers operating intelligent machines.

The project will employ an integrated approach combining deep model-based visual recognition, natural language processing, and knowledge representation and reasoning to develop a question answering engine and its components. It will create a challenge corpus that contains visual and textual inputs together with natural language questions about those inputs. It will provide a baseline for semantic image and text parsing and for reasoning-based question answering systems. It will develop semantic parsing of non-continuous text items, such as figures, diagrams, and graphs. It will extend semantic parsing to various formats of natural language text and questions. It will develop methods to acquire knowledge and reason with it to answer questions and provide explanations for the answers. Together, these contributions will advance Artificial General Intelligence and allow future service robots and personal mobile applications to understand combined visual and textual inputs. The findings from this project will advance the development of knowledge-driven, reasoning-based question answering by filling the current gap in how to efficiently conduct explainable probabilistic reasoning over deep models. This will help overcome the fragility of trained visual and textual understanding models. It will also uncover the intrinsic connections between deep model-based vision and language understanding algorithms and probabilistic knowledge representation and reasoning by exploring a joint solution for answering hard questions. In general, this project may result in advances in multiple sub-fields of Artificial Intelligence, namely computer vision, natural language processing, and question answering, and may impact others such as robotics.
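To make the envisioned challenge corpus concrete, the minimal sketch below shows one possible shape for a single corpus item: a visual input, an accompanying text passage, a natural language question, an answer, and a supporting explanation. The class name THEQAItem, the field names, and the toy example are illustrative assumptions for exposition, not the project's actual data format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class THEQAItem:
    """One hypothetical item of a combined visual-textual QA corpus.

    Field names are illustrative assumptions; the corpus format defined
    by the project may differ.
    """
    image_path: str    # visual input (e.g., a figure or diagram)
    passage: str       # accompanying textual input
    question: str      # natural language question about both inputs
    answer: str        # expected answer
    explanation: List[str] = field(default_factory=list)  # supporting reasoning steps

# A toy example in the spirit of combined visual-textual question answering:
item = THEQAItem(
    image_path="diagrams/pulley_system.png",
    passage="The figure shows two masses connected by a rope over a frictionless pulley.",
    question="Which mass accelerates downward when the system is released?",
    answer="The heavier mass",
    explanation=[
        "The diagram identifies the two masses and their values.",
        "The text states the pulley is frictionless, so motion depends only on the weight difference.",
    ],
)

The point of the sketch is that neither the image nor the passage alone suffices to answer the question; both must be parsed and reasoned over jointly, and the explanation field records the supporting reasoning steps.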

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Banerjee, Pratyay and Gokhale, Tejas and Yang, Yezhou and Baral, Chitta "Weakly Supervised Relative Spatial Reasoning for Visual Question Answering" 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , 2021 https://doi.org/10.1109/ICCV48922.2021.00192
Banerjee, Pratyay and Gokhale, Tejas and Yang, Yezhou and Baral, Chitta "WeaQA: Weak Supervision via Captions for Visual Question Answering" Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , 2021 https://doi.org/10.18653/v1/2021.findings-acl.302
Fang, Zhiyuan and Gokhale, Tejas and Banerjee, Pratyay and Baral, Chitta and Yang, Yezhou "Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning" Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2020 https://doi.org/10.18653/v1/2020.emnlp-main.61
Gokhale, Tejas and Banerjee, Pratyay and Baral, Chitta and Yang, Yezhou "MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering" Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2020 https://doi.org/10.18653/v1/2020.emnlp-main.63
Luo, Man and Fang, Zhiyuan and Gokhale, Tejas and Yang, Yezhou and Baral, Chitta "End-to-end Knowledge Retrieval with Multi-modal Queries" 61st Annual Meeting of the Association for Computational Linguistics , v.1 , 2023
Luo, Man and Sampat, Shailaja Keyur and Tallman, Riley and Zeng, Yankai and Vancha, Manuha and Sajja, Akarshan and Baral, Chitta "Just because you are right, doesn't mean I am wrong: Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasks" Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , 2021 https://doi.org/10.18653/v1/2021.eacl-main.240
Luo, Man and Zeng, Yankai and Banerjee, Pratyay and Baral, Chitta "Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering" Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , 2021 https://doi.org/10.18653/v1/2021.emnlp-main.517
Sampat, Shailaja and Banerjee, Pratyay and Yang, Yezhou and Baral, Chitta "Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task" Findings of the Association for Computational Linguistics: EMNLP 2022 , 2022
Sampat, Shailaja Keyur and Kumar, Akshay and Yang, Yezhou and Baral, Chitta "CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images" Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , 2021 https://doi.org/10.18653/v1/2021.naacl-main.289

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project, which started in 2018, was about the joint understanding of text and images. Until then, most datasets treated the vision and language modalities separately. Inspired by the "comprehension" part of the international PISA test administered to high school students, whose evaluation included understanding text and images together, the PIs proposed and developed the VLQA (visuo-linguistic question answering) dataset and published it in Findings of EMNLP 2020. A key aspect of the items in this dataset was that answering a question required understanding both an image and the associated text; understanding just one or the other was not sufficient. With progress in vision-language models, this aspect has finally caught on and is now referred to as multi-modal understanding or multi-modal question answering, and many other datasets have been developed by others, some referring to our initial work.

After the initial VLQA dataset, the PIs recognized the importance of the interactions between objects in an image, the actions that may happen, and the impact of those actions on the objects. They created the CLEVR_HYP dataset, which pairs images from CLEVR with text describing hypothetical action sequences that may take place and requires joint understanding of both. This work, first published in NAACL 2021, could be considered a precursor of "world models," which are now becoming a popular research trend. Thus, the research results of this project play an important role in two major future research trends with implications for AI, Computer Science, and the broader research community: multi-modal understanding and world models.

This project involved more than six PhD students (two female) and four undergraduate students (two female) at different times. One of the PhD students is now a tenure-track faculty member. The project enriched graduate courses in NLP and Vision at ASU, and it led to other projects through which the PIs were able to build a GPU server infrastructure that plays a significant role in their overall research.


Last Modified: 12/15/2024
Modified by: Chitta R Baral

