Award Abstract # 1815948
III: Small: Collaborative Research: Explainable Natural Language Inference

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF ARIZONA
Initial Amendment Date: July 27, 2018
Latest Amendment Date: June 22, 2020
Award Number: 1815948
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
(703)292-7347
IIS: Division of Information & Intelligent Systems
CSE: Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: August 31, 2023 (Estimated)
Total Intended Award Amount: $254,463.00
Total Awarded Amount to Date: $262,463.00
Funds Obligated to Date: FY 2018 = $254,463.00
FY 2020 = $8,000.00
History of Investigator:
  • Peter Jansen (Principal Investigator)
    pajansen@email.arizona.edu
  • Mihai Surdeanu (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Arizona
845 N PARK AVE RM 538
TUCSON
AZ  US  85721
(520)626-6000
Sponsor Congressional District: 07
Primary Place of Performance: University of Arizona
AZ US 85721-0001
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): ED44Y3W6P7B9
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVITIES
01002021DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 075Z, 7364, 7923, 9251
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Natural language inference (NLI) can support decision-making using information contained in natural language texts (e.g., detecting undiagnosed medical conditions in medical records, finding alternate treatments from the scientific literature). This requires gathering facts extracted from text and reasoning over them. Current automated solutions for NLI are largely incapable of producing explanations for their inferences, but this capacity is essential for users to trust their reasoning in domains such as scientific discovery and medicine, where the cost of making errors is high. This project develops natural language inference methods that are both accurate and explainable. They are accurate because they build on state-of-the-art deep learning frameworks, which use powerful, automatically learned representations of text. They are explainable because they aggregate information in units that can be represented both in a human-readable explanation and in a machine-usable vector representation. This project will advance methods in explainable natural language inference to enable the application of automated inference methods in critical domains such as medical knowledge extraction. The project will also evaluate the explainability of the inference decisions in collaboration with domain experts.

This project reframes natural language inference as the task of constructing and reasoning over explanations. In particular, inference assembles smaller component facts into a graph (an explanation graph) that it reasons over to make decisions. In this view, generating explanations is an integral part of the inference process rather than a separate post-hoc mechanism. The project has three main goals: (a) develop multi-agent reinforcement learning models that can effectively and efficiently explore the space of explanation graphs, (b) develop deep-learning-based aggregation mechanisms that prevent inference from combining semantically incompatible evidence, and (c) build a continuum of hypergraph-based text representations that combine discrete forms of structured knowledge with their continuous embedding-based representations. The techniques will be evaluated on three application domains: complex question answering, medical relation extraction, and clinical event detection from medical records. The results of the project will be disseminated through the project website and scholarly venues, and the software and datasets will be made available to the public.
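As a rough illustration of the explanation-graph idea, the Python sketch below treats component facts as nodes and links any two facts that share content words. The class name and the overlap rule are illustrative only and are not part of the project's software; the project's methods explore and score such graphs with learned models rather than a simple overlap heuristic.

from dataclasses import dataclass, field
from itertools import combinations

STOPWORDS = {"a", "an", "the", "of", "is", "are", "at", "to", "that", "in", "on"}

def content_words(text: str) -> set:
    return {w for w in text.lower().split() if w not in STOPWORDS}

@dataclass
class ExplanationGraph:
    facts: list
    edges: set = field(default_factory=set)

    def connect_overlapping(self):
        # Link any two facts that share at least one content word.
        for i, j in combinations(range(len(self.facts)), 2):
            if content_words(self.facts[i]) & content_words(self.facts[j]):
                self.edges.add((i, j))

graph = ExplanationGraph(facts=[
    "a metal pot is a good thermal conductor",   # fact 0
    "a stove is a source of heat",               # fact 1
    "heat increases the temperature of water",   # fact 2
    "water boils at 100 degrees celsius",        # fact 3
])
graph.connect_overlapping()
print(graph.edges)  # {(1, 2), (2, 3)}: "heat" links 1-2, "water" links 2-3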

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 20)
Dalvi, Bhavana and Jansen, Peter and Tafjord, Oyvind and Xie, Zhengnan and Smith, Hannah and Pipatanangkura, Leighanna and Clark, Peter. "Explaining Answers with Entailment Trees." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. https://doi.org/10.18653/v1/2021.emnlp-main.585
Jansen, Peter. "A Systematic Survey of Text Worlds as Embodied Natural Language Environments." Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022), 2022. https://doi.org/10.18653/v1/2022.wordplay-1.1
Jansen, Peter. "CoSaTa: A Constraint Satisfaction Solver and Interpreted Language for Semi-Structured Tables of Sentences." 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.10
Jansen, Peter and Cote, Marc-alexandre. "TextWorldExpress: Simulating Text Games at One Million Steps Per Second." 2023. https://doi.org/10.18653/v1/2023.eacl-demo.20
Jansen, Peter and Smith, Kelly J. and Moreno, Dan and Ortiz, Huitzilin. "On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. https://doi.org/10.18653/v1/2021.emnlp-main.596
Jansen, Peter and Thayaparan, Mokanarangan and Valentino, Marco and Ustalov, Dmitry. "TextGraphs 2021 Shared Task on Multi-Hop Inference for Explanation Regeneration." Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15), 2021. https://doi.org/10.18653/v1/2021.textgraphs-1.17
Jansen, Peter and Ustalov, Dmitry. "TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration." 2019. https://doi.org/10.18653/v1/D19-5309
Jansen, Peter and Ustalov, Dmitry. "TextGraphs 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration." Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 2020. https://doi.org/10.18653/v1/2020.textgraphs-1.10
Liang, Zhengzhong and Bethard, Steven and Surdeanu, Mihai. "Explainable Multi-hop Verbal Reasoning Through Internal Monologue." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. https://doi.org/10.18653/v1/2021.naacl-main.97
Smith, Hannah and Zhang, Zeyu and Culnan, John and Jansen, Peter. "ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition." 2020.
Thiem, Sebastian and Jansen, Peter. "Extracting Common Inference Patterns from Semi-Structured Explanations." 2019. https://doi.org/10.18653/v1/D19-6006

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project (Explainable Natural Language Inference) broadly aimed to improve the ability of artificial intelligence systems to correctly answer questions posed by humans while also producing explanations for why the system believes its answers are correct. The ability to explain its reasoning greatly improves the usefulness of such a system.


The work on this award made scientific contributions in four areas: designing algorithms that answer questions and build explanations, building data that helps AI models learn to construct explanations automatically, designing new representations of explanations that make inference easier, and designing new ways of measuring a system's ability to reason. More specifically:


Designing new algorithms that build explanations:

This award developed new methods and algorithms for constructing explanations for answers to questions, centrally by combining multiple smaller facts into larger explanatory wholes. One of the main topic areas studied was elementary- and middle-school-level scientific reasoning, where a student might need to combine several facts (for example, that water has a boiling point of 100 degrees Celsius, that a metal pot is a good thermal conductor, that a stove is a source of heat, and that heated objects increase in temperature) to answer a question about why water boiled in a pot on a stove. The project examined a large number of existing and modified algorithms, drawn from deep learning, symbolic learning, reinforcement learning, and other approaches, for their ability to perform this kind of explanatory inference.
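As a concrete but deliberately simplified illustration of combining facts into an explanation, the Python sketch below greedily chains facts that share words with the question and with the facts already selected. It is a toy baseline written for this summary only; the algorithms studied under the award are far more sophisticated.

STOPWORDS = {"a", "an", "the", "of", "is", "are", "to", "that", "in", "on", "why", "did"}

def words(text: str) -> set:
    return {w.strip("?.,") for w in text.lower().split()} - STOPWORDS

def build_explanation(question: str, facts: list, hops: int = 3) -> list:
    chosen = []
    covered = words(question)
    remaining = list(facts)
    for _ in range(hops):
        # Score each remaining fact by overlap with the question and chosen facts.
        scored = sorted(remaining, key=lambda f: len(words(f) & covered), reverse=True)
        if not scored or not words(scored[0]) & covered:
            break
        best = scored[0]
        chosen.append(best)
        covered |= words(best)
        remaining.remove(best)
    return chosen

facts = [
    "a stove is a source of heat",
    "heat increases the temperature of water",
    "water boils at 100 degrees celsius",
    "a bicycle has two wheels",
]
print(build_explanation("Why did the water on the stove boil?", facts))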


Building data to help AI models learn to build better explanations:

Contemporary artificial intelligence systems typically work by "training" a model on a large number of high-quality examples of a task, and then evaluating its ability to correctly perform that task on a separate set of "testing" data that was hidden during training. Prior to this award, essentially no significant data existed for training or evaluating systems that build large explanations by combining facts together. During this award, a large amount of data for training and evaluating artificial reasoning and explanation construction was produced.

Counter-intuitively, machines often do not learn best from free-form text; they frequently do better when the data has a clear, logical structure that makes the task easier to perform. We developed a new formalism called "Entailment Trees" that represents explanations as a tree. This representation makes it easier for machines to learn to perform reasoning because it breaks large reasoning problems into smaller steps and makes explicit which facts are required at each step and exactly what the model should infer from them.

Finally, it is sometimes very hard to know how well an AI system performs at a task because it may be tested on very large amounts of data (for example, thousands of questions every day), which is too expensive for humans to rate manually. Scientists therefore develop automatic evaluation methods that, while not as good as human evaluation, tell us enough about average performance to be useful for monitoring day-to-day progress. In this work, we showed that some existing automatic evaluation methods underestimate how well some systems generate explanations, and we developed new evaluation methods that experiments show are better.
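The entailment-tree structure described above can be pictured with a small sketch: leaf nodes hold individual facts, each internal node states an intermediate conclusion that should follow from its children, and the root holds the final answer. The Python below illustrates only that shape; it is not the data format actually released by the project.

from dataclasses import dataclass, field

@dataclass
class EntailmentNode:
    statement: str
    children: list = field(default_factory=list)

    def pretty(self, depth: int = 0) -> str:
        # Indent each level so the tree structure is visible when printed.
        lines = ["  " * depth + self.statement]
        for child in self.children:
            lines.append(child.pretty(depth + 1))
        return "\n".join(lines)

tree = EntailmentNode(
    "the water in the pot on the stove boiled",
    children=[
        EntailmentNode(
            "the water in the pot reached 100 degrees celsius",
            children=[
                EntailmentNode("a stove is a source of heat"),
                EntailmentNode("a metal pot is a good thermal conductor"),
                EntailmentNode("heating water raises its temperature"),
            ],
        ),
        EntailmentNode("water boils at 100 degrees celsius"),
    ],
)
print(tree.pretty())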


Designing new ways of measuring a system's ability to perform reasoning:

One of the surprising results of this work is that, while the systems that we and others developed rapidly improved at correctly answering questions (and building explanations) over the years of this award, that knowledge and reasoning tended to be very brittle. For example, large language models eventually scored near an "A" grade on elementary science exams, a major scientific achievement that seemed to suggest AI systems understood science at the level of a 5th grader. To probe this, we built a virtual environment (similar to a game) that AI models can interact with, simulating much of the same content found on science exams (such as how to boil water, build simple electronic circuits, or understand basic properties of genetics). AI systems that scored extremely well on written tests were unable to demonstrate the same knowledge when tested interactively, suggesting that the models knew much less than originally thought. This has spawned a great deal of work on interactive virtual environments that train and evaluate the multi-step reasoning capabilities of language models, so that they can begin to perform detailed multi-step processes that are helpful for tasks that matter to humans.
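The interactive style of evaluation described above boils down to a simple loop: the model reads a text observation, issues an action in natural language, and receives a score reflecting task progress. The sketch below uses a made-up toy environment and a random agent purely to illustrate that loop; it is not the API of the simulators built under this award.

import random

class ToyKitchenEnv:
    """A stand-in text environment with a single 'boil water' task."""
    def __init__(self):
        self.done_steps = set()

    def reset(self) -> str:
        self.done_steps.clear()
        return "You are in a kitchen. There is a pot, a sink, and a stove."

    def step(self, action: str):
        required = ["fill pot with water", "put pot on stove", "turn on stove"]
        if action in required:
            self.done_steps.add(action)
        done = len(self.done_steps) == len(required)
        reward = len(self.done_steps) / len(required)
        obs = "The water is boiling." if done else "Nothing obvious happens."
        return obs, reward, done

def random_agent(observation: str) -> str:
    # A real evaluation would query a language model here instead.
    return random.choice(["fill pot with water", "put pot on stove",
                          "turn on stove", "open fridge"])

env = ToyKitchenEnv()
obs = env.reset()
for _ in range(20):
    obs, score, done = env.step(random_agent(obs))
    if done:
        break
print(f"final score: {score:.2f}")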


Last Modified: 07/27/2024
Modified by: Peter A Jansen

