
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | May 14, 2020 |
Latest Amendment Date: | June 21, 2021 |
Award Number: | 1948322 |
Award Instrument: | Standard Grant |
Program Manager: |
Hector Munoz-Avila
hmunoz@nsf.gov (703)292-4481 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | May 15, 2020 |
End Date: | April 30, 2023 (Estimated) |
Total Intended Award Amount: | $174,332.00 |
Total Awarded Amount to Date: | $190,332.00 |
Funds Obligated to Date: |
FY 2021 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
107 S INDIANA AVE BLOOMINGTON IN US 47405-7000 (317)278-3473 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
535 W Michigan St., IT475 Indianapolis IN US 46202-6151 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Text mining has made important advances in methods to convert vast, unstructured text data into knowledge. However, the current paradigm of relationship extraction has one major limitation: it models snapshots of information but fails to capture the fundamentally dialogic and dynamic nature of knowledge, including conflicting findings, inconsistent discoveries, refutations, contradictions, reinforcements, and confirmations, all of which change over time. This project aims to capture such fundamental dynamics of knowledge, focusing specifically on causal relationships. Numerous articles, including academic articles, present knowledge and relationships that express causality, but such relationships are not static and can change over time as conditions change. The objective of this project is to identify cues of causal knowledge in text data, quantify the strength of the causal relationship, and model its dynamics under changing conditions. Ultimately, the project aims to model a more holistic view of the knowledge extracted from text. Because text data is used extensively by researchers and practitioners in domains of national importance, including medicine and health, economics, public policy, and journalism, the results of this project seek to provide a foundation that offers practitioners new ways to understand the evolving nature of the causal relationships present in large text datasets. Specifically, the novel approaches developed in the project will be applied to public health data to determine how changing climatic, political, and economic conditions may affect the mental and physical health of the population in different geographic areas.
In addition, the project includes several educational activities: emerging and related topics from the project will be included in the curricula of various courses in the applied data science master's program; undergraduate research will be promoted, specifically by recruiting students from underrepresented and economically disadvantaged communities to work on the project; and a research workshop will be organized to encourage high school students to participate in STEM research.
The project activities include the development of a novel model of causal relationship extraction that leverages a unified deep learning framework combining both semantic and syntactic cues. This approach will utilize the key syntactic features of a sentence, represented by the grammatical relationships between nouns, verbs, and other parts of speech, through graphical or tree-like models. This work will determine whether the sentence has a structure that signals causality. Moreover, the sequential component of the model will utilize the semantics and identify the influence of certain words in the sentence to characterize the nature of the causal relationship expressed in the text. This task will capture the strength of the relationship (e.g., using cues like "extremely likely" or "definitely"), any supporting or opposing evidence (e.g., "will lead to" or "does not lead to"), and conditional cues (e.g., "in the presence of"). Quantifying such qualitative properties will lead to the second innovation of this project: causal distance. Causal distance is a time-variant metric that will denote the magnitude of causality between two entities and capture the dynamism of the relationship by updating over time as conditions change or new evidence appears. Collectively, the advances pursued in this project will further enhance our understanding of the novel computational approaches needed to unearth and reason about cues of causal relationships embedded in large text datasets. The outcomes of this project, such as datasets, source code, final software, results, and publications, will be shared via publicly accessible URLs and online code repositories. Additionally, all project resources and outcomes will be made available on the project website.
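To make the three kinds of cues named above concrete, the following is a minimal sketch of a lexicon matcher that flags strength, supporting or opposing, and conditional cues in a sentence. It is not the deep learning model the project proposes. The cue phrases are the examples quoted in the abstract, while the function name `find_cues`, the `CausalCues` record, the numeric strength weights, and the default strength of 0.5 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Cue phrases are taken from the abstract's examples.
# The numeric weights are illustrative assumptions, not project values.
STRENGTH_CUES = {"extremely likely": 0.9, "definitely": 1.0}
SUPPORT_CUES = ("will lead to",)
OPPOSE_CUES = ("does not lead to",)
CONDITION_CUES = ("in the presence of",)

DEFAULT_STRENGTH = 0.5  # assumed value when no strength cue is present


@dataclass
class CausalCues:
    polarity: int      # +1 supporting, -1 opposing, 0 no causal cue found
    strength: float    # illustrative weight in [0, 1]
    conditional: bool  # True if a conditional cue is present


def find_cues(sentence: str) -> CausalCues:
    """Flag the cue phrases present in a sentence (case-insensitive)."""
    text = sentence.lower()

    # Opposing cues are checked first so a negated phrase is never
    # mistaken for a supporting one.
    if any(cue in text for cue in OPPOSE_CUES):
        polarity = -1
    elif any(cue in text for cue in SUPPORT_CUES):
        polarity = 1
    else:
        polarity = 0

    # Use the strongest matching strength cue, or the assumed default.
    strength = max(
        (weight for cue, weight in STRENGTH_CUES.items() if cue in text),
        default=DEFAULT_STRENGTH,
    )

    conditional = any(cue in text for cue in CONDITION_CUES)
    return CausalCues(polarity, strength, conditional)


if __name__ == "__main__":
    print(find_cues("Prolonged drought will definitely lead to crop loss."))
    print(find_cues("In the presence of aid, drought does not lead to famine."))
```

A fixed lexicon like this only illustrates what the cues look like in text; the abstract's proposed model is meant to learn such signals from syntactic structure and word influence rather than from a hand-written list.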
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
In this project, we sought novel ways of extracting causal knowledge from text. Our aim was to gain a deeper understanding of how causality is expressed and embedded in text, which in turn helps in extracting such knowledge. In addition, our goal was to capture the dynamism exhibited by these relationships.
The extraction models developed as part of this project have two important accomplishments: first, they exhibit improved performance compared to existing methods on established benchmark datasets; second, they introduce unsupervised approaches that can analyze natural language text without the need for annotated datasets, expanding their applicability. To capture the dynamism, we introduced the Pointwise Causal Information metric, which offers a continuous real-valued measurement of causal relationship strength. This metric quantifies causality with greater nuance and precision than previous binary classifications. Our efforts extend to a new annotation scheme, producing more comprehensive datasets that facilitate the extraction of complex relationships. These enriched datasets address intricate causal relationships, including contradiction, conditional, temporal, transitive, and triangular causality. Our exploration of Large Language Models (LLMs) has uncovered insights into how causality is embedded within their parameters, and our evaluation of LLMs has yielded valuable findings about their causal knowledge and understanding capabilities. Leveraging this knowledge, we have developed a transfer learning-based model capable of extracting causality without extensive training on specific data sources. This transfer learning technique has enabled models trained on a diverse set of causal sentences to be used on an out-of-domain dataset with minimal supervision. Applying this model to biomedical literature has led to significant results, detecting latent factors of diseases such as symptoms, risk factors, and associated conditions. Our contributions extend beyond technical innovations: we have generated a new dataset to foster further advancements in causality extraction from biomedical literature.
By refining causality detection and quantification, we are advancing NLP applications involving sequence tagging and semantic relationships. Our novel frameworks and approaches are poised to elevate the state-of-the-art in causality detection, addressing previously overlooked scenarios. They are likely to improve the applicability of causality in different domains. The real-valued Pointwise Causal Information metric will strengthen causality applications in domains such as medicine, public health, and economics.
The significance of our work is recognized through four peer-reviewed publications and presentations at prestigious research venues. Regarding mentorship and collaboration, we have supported and funded numerous researchers at different academic levels, including two doctoral students, four graduate research assistants, two REU undergraduate researchers, and three high school participants.
In summary, our collective efforts have made remarkable strides in understanding, detecting, and applying causal relationships. These contributions transcend technical domains, impacting fields as diverse as medicine, public health, economics, and social sciences.
Last Modified: 08/30/2023
Modified by: Sunandan Chakraborty