
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | May 31, 2017 |
Latest Amendment Date: | July 16, 2020 |
Award Number: | 1704532 |
Award Instrument: | Continuing Grant |
Program Manager: |
Sorin Draghici
sdraghic@nsf.gov (703)292-2232 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2017 |
End Date: | June 30, 2023 (Estimated) |
Total Intended Award Amount: | $400,000.00 |
Total Awarded Amount to Date: | $400,000.00 |
Funds Obligated to Date: |
FY 2018 = $103,167.00 FY 2020 = $0.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
IL US 61820-6235 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01001819DB NSF RESEARCH & RELATED ACTIVIT 01002021DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Scientific disciplines have been generating a huge volume of research publications, which are of tremendous value but far beyond researchers' capacity to digest and analyze. There is a critical need to automatically (with the help of widely available, general knowledge bases) transform research text into structured information networks on which advanced search and analytics tools can be developed to help researchers and practitioners quickly locate knowledge, make inferences, and even generate new scientific hypotheses.
This project aims to develop a new data-to-network-to-knowledge (D2N2K) paradigm that transforms massive, unstructured but interconnected research text data into actionable knowledge by integrating semi-structured and unstructured data. First, organized heterogeneous information networks (called StructNet) are constructed, and then powerful mining mechanisms on such organized networks are developed. With a focus on biomedical sciences, the project investigates the principles, methodologies and algorithms for (i) construction of relatively structured heterogeneous information networks (called MediNet) by mining biomedical research corpora via attribute extraction, relation typing, and claim mining, and (ii) exploration and mining of the networks so constructed via graph OLAP and task-guided embedding. The project develops an extensible framework to facilitate literature-based scientific research. The study on construction and exploration of MediNet not only impacts biomedical research but also consolidates the data-to-network-to-knowledge methodology, which can readily be transferred to other domains to automatically transform massive unstructured text data in those domains into structured and actionable knowledge.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The project, "NSF III: Medium: Collaborative Research: StructNet: Constructing and Mining Structure-Rich Information Networks for Scientific Research" (NSF IIS 17-04532 (07/01/2017--06/30/2023)), has achieved fruitful and high-impact results, with over 100 research papers in major research venues and numerous citations. The project has produced 15 PhD graduates and many M.S. and B.S. graduates. Six of the PhD graduates have become assistant professors at universities (including Georgia Tech, UCSD, Emory, Virginia Tech, Washington U. at St. Louis, and Univ. of Virginia), and two of them have received the ACM SIGKDD Dissertation Award Runner-Up. The research has also attracted a good number of female and minority students to join our team (for example, 6 of our current 12 PhD students are female), and the students have received several prominent awards from industry, including Google, Microsoft, and Amazon PhD Fellowships for PhD students and two Siebel Scholar Fellowships for M.S. students. The project has proceeded as planned: transforming massive, unstructured but interconnected data into actionable knowledge with a new, promising paradigm, data-to-network-to-knowledge, by integrating semi-structured and unstructured data, constructing organized heterogeneous information networks, and then developing powerful mining mechanisms on such organized networks. We have been working on principles, methodologies and algorithms for mining massive corpora and transforming text knowledge into relatively structured heterogeneous information networks.
Six selected papers (one/year) in the attached images demonstrate its intellectual merit and broad impact.
AutoPhrase: Automated Phrase Mining from Massive Text Corpora (2018). We develop a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. The research paper has been published as follows: Jingbo Shang, et al., "Automated Phrase Mining from Massive Text Corpora", IEEE TKDE (2018) (330+ citations). The software is open source on GitHub and has been used by many.
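The role of a general knowledge base in phrase mining can be illustrated with a minimal sketch. It assumes that knowledge-base titles act as positive labels for candidate phrases, so no human annotation is needed. The function names, toy documents, and toy title list are hypothetical, and this is not the AutoPhrase implementation.

```python
from collections import Counter


def candidate_ngrams(docs, max_len=3, min_count=2):
    """Count word n-grams (2 <= n <= max_len) and keep the frequent ones."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def distant_labels(candidates, kb_titles):
    """Split candidates into knowledge-base matches and unlabeled n-grams."""
    kb = {title.lower() for title in kb_titles}
    positives = sorted(g for g in candidates if g in kb)
    unlabeled = sorted(g for g in candidates if g not in kb)
    return positives, unlabeled


# Toy corpus and toy knowledge-base titles (assumptions for illustration).
docs = [
    "support vector machine models are popular",
    "we train a support vector machine on text",
    "text mining with a support vector machine",
]
kb_titles = ["Support vector machine", "Text mining"]

candidates = candidate_ngrams(docs)
positives, unlabeled = distant_labels(candidates, kb_titles)
print(positives)
print(unlabeled)
```

In this sketch the positives could train a phrase-quality classifier, while the unlabeled n-grams (such as "a support vector") are the candidates it would need to score.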
JoSE: A joint spherical embedding method for text embedding (2019). We propose a spherical generative model based on which unsupervised word and paragraph embeddings are jointly learned. To learn text embeddings in spherical space, we develop an efficient optimization algorithm with a convergence guarantee. Our model is highly efficient and achieves state-of-the-art performance on various text embedding tasks, including word similarity and document clustering. This leads to a paper: Yu Meng, et al., "Spherical Text Embedding," in NeurIPS'19.
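To make "spherical space" concrete, here is a minimal sketch in which every embedding is constrained to the unit sphere, so the dot product of two embeddings equals their cosine similarity. The toy vectors and the normalized-sum paragraph vector are assumptions for illustration only; JoSE learns word and paragraph embeddings jointly through its generative model, which this sketch does not reproduce.

```python
import math


def normalize(vec):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(u, v):
    """For unit vectors, the dot product is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 2-d word vectors, normalized onto the unit circle.
raw = {"gene": [3.0, 4.0], "protein": [4.0, 3.0], "galaxy": [-4.0, 3.0]}
words = {name: normalize(vec) for name, vec in raw.items()}

# Stand-in paragraph embedding: normalized sum of its word vectors.
summed = [a + b for a, b in zip(words["gene"], words["protein"])]
paragraph = normalize(summed)

print(cosine(words["gene"], words["protein"]))
print(cosine(paragraph, words["gene"]), cosine(paragraph, words["galaxy"]))
```

Because all vectors have unit length, similarity depends only on direction, which matches how word similarity and document clustering are typically evaluated.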
CatE: Category-Name Guided Text Embedding for Discriminative Topic Mining (2020). We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This novel category-name guided text embedding method for discriminative topic mining effectively leverages minimal user guidance to learn a discriminative embedding space and discover category-representative terms in an iterative manner. CatE mines a high-quality set of topics guided by category names only and benefits a variety of downstream applications, including weakly-supervised classification and lexical entailment direction identification. The study leads to a research paper: Yu Meng, et al., "Discriminative Topic Mining via Category-Name Guided Text Embedding", in WWW'20.
TaxoClass: A taxonomy-guided, hierarchical multi-label text classification method, which (1) calculates document-class similarities using a textual entailment model, (2) identifies a document's core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and (3) generalizes the classifier via multi-label self-training. Our experiments show TaxoClass uses only class names but outperforms the best previous method by 25%. The research paper has been published as follows: Jiaming Shen et al., "TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names", in NAACL-HLT'21.
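Steps (1) and (2) above can be sketched as a toy top-down search over a taxonomy. Everything here is a hypothetical stand-in: the taxonomy, the keyword lists, and especially `similarity`, which uses keyword overlap in place of the textual entailment model the method relies on. Classifier training and self-training (step 3) are omitted.

```python
# Hypothetical toy taxonomy: parent class -> child classes.
TAXONOMY = {
    "root": ["science", "sports"],
    "science": ["biology", "physics"],
    "sports": ["soccer", "tennis"],
}

# Hypothetical keywords per class, used only by the stand-in scorer.
KEYWORDS = {
    "science": ["research", "experiment"],
    "sports": ["match", "team"],
    "biology": ["gene", "cell"],
    "physics": ["quantum", "particle"],
    "soccer": ["goal", "league"],
    "tennis": ["serve", "set"],
}


def similarity(document, class_name, keywords):
    """Stand-in for an entailment score: share of class keywords present."""
    tokens = set(document.lower().split())
    words = keywords[class_name]
    return sum(word in tokens for word in words) / len(words)


def core_classes(document, keywords, threshold=0.5):
    """Walk the taxonomy top-down, keeping the best child above threshold."""
    selected, node = [], "root"
    while node in TAXONOMY:
        scored = [(similarity(document, c, keywords), c) for c in TAXONOMY[node]]
        score, best = max(scored)
        if score < threshold:
            break  # no confident class at this level; stop descending
        selected.append(best)
        node = best
    return selected


doc = "The experiment measured gene expression in each cell during the research"
print(core_classes(doc, KEYWORDS))
print(core_classes("weather is nice today", KEYWORDS))
```

The confident classes returned for each document could then serve as pseudo-labels for training a classifier, which is the part this sketch leaves out.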
EvMine: Unsupervised Key Event Detection from Massive Text Corpus. Real-world events have different granularities, from top-level themes to key events and then to event mentions corresponding to concrete actions. We propose a new task, key event detection at the intermediate level, which aims to detect key events from a news corpus (e.g., HK Airport Protest on Aug. 12-14), each happening at a particular time/location and focusing on the same topic. We develop an unsupervised key event detection framework, EvMine. Extensive experiments show EvMine outperforms all the baseline methods and its ablations on two real-world news corpora. The research paper has been published as follows: Yunyi Zhang, et al., "Unsupervised Key Event Detection from Massive Text Corpus", in KDD'22.
Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks, which performs contextualized text encoding and heterogeneous structure encoding in a unified model. Specifically, we inject heterogeneous structure information into each Transformer layer when encoding node texts. Meanwhile, Heterformer is capable of characterizing node/edge type heterogeneity and encoding nodes with or without texts. Comprehensive experiments show Heterformer outperforms competitive baselines significantly and consistently. The research paper has been published as follows: Bowen Jin, et al., "Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks", in KDD'23.
Last Modified: 10/25/2023
Modified by: Jiawei Han
Please report errors in award information by writing to: awardsearch@nsf.gov.