Award Abstract # 1704532
III: Medium: Collaborative Research: StructNet: Constructing and Mining Structure-Rich Information Networks for Scientific Research

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF ILLINOIS
Initial Amendment Date: May 31, 2017
Latest Amendment Date: July 16, 2020
Award Number: 1704532
Award Instrument: Continuing Grant
Program Manager: Sorin Draghici
sdraghic@nsf.gov
(703)292-2232
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2017
End Date: June 30, 2023 (Estimated)
Total Intended Award Amount: $400,000.00
Total Awarded Amount to Date: $400,000.00
Funds Obligated to Date: FY 2017 = $296,833.00
FY 2018 = $103,167.00
FY 2020 = $0.00
History of Investigator:
  • Jiawei Han (Principal Investigator)
    hanj@illinois.edu
Recipient Sponsored Research Office: University of Illinois at Urbana-Champaign
506 S WRIGHT ST
URBANA
IL  US  61801-3620
(217)333-2187
Sponsor Congressional District: 13
Primary Place of Performance: University of Illinois at Urbana-Champaign
IL  US  61820-6235
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): Y8CWNJRCNN91
Parent UEI: V2PHZ2CSCH63
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7924
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Scientific disciplines have been generating a huge volume of research publications, which is of tremendous value but far beyond researchers' capacity to digest and analyze. There is a critical need to automatically (with the help of widely available, general knowledge bases) transform research text into structured information networks on which advanced search and analytics tools can be developed to help researchers and practitioners quickly locate knowledge, make inferences, and even generate new scientific hypotheses.

This project aims to develop a new data-to-network-to-knowledge (D2N2K) paradigm that transforms massive, unstructured but interconnected research text data into actionable knowledge by integrating semi-structured and unstructured data. First, organized heterogeneous information networks (henceforth called StructNet) are constructed, and then powerful mining mechanisms on such organized networks are developed. With a focus on biomedical sciences, the project investigates the principles, methodologies and algorithms for (i) construction of relatively structured heterogeneous information networks (called MediNet) by mining biomedical research corpora via attribute extraction, relation typing, and claim mining, and (ii) exploration and mining of the networks so constructed via graph OLAP and task-guided embedding. The project develops an extensible framework to facilitate literature-based scientific research. The study of the construction and exploration of MediNet not only impacts biomedical research but also consolidates this data-to-network-to-knowledge methodology, which can be readily transferred to other domains for automatic transformation of massive unstructured text data in those domains into structured and actionable knowledge.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 76)
Yoon, Susik and Meng, Yu and Lee, Dongha and Han, Jiawei "SCStory: Self-supervised and Continual Online Story Discovery", 2023 https://doi.org/10.1145/3543507.3583507
Balepur, Nishant and Agarwal, Shivam and Venkat Ramanan, Karthik and Yoon, Susik and Yang, Diyi and Han, Jiawei "DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance", 2023 https://doi.org/10.18653/v1/2023.findings-acl.14
Dong, Xin Luna and He, Xiang and Kan, Andrey and Li, Xian and Liang, Yan and Ma, Jun and Xu, Yifan Ethan and Zhang, Chenwei and Zhao, Tong and Blanco Saldana, Gabriel and Deshpande, Saurabh and Michetti Manduca, Alexandre and Ren, Jay and Singh, Surender "AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types" KDD'20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, v.1, 2020 https://doi.org/10.1145/3394486.3403323
El-Kishky, Ahmed and Xu, Frank and Zhang, Aston and Han, Jiawei "Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings" 2019 IEEE International Conference on Big Data (Big Data), v.1, 2019 https://doi.org/10.1109/BigData47090.2019.9005957
Fouche, Edouard and Meng, Yu and Guo, Fang and Zhuang, Honglei and Bohm, Klemens and Han, Jiawei "Mining Text Outliers in Document Directories" ICDM'20: IEEE 2020 Int. Conf. on Data Mining, Nov. 2020, v.2020, 2020 https://doi.org/10.1109/ICDM50108.2020.00024
Ge, Suyu and Huang, Jiaxin and Meng, Yu and Han, Jiawei "FineSum: Target-Oriented, Fine-Grained Opinion Summarization", 2023 https://doi.org/10.1145/3539597.3570397
Gu, Xiaotao and Wang, Zihan and Bi, Zhenyu and Meng, Yu and Liu, Liyuan and Han, Jiawei and Shang, Jingbo "UCPhrase: Unsupervised Context-aware Quality Phrase Tagging" KDD'21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 14-18, 2021, v.2021, 2021 https://doi.org/10.1145/3447548.3467397
Huang, Jiaxin and Meng, Yu and Guo, Fang and Ji, Heng and Han, Jiawei "Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding" EMNLP'20: 2020 Conf. on Empirical Methods in Natural Language Processing, Nov. 2020, v.2020, 2020 https://doi.org/10.18653/v1/2020.emnlp-main.568
Huang, Jiaxin and Meng, Yu and Han, Jiawei "Few-Shot Fine-Grained Entity Typing with Automatic Label Interpretation and Instance Generation" KDD'22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 14-18, 2021, v.2022, 2022 https://doi.org/10.1145/3534678.3539443
Huang, Jiaxin and Xie, Yiqing and Meng, Yu and Zhang, Yunyi and Han, Jiawei "CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring" KDD'20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, v.1, 2020 https://doi.org/10.1145/3394486.3403244
Jiang, Meng and Shang, Jingbo and Cassidy, Taylor and Ren, Xiang and Kaplan, Lance M. and Hanratty, Timothy P. and Han, Jiawei "MetaPAD: Meta Pattern Discovery from Massive Text Corpora" Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, v.23, 2017 https://doi.org/10.1145/3097983.3098105

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The project, "NSF III: Medium: Collaborative Research: StructNet: Constructing and Mining Structure-Rich Information Networks for Scientific Research" (NSF IIS 17-04532, 07/01/2017--06/30/2023), has achieved fruitful and high-impact results, with over 100 research papers in major research venues and numerous citations. The project has produced 15 PhD graduates and many Master's and B.S. graduates. Six of the PhD graduates have become assistant professors at universities (including Georgia Tech, UCSD, Emory, Virginia Tech, Washington U. in St. Louis, and Univ. of Virginia), and two of them have received the ACM SIGKDD Dissertation Award Runner-Up. The research has also attracted a good number of female and minority students to join our team (for example, 6 of our current 12 PhD students are female), and the students have received several prominent industry awards, including Google, Microsoft, and Amazon PhD Fellowships for PhD students and two Siebel Scholar Fellowships for MS students.

The project has proceeded in the direction planned: transforming massive, unstructured but interconnected data into actionable knowledge with a new, promising paradigm, data-to-network-to-knowledge, by integrating semi-structured and unstructured data, constructing organized heterogeneous information networks, and then developing powerful mining mechanisms on such organized networks. We have been working on principles, methodologies and algorithms for mining massive corpora and transforming text knowledge into relatively structured heterogeneous information networks.

Six selected papers (one per year) in the attached images demonstrate the project's intellectual merit and broad impact.

AutoPhrase: Automated Phrase Mining from Massive Text Corpora (2018). We develop a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. The research paper has been published as follows: Jingbo Shang, et al., "Automated Phrase Mining from Massive Text Corpora", IEEE TKDE (2018) (330+ citations). The software is open source on GitHub and has been used by many.

JoSE: A joint spherical embedding method for text embedding (2019). We propose a spherical generative model based on which unsupervised word and paragraph embeddings are jointly learned. To learn text embeddings in spherical space, we develop an efficient optimization algorithm with a convergence guarantee. Our model enjoys high efficiency and achieves state-of-the-art performance on various text embedding tasks, including word similarity and document clustering. This work led to the paper: Yu Meng, et al., "Spherical Text Embedding," in NeurIPS'19.
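
As a rough illustration of what "spherical space" means for embeddings, the sketch below projects vectors onto the unit sphere so that similarity reduces to a dot product (cosine similarity). It is not the JoSE training algorithm; the function names and the toy vectors are assumptions made for this example only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Dot product of the unit-length versions of u and v."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Toy 2-d "embeddings" (hypothetical values, not learned from data).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Because both vectors are normalized first, their magnitudes do not affect the score; only the direction matters.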

CatE: Category-Name Guided Text Embedding for Discriminative Topic Mining (2020). We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This novel category-name guided text embedding method for discriminative topic mining effectively leverages minimal user guidance to learn a discriminative embedding space and discover category-representative terms in an iterative manner. CatE mines a high-quality set of topics guided by category names only and benefits a variety of downstream applications, including weakly-supervised classification and lexical entailment direction identification. The study led to a research paper: Yu Meng, et al., "Discriminative Topic Mining via Category-Name Guided Text Embedding", in WWW'20.

TaxoClass: A taxonomy-guided, hierarchical multi-label text classification method, which (1) calculates document-class similarities using a textual entailment model, (2) identifies a document's core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and (3) generalizes the classifier via multi-label self-training. Our experiments show TaxoClass uses only class names but outperforms the best previous method by 25%. The research paper has been published as follows: Jiaming Shen et al., "TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names", in NAACL-HLT'21.
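
The following is a minimal sketch of how step (2) might look, assuming the document-class similarities from step (1) are already available as a plain dictionary. The top-down walk, the `k` and `threshold` parameters, the function names, and the toy taxonomy are all assumptions for illustration and are not taken from the TaxoClass paper.

```python
def top_down_candidates(taxonomy, sim, root, k=1):
    """Walk the taxonomy from the root, keeping the k most similar children per node."""
    candidates = []
    frontier = [root]
    while frontier:
        next_frontier = []
        for node in frontier:
            children = taxonomy.get(node, [])
            # Rank this node's children by their similarity to the document.
            best = sorted(children, key=lambda c: sim.get(c, 0.0), reverse=True)[:k]
            candidates.extend(best)
            next_frontier.extend(best)
        frontier = next_frontier
    return candidates


def confident_core_classes(candidates, sim, threshold=0.5):
    """Keep only candidates whose similarity clears a confidence threshold."""
    return [c for c in candidates if sim.get(c, 0.0) >= threshold]


# Hypothetical taxonomy (parent -> children) and similarity scores for one document.
taxonomy = {
    "root": ["science", "sports"],
    "science": ["biology", "physics"],
    "sports": ["soccer"],
}
sim = {"science": 0.9, "sports": 0.2, "biology": 0.8, "physics": 0.3, "soccer": 0.1}

candidates = top_down_candidates(taxonomy, sim, "root", k=1)
core = confident_core_classes(candidates, sim, threshold=0.5)
```

In this toy run the walk follows "science" and then "biology", and both clear the threshold; a real system would obtain the scores from an entailment model and then train a classifier on the confident core classes.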

EvMine: Unsupervised Key Event Detection from Massive Text Corpus. Real-world events have different granularities, from top-level themes to key events and then to event mentions corresponding to concrete actions. We propose a new task, key event detection at the intermediate level, which aims to detect key events from a news corpus (e.g., HK Airport Protest on Aug. 12-14), each happening at a particular time/location and focusing on the same topic. We develop an unsupervised key event detection framework, EvMine. Extensive experiments show EvMine outperforms all the baseline methods and its ablations on two real-world news corpora. The research paper has been published as follows: Yunyi Zhang, et al., "Unsupervised Key Event Detection from Massive Text Corpus", in KDD'22.

Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks, which performs contextualized text encoding and heterogeneous structure encoding in a unified model. Specifically, we inject heterogeneous structure information into each Transformer layer when encoding node texts. Meanwhile, Heterformer is capable of characterizing node/edge type heterogeneity and encoding nodes with or without texts.   Comprehensive experiments show Heterformer outperforms competitive baselines significantly and consistently.  The research paper has been published as follows: Bowen Jin, et al., "Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks", in KDD'23. 


Last Modified: 10/25/2023
Modified by: Jiawei Han

Please report errors in award information by writing to: awardsearch@nsf.gov.