Award Abstract # 1514053
III: Medium: Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF MASSACHUSETTS
Initial Amendment Date: August 24, 2015
Latest Amendment Date: September 17, 2015
Award Number: 1514053
Award Instrument: Continuing Grant
Program Manager: Hector Munoz-Avila
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2015
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $1,000,000.00
Total Awarded Amount to Date: $1,000,000.00
Funds Obligated to Date: FY 2015 = $1,000,000.00
History of Investigator:
  • Andrew McCallum (Principal Investigator)
    mccallum@cs.umass.edu
Recipient Sponsored Research Office: University of Massachusetts Amherst
101 COMMONWEALTH AVE
AMHERST
MA  US  01003-9252
(413)545-0698
Sponsor Congressional District: 02
Primary Place of Performance: University of Massachusetts Amherst
MA  US  01003-9242
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): VGJHK59NMPK9
Parent UEI: VGJHK59NMPK9
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7924
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Automated knowledge base (KB) construction from natural language is of fundamental importance to (a) scientists (for example, there has been long-standing interest in building KBs of genes and proteins), (b) social scientists (for example, building social networks from textual data), and (c) national defense (where network analysis of criminals and terrorists has proven useful). The core of a knowledge base is its objects ("entities", such as proteins, people, organizations and locations) and the connections between these objects ("relations", such as one protein increasing production of another, or a person working for an organization). This project aims to greatly increase the accuracy with which entity-relations can be extracted from text, as well as the fidelity with which many subtle distinctions among types of relations can be represented. The project's technical approach -- which we call "universal schema" -- is a markedly novel departure from traditional methods, based on representing all of the input relation expressions as positions in a common multi-dimensional space, with nearby relations having similar meanings. Broader impacts will include collaboration with industry on applications of economic importance; collaboration with academic non-computer-scientists on a multidisciplinary application; creating and publicly releasing new data sets for benchmark evaluation by ourselves and others (enabling scientific progress through improved performance comparisons); and creating and publicly releasing an open-source implementation of our methods (enabling further scientific research, easy large-scale use, rapid commercialization and third-party enhancements). Education impacts include creating and teaching a new course on knowledge base construction for the sciences, organizing a research workshop on embeddings, extraction and knowledge representation, and training multiple undergraduate and graduate students.

Most previous research in relation extraction falls into one of two categories. In the first, one must define a pre-fixed schema of relation types (such as lives-in, employed-by and a handful of others), which limits expressivity and hides language ambiguities. Training machine learning models here either relies on labeled training data (which is scarce and expensive), or uses lightly-supervised self-training procedures (which are often brittle and wander farther from the truth with additional iterations). In the second category, one extracts into an "open" schema based on language strings themselves (lacking the ability to generalize among them), or attempts to gain generalization with unsupervised clustering of these strings (suffering from clusters that fail to capture reliable synonyms, or even to find the desired semantics at all). This project proposes research on relation extraction with "universal schema", in which we learn a generalizing model of the union of all input schemas, including multiple available pre-structured KBs as well as all the observed natural language surface forms. The approach thus embraces the diversity and ambiguity of original language surface forms (not trying to force relations into pre-defined boxes), yet also successfully generalizes by learning non-symmetric implicature among explicit and implicit relations using new extensions to the probabilistic matrix factorization and vector embedding methods that were so successful in the Netflix Prize competition. Universal schema provide for a nearly limitless diversity of relation types (due to surface forms), and support convenient semi-supervised learning through integration with existing structured data (i.e., the relation types of existing databases). In preliminary experiments, the approach has already surpassed the previous state-of-the-art relation extraction methods by a wide margin on a benchmark task.
Newly proposed research includes new training processes; new representations that include multiple senses for the same surface form as well as embeddings with variances; new methods of incorporating constraints; joint inference between entity- and relation-types; new models of non-binary and higher-order relations; and scalability through parallel distribution. The project web site (http://www.iesl.cs.umass.edu/projects/NSF_USchema.html) will include information on the project and provide access to data sets, source code and documentation, teaching and workshop materials, and publications. In addition, datasets will be disseminated via the UCI Machine Learning Repository (or another similar archive for machine learning data) to facilitate sharing with other researchers and ensure long-term availability, and GitHub will be used to facilitate the release, sharing, and archiving of code.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 19)
Derek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, Andrew McCallum "Optimal Transport-based Alignment of Learned Character Representations for String Similarity" In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Luna Dong, Andrew McCallum "OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference" In the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019
Emma Strubell and Andrew McCallum "Dependency Parsing with Dilated Iterated Graph CNNs" CoRR, v.abs/170, 2017
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum "Linguistically-Informed Self-Attention for Semantic Role Labeling" In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , 2018
Haw-Shiuan Chang, Amol Agrawal, Ananya Ganesh, Anirudha Desai, Vinayak Mathur, Alfred Hough, Andrew McCallum "Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings" TextGraphs: Workshop on Graph-based Methods for Natural Language Processing at NAACL, 2018
Haw-Shiuan Chang, Erik G. Learned-Miller, and Andrew McCallum "Active Bias: Training a More Accurate Neural Network by Emphasizing High Variance Samples" CoRR, v.abs/170, 2017
Haw-Shiuan Chang, Shankar Vembu, Sunil Mohan, Rheeya Uppaal, Andrew McCallum "Using error decay prediction to overcome practical issues of deep active learning for named entity recognition" Machine Learning Journal, v.109, 2020, p.1749 https://doi.org/10.1007/s10994-020-05897-1
Haw-Shiuan Chang, ZiYun Wang, Luke Vilnis, Andrew McCallum "Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection" The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), 2018
Nathan Greenberg, Trapit Bansal, Patrick Verga, Andrew McCallum "Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets" In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , 2018
Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum "Multilingual Relation Extraction using Compositional Universal Schema" North American Chapter of the Association for Computational Linguistics (NAACL), 2016
Patrick Verga, Emma Strubell, Andrew McCallum. "Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction." The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018) , 2018

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A knowledge base (KB) is a structured, navigable map of knowledge about objects ("entities", such as proteins, people, organizations and locations) and the connections between them ("relations", such as one protein increasing production of another, or a person working for an organization).  KBs provide a queryable, browsable, and machine-readable view of knowledge that would otherwise usually be spread across many natural language documents.  For example, given only a document collection, biomedical scientists can merely search for documents and read them; given a KB, however, they can browse a navigable map from gene to protein to disease to symptom---a map that integrates disparate information about any one gene from across many scientific documents.  KBs are of fundamental importance to accelerating scientific progress, national defense, business, and many other disciplines.

Automated knowledge base construction (AKBC) is the process of using natural language processing (NLP) and other artificial intelligence (AI) techniques to build the KB from massive document collections without requiring humans to do data-entry.  It is a tremendously challenging task that requires subtle interpretation, resolving ambiguities, and flexible knowledge representation.

This project has developed new methods in machine learning (ML), NLP and AI that greatly increase the accuracy with which entity-relations can be extracted from text, as well as increase the fidelity with which many subtle distinctions among types of relations can be represented. The project's technical approach---which we call "universal schema"---is based on representing all of the input relation expressions as vector positions in a common multi-dimensional space, with nearby relations having similar meanings.  

Prior to this project, most research in relation extraction fell into one of two categories. In the first, one must define a pre-fixed schema of relation types (such as "lives-in", "employed-by", and a handful of others), which limits expressivity and hides language ambiguities. In the second category, one extracts into an "open" schema based on language strings themselves (lacking the ability to generalize among them), or attempts to gain generalization with unsupervised clustering of these strings (suffering from clusters that fail to capture reliable synonyms, or even to find the desired semantics at all).

In our "universal schema" we avoid this dichotomy, using a vector space to learn a model of the union of all input schemas, including multiple available pre-structured KBs as well as all the observed natural language surface forms. Our approach thus embraces the diversity and ambiguity of original language surface forms (not trying to force relations into predefined categories), yet also successfully generalizes by learning non-symmetric implicature among explicit and implicit relations using new extensions to the probabilistic matrix factorization and vector embedding methods that have been so successful in recommender systems and deep neural networks.
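Concretely, the idea can be sketched as low-rank factorization of a binary matrix whose rows are entity pairs and whose columns are the union of KB relation types and textual surface patterns. The sketch below is illustrative only: the entity pairs and relation strings are invented, and plain truncated SVD stands in for the probabilistic matrix factorization and embedding models used in the actual work.

```python
import numpy as np

# Toy universal-schema matrix: rows are entity pairs, columns are the union
# of a structured KB relation and textual surface patterns. All names here
# are invented for illustration, not the project's actual data.
pairs = ["(Smith, IBM)", "(Jones, Oracle)", "(Lee, UMass)",
         "(Wu, Google)", "(Brown, Boston)", "(Davis, Paris)"]
relations = ["employed-by",        # structured KB relation
             "X works for Y",      # textual surface form
             "X was born in Y"]    # textual surface form

# 1 = the relation was observed for that entity pair (from a KB or text).
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 0],   # for Lee, only the textual pattern was seen
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

# Low-rank factorization: truncated SVD here, standing in for the
# probabilistic factorization used in the actual work.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
pair_vecs = U[:, :k] * s[:k]    # entity-pair embeddings
rel_vecs = Vt[:k].T             # relation embeddings
M_hat = pair_vecs @ rel_vecs.T  # reconstructed (completed) matrix

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Relations that co-occur across entity pairs land near each other in the
# shared space, so the KB relation aligns with its textual paraphrase:
print(cos(rel_vecs[0], rel_vecs[1]))  # employed-by vs "X works for Y": high
print(cos(rel_vecs[0], rel_vecs[2]))  # employed-by vs "X was born in Y": ~0

# Generalization: the unobserved cell (Lee, employed-by) is filled in via
# the shared latent space, while (Brown, employed-by) stays near zero.
print(M_hat[2, 0], M_hat[4, 0])
```

In the real models a logistic link and learned, regularized factors replace the raw SVD, but the mechanism is the same: textual patterns and KB relations share one embedding space, so observed surface forms imply unobserved KB facts.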

 

Intellectual Merit

Our approach is a markedly novel departure from previous methods. Our latent embedding vectors for entities and relations offer an intriguing, flexible approach to semantics, relational implicature, and semi-supervised training. Universal schema provide for a nearly limitless diversity of relation types (coming from the subtleties of natural language), and also support convenient semi-supervised learning through integration with existing structured data (i.e., the relation types and schema of existing databases).

The work supported by this grant bore out our optimism about our approach.  We surpassed previous state-of-the-art results in many areas, including the NIST TAC-Knowledge Base Population task.  We developed new machine learning methods and natural language processing techniques, publishing over 20 peer-reviewed research papers.  We developed intellectual connections between “universal schema” and other areas of ML and NLP, including joint inference, deep neural networks, parsing, coreference, and question answering.

 

Broader Impacts

Information overload has become an increasingly burdensome problem across many fields of high national priority, including biomedicine, material science, national defense, business decision-making, and many other areas.  Improved methods for building and maintaining knowledge bases are a key ingredient to help decision-makers navigate high-information domains.  The work supported by this grant has yielded new methods that not only have merit through intellectual novelty, but also yielded broad practical impact.  

Our “universal schema” method and its successors are now widely used in industry, in both large companies such as IBM, Oracle and Google, and startup companies such as Lexalytics.  External deployments of our ideas have been employed in a wide variety of fields, including biomedicine, business decision-making, and national defense.

We have made more than five open-source software releases implementing our methods, enabling others to apply and extend our work.  We have released multiple datasets allowing others to test their methods and compare them to ours.

The work of this project has also supported the creation of new scientific communities.  In 2019 the PI launched (and served as the General Chair of) the first international conference on “Automated Knowledge Base Construction” which is now in its third successful year, bringing together researchers from ML, NLP, semantic web, and databases.

 


Last Modified: 02/11/2021
Modified by: Andrew K McCallum
