Award Abstract # 1016754
III: Small: Active Learning of Language Models for Information Extraction

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: NORTHWESTERN UNIVERSITY
Initial Amendment Date: August 1, 2010
Latest Amendment Date: August 1, 2010
Award Number: 1016754
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 15, 2010
End Date: July 31, 2013 (Estimated)
Total Intended Award Amount: $183,736.00
Total Awarded Amount to Date: $183,736.00
Funds Obligated to Date: FY 2010 = $183,736.00
History of Investigator:
  • Douglas Downey (Principal Investigator)
    dougd@allenai.org
Recipient Sponsored Research Office: Northwestern University
633 CLARK ST
EVANSTON
IL  US  60208-0001
(312)503-7955
Sponsor Congressional District: 09
Primary Place of Performance: Northwestern University
633 CLARK ST
EVANSTON
IL  US  60208-0001
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): EXZVPWZBLUE8
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This project studies methods for extracting accurate knowledge bases from the
Web. Fully-automated Web information extraction techniques are massively
scalable, but have accuracy and coverage limitations. This proposal
investigates how to improve automated extraction techniques by introducing
carefully-selected human guidance. The proposed system continually extracts
knowledge from the Web, along the way dynamically synthesizing and issuing
queries to humans to increase the accuracy of the system's knowledge base and
extractors.

The approach extends the PI's previous work utilizing statistical language
models (SLMs) for information extraction. Novel SLMs are investigated for
unifying the extraction of relational data expressed in Web tables with
extraction from free text. New active learning techniques utilize the models
to identify "high-leverage" queries -- requesting, for example, textual
extraction patterns that when retrieved from the Web yield thousands of novel
extractions. The queries investigated are mostly amenable to non-experts,
meaning that much of the human input can be acquired at scale via online
mass-collaboration.
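
As a rough illustration of what selecting "high-leverage" queries could look like, the sketch below ranks candidate extraction patterns by the expected number of novel extractions each would yield if a human confirmed it. This is not the project's actual selection method. The class name, the fields, the scoring rule, and all numbers are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    """A hypothetical question for a human: 'is this extraction pattern valid?'"""
    pattern: str        # textual extraction pattern, e.g. "X is the capital of Y"
    p_valid: float      # model's estimated probability that the pattern is valid
    novel_matches: int  # estimated number of not-yet-extracted Web matches


def expected_yield(query: CandidateQuery) -> float:
    """Expected number of novel extractions gained by asking about this pattern."""
    return query.p_valid * query.novel_matches


def select_queries(candidates, budget):
    """Pick the `budget` candidates with the highest expected yield."""
    return sorted(candidates, key=expected_yield, reverse=True)[:budget]


# Toy, made-up candidates (assumed values, not measurements).
candidates = [
    CandidateQuery("X is the capital of Y", p_valid=0.9, novel_matches=4000),
    CandidateQuery("X , Y", p_valid=0.01, novel_matches=50000),
    CandidateQuery("X was born in Y", p_valid=0.8, novel_matches=1000),
]

chosen = select_queries(candidates, budget=2)
for q in chosen:
    print(q.pattern, expected_yield(q))
```

Under this assumed scoring rule, a very frequent but unreliable pattern such as "X , Y" ranks below rarer, more reliable patterns, so the limited human budget goes to the questions expected to pay off most.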

The broader impact of this project lies in the potential for accurate Web
extraction to radically improve Web search, allowing users to answer
complicated questions by synthesizing information across multiple Web pages.
In domains like medicine and biology, mining extracted knowledge bases could
lead to important discoveries and novel therapies.

Further information may be found at the project web page:
http://wail.eecs.northwestern.edu/projects/activelms/index.html

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project studied methods for automatically extracting knowledge bases from the World Wide Web.  The goal behind our work is to transform the Web’s vast human-readable content into machine-understandable knowledge.  This capability would enable transformative technologies, such as new search engines that answer complex questions by synthesizing information scattered across the Web.

We focused on three primary research questions:

  • How can we integrate knowledge extracted from both Web text and Web tables?
  • How can statistical language models trained over large text corpora help improve extraction accuracy?
  • How can an extraction system actively solicit well-selected human input to improve the extraction process?

The project led to the invention of new knowledge extraction techniques, primarily aimed at Wikipedia’s text and data tables.  A fundamental knowledge extraction challenge involves automatically identifying relationships between concepts.  We developed state-of-the-art methods for estimating the degree of semantic relatedness (SR) between two Wikipedia concepts, along with new methods for explaining the relationships to Web users in natural language.  These methods leveraged machine learning techniques to mine Wikipedia’s text, hyperlinks, and categories for semantic information.  We also developed new techniques for extracting data from Wikipedia tables and automatically joining together different tables that contain related information.
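
To make the notion of semantic relatedness concrete, here is a minimal sketch of one simple hyperlink-based baseline: the Jaccard overlap between the sets of pages that link to each of two concepts. This is a stand-in for illustration, not the machine-learned SR method developed in the project. The function name and the tiny link table are hypothetical toy data, not real Wikipedia statistics.

```python
def jaccard_relatedness(links_a: set, links_b: set) -> float:
    """Jaccard overlap of two link sets: |A & B| / |A | B|, in [0, 1]."""
    union = links_a | links_b
    if not union:
        return 0.0  # no link evidence at all: treat as unrelated
    return len(links_a & links_b) / len(union)


# Toy in-link sets (made up for illustration only).
toy_inlinks = {
    "Nuclear power": {"Uranium", "Energy", "Nuclear reactor", "Electricity"},
    "Nuclear reactor": {"Uranium", "Energy", "Nuclear power", "Physics"},
    "Jazz": {"Music", "Saxophone"},
}

sr_related = jaccard_relatedness(
    toy_inlinks["Nuclear power"], toy_inlinks["Nuclear reactor"]
)
sr_unrelated = jaccard_relatedness(toy_inlinks["Nuclear power"], toy_inlinks["Jazz"])

print(round(sr_related, 3), sr_unrelated)
```

A learned SR model would combine many such signals (text, hyperlinks, and categories, as described above) rather than rely on a single overlap score.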

We also developed new methods for scaling up statistical language models (SLMs) for information extraction.  “Latent-variable” SLMs have been shown to improve extraction systems, but the memory required to train the models forms a bottleneck.  We developed a new method for overcoming the memory bottleneck, based on intelligently partitioning the corpus across a parallel computing cluster.  Our experiments showed that the partitioning method decreases the memory footprint of model training by half for large data sets.

The broader impacts of our work included student training, public prototype applications, and the release of data sets and code to the research community.  Multiple PhD, MS, and undergraduate students participated in our research and co-authored publications.  We also delivered a public prototype demonstrating our table extraction research, called “WikiTables.”  An additional public prototype, the “Atlasify” system, is under development; it uses our semantic relatedness research to create interactive visualizations of query concepts (e.g., “nuclear power”) on familiar reference systems (e.g., the World Map or the periodic table).  We disseminated our work to the research community through multiple papers at major conferences and workshops, and we released other resources (including a codebase for our SLM training technique, new datasets for SR and table extraction, and a scalable public API for computing SR).  The papers, prototypes, and other research products are publicly available.  For further information, please consult the project Web site: http://websail.cs.northwestern.edu/activelms/

Last Modified: 10/17/2013
Modified by: Douglas C Downey