
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 1, 2010 |
Latest Amendment Date: | August 1, 2010 |
Award Number: | 1016754 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2010 |
End Date: | July 31, 2013 (Estimated) |
Total Intended Award Amount: | $183,736.00 |
Total Awarded Amount to Date: | $183,736.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
633 CLARK ST EVANSTON IL US 60208-0001 (312)503-7955 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
633 CLARK ST EVANSTON IL US 60208-0001 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This project studies methods for extracting accurate knowledge bases from the
Web. Fully automated Web information extraction techniques are massively
scalable, but have accuracy and coverage limitations. This proposal
investigates how to improve automated extraction techniques by introducing
carefully selected human guidance. The proposed system continually extracts
knowledge from the Web, along the way dynamically synthesizing and issuing
queries to humans to increase the accuracy of the system's knowledge base and
extractors.
The approach extends the PI's previous work utilizing statistical language
models (SLMs) for information extraction. Novel SLMs are investigated for
unifying the extraction of relational data expressed in Web tables with
extraction from free text. New active learning techniques utilize the models
to identify "high-leverage" queries -- requesting, for example, textual
extraction patterns that when retrieved from the Web yield thousands of novel
extractions. The queries investigated are mostly amenable to non-experts,
meaning that much of the human input can be acquired at scale via online
mass-collaboration.
The broader impact of this project lies in the potential for accurate Web
extraction to radically improve Web search, allowing users to answer
complicated questions by synthesizing information across multiple Web pages.
In domains like medicine and biology, mining extracted knowledge bases could
lead to important discoveries and novel therapies.
Further information may be found at the project web page:
http://wail.eecs.northwestern.edu/projects/activelms/index.html
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project studied methods for automatically extracting knowledge bases from the World Wide Web. The goal behind our work is to transform the Web’s vast human-readable content into machine-understandable knowledge. This capability would enable transformative technologies, such as new search engines that answer complex questions by synthesizing information scattered across the Web.
We focused on three primary research questions:
- How can we integrate knowledge extracted from both Web text and Web tables?
- How can statistical language models trained over large text corpora help improve extraction accuracy?
- How can an extraction system actively solicit well-selected human input to improve the extraction process?
The project led to the invention of new knowledge extraction techniques, primarily aimed at Wikipedia’s text and data tables. A fundamental knowledge extraction challenge involves automatically identifying relationships between concepts. We developed state-of-the-art methods for estimating the degree of semantic relatedness (SR) between two Wikipedia concepts, along with new methods for explaining the relationships to Web users in natural language. These methods leveraged machine learning techniques to mine Wikipedia’s text, hyperlinks, and categories for semantic information. We also developed new techniques for extracting data from Wikipedia tables and automatically joining together different tables that contain related information.
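As a rough illustration of what a semantic relatedness score between two Wikipedia concepts can look like, the sketch below computes a simple link-overlap (Jaccard) baseline. It is not the machine-learned method developed in this project, which the report does not specify in detail; the function name `link_relatedness`, the `TOY_INLINKS` data, and the choice of Jaccard overlap are all assumptions made for illustration only.

```python
from typing import Dict, Set


def link_relatedness(a: str, b: str, inlinks: Dict[str, Set[str]]) -> float:
    """Jaccard overlap of the sets of pages that link to concepts a and b.

    Returns a value in [0, 1]; 0.0 when neither concept has any known inlinks.
    This is a hypothetical baseline, not the project's learned SR estimator.
    """
    links_a = inlinks.get(a, set())
    links_b = inlinks.get(b, set())
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)


# Toy, made-up link data (concept -> pages linking to it), for illustration only.
TOY_INLINKS: Dict[str, Set[str]] = {
    "Nuclear power": {"Uranium", "Electricity", "Chernobyl disaster"},
    "Uranium": {"Nuclear power", "Electricity", "Chernobyl disaster", "Periodic table"},
    "Jazz": {"Saxophone", "New Orleans"},
}

if __name__ == "__main__":
    for other in ("Uranium", "Jazz"):
        score = link_relatedness("Nuclear power", other, TOY_INLINKS)
        print(f"Nuclear power vs {other}: {score:.2f}")
```

On the toy data, concepts that share many linking pages score higher than concepts that share none. A learned approach such as the one described above could combine signals like this one with text and category features.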
We also developed new methods for scaling up statistical language models (SLMs) for information extraction. “Latent-variable” SLMs have been shown to improve extraction systems, but the memory required to train the models forms a bottleneck. We developed a new method for overcoming the memory bottleneck, based on intelligently partitioning the corpus across a parallel computing cluster. Our experiments showed that the partitioning method decreases the memory footprint of model training by half for large data sets.
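To make the idea of corpus partitioning more concrete, the sketch below shows one way a partitioning choice can affect per-node memory. It assumes that the memory a node needs grows with the number of distinct words in its share of the corpus, and it uses a greedy heuristic that places each document where it adds the fewest new words. The report does not describe the project's actual partitioning criterion, so the heuristic, the function names `greedy_partition` and `max_vocab`, and the `TOY_DOCS` data are hypothetical.

```python
from typing import List, Set


def greedy_partition(docs: List[List[str]], k: int) -> List[List[int]]:
    """Assign each document to the open partition whose vocabulary grows least.

    A partition is closed once it holds ceil(len(docs) / k) documents, which
    keeps the partitions balanced. Returns k lists of document indices.
    """
    cap = -(-len(docs) // k)  # ceiling division
    vocab: List[Set[str]] = [set() for _ in range(k)]
    parts: List[List[int]] = [[] for _ in range(k)]
    for i, doc in enumerate(docs):
        words = set(doc)
        open_parts = [p for p in range(k) if len(parts[p]) < cap]
        # Prefer the fewest new words; break ties by the smaller partition.
        best = min(open_parts, key=lambda p: (len(words - vocab[p]), len(parts[p])))
        parts[best].append(i)
        vocab[best] |= words
    return parts


def max_vocab(docs: List[List[str]], parts: List[List[int]]) -> int:
    """Largest number of distinct words any single partition must hold."""
    return max(len({w for i in part for w in docs[i]}) for part in parts)


# Toy, made-up corpus for illustration only.
TOY_DOCS = [
    ["uranium", "reactor"],
    ["reactor", "fuel"],
    ["jazz", "saxophone"],
    ["saxophone", "trumpet"],
]

if __name__ == "__main__":
    round_robin = [[0, 2], [1, 3]]  # document i goes to partition i % 2
    greedy = greedy_partition(TOY_DOCS, 2)
    print("round-robin max vocabulary:", max_vocab(TOY_DOCS, round_robin))
    print("greedy max vocabulary:", max_vocab(TOY_DOCS, greedy))
```

On this toy corpus, grouping topically similar documents leaves each partition with a smaller vocabulary than round-robin assignment does. The example illustrates the general intuition only and does not reproduce the memory savings reported above.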
The broader impacts of our work included student training, public prototype applications, and the release of data sets and code to the research community. Multiple PhD, MS, and undergraduate students participated in our research and co-authored publications. We also delivered a public prototype demonstrating our table extraction research, called “WikiTables.” An additional public prototype of the “Atlasify” system, which uses our semantic relatedness research to create interactive visualizations of query concepts (e.g., “nuclear power”) on familiar reference systems (e.g., the World Map or periodic table), is under development. We disseminated our work to the research community in the form of multiple papers at major conferences and workshops, and we released other resources (including a codebase for our SLM training technique, new datasets for SR and table extraction, and a scalable public API for computing SR). The papers, prototypes, and other research products are publicly available. For further information, please consult the project Web site: http://websail.cs.northwestern.edu/activelms/
Last Modified: 10/17/2013
Modified by: Douglas C Downey