NSF Org: | IIS Division of Information & Intelligent Systems |
Initial Amendment Date: | July 21, 2016 |
Latest Amendment Date: | July 21, 2016 |
Award Number: | 1617408 |
Award Instrument: | Standard Grant |
Program Manager: | Wei-Shinn Ku, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 15, 2016 |
End Date: | June 30, 2020 (Estimated) |
Total Intended Award Amount: | $515,994.00 |
Total Awarded Amount to Date: | $515,994.00 |
Recipient Sponsored Research Office: | 101 Commonwealth Ave, Amherst, MA 01003-9252, US; (413) 545-0698 |
Primary Place of Performance: | MA, US 01003-9264 |
NSF Program(s): | Info Integration & Informatics |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This research program will investigate and implement SearchIE, a search-based approach to information "extraction." SearchIE will allow rapid, personalized, situational identification of types of objects or actions in text, where those types are likely to be useful for a complex search task. Modern search engines often provide some mechanism to indicate that a query keyword matches a document only if it occurs in the name of a person or in a location. To make that possible, annotators find and mark a large number of person names (for example) in text, a machine learning algorithm learns which low-level features are indicative of the name type, and the resulting classifier for that type is then run across the collection of documents. It is then possible to write a query that means "paris used as a person's name rather than a location." Unfortunately, the existing approaches do not serve searchers interested in novel, unanticipated types - for example, names of whaling ships, officers in Queen Victoria's navy, or local watering holes. Such examples cannot be handled currently because the classifiers need to be trained and run ahead of time, an expensive data labeling process that is too daunting for many search tasks. Since on-line information gathering almost always starts with search and frequently involves identifying items of interest in the found text, bringing these two together has the potential to change both substantially. The SearchIE approach makes it possible for someone to build personalized extractors contextualized by their topical interests. The result is that the technology can radically improve online searching for lay persons as well as professionals by significantly reducing the time needed to turn queries into relevant information.
It does not appear that the information extraction task has ever been approached directly as a search task. SearchIE is unique in bringing an information retrieval (search) mindset to the extraction problem, providing new capabilities that are either impossible or extremely difficult in the traditional "annotate then detect" model of the problem. This project will investigate the fundamental issues raised by the SearchIE approach. What models can best integrate extraction and search in new settings where they can truly happen simultaneously? How can a searcher describe and edit a model for the types of interest? Can an interactively developed model be a springboard into a machine-learned model, and when is there enough information to do that? Does using topical context to limit the scope of extraction provide the expected accuracy gains using SearchIE's approach? What data structure modifications are needed to fully implement SearchIE so that it is efficient as well as effective? How well does this approach fare on additional standard test collections? These systems and algorithmic issues are fundamental problems whose solutions have the potential to greatly impact both search and extraction. For further information, see the project's web site at http://ciir.cs.umass.edu/research/searchie.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites, whose policies may differ from those of this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The goal of the SearchIE project was to explore the use of search to solve problems where an annotation-based approach would be overly resource- and time-consuming. For example, consider someone reading news, coming across an article listing a few winners of the Iditarod (a dog-sled race), and then wondering who else had won that race previously. A classic information extraction approach to this problem would be to mark a collection of documents with known winners, train a word classification system, and then run that classifier on an entire collection of documents. The time spent on all of those steps is not worthwhile for a passing curiosity or for a type with so few instances. In the SearchIE project we explored methods that allow the reader to mark a few interesting passages (the list of winners), perhaps be asked to highlight the actual names to make the need clear, and then allow the system to immediately find matches by searching based on underlying features rather than exhaustively scanning the entire collection.
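The workflow described above can be illustrated with a minimal sketch: derive context features from a few user-marked passages, then use those features as a query against an inverted index instead of running a classifier over every document. All function names here are hypothetical illustrations, not part of the SearchIE system itself.

```python
from collections import Counter, defaultdict

def context_features(passage, entity, window=2):
    """Collect words within `window` tokens of the marked entity."""
    tokens = passage.lower().split()
    ent = entity.lower().split()
    feats = Counter()
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            lo, hi = max(0, i - window), i + len(ent) + window
            feats.update(tokens[lo:i] + tokens[i + len(ent):hi])
    return feats

def build_index(docs):
    """Tiny inverted index: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.lower().split():
            index[w].add(doc_id)
    return index

def search(index, feats, top_k=3):
    """Score documents by summed weights of matching features;
    only documents sharing some feature are ever touched."""
    scores = Counter()
    for w, weight in feats.items():
        for doc_id in index.get(w, ()):
            scores[doc_id] += weight
    return scores.most_common(top_k)
```

Because scoring walks only the posting lists of the query features, the cost depends on how many documents share those features, not on the size of the whole collection.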
The primary work in this vein considered how to add to a known set of entities, either by providing a handful of sample entities (e.g., winners of the Iditarod, presidents of the U.S.) or 1-2 sentences including an entity and context (e.g., a couple of sentences describing the actions of a U.S. president). As highlighted above, our approach is to extract important features from entities or their context and use those to search a collection of documents for the additional entities -- rather than scanning the entire collection word-by-word to find instances that match. We considered this both in a completely automatic setting -- given some examples, find more -- and an interactive setting -- allowing the system to request additional clarifying information from the user. Over the life of this grant, we demonstrated that the types of features used for searching are not the same as those an exhaustive classifier might use and that a strict subset is equally effective, that some useful features are very hard to describe to a user and so cannot be used in an interactive setting, and that a person is much more adept than an automatic system at identifying important features and their relative importance for this task.
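One way to picture the "strict subset of features" finding is to keep only the features shared by every seed example's context, since those are the ones most likely to generalize. This is a crude illustrative stand-in for the project's feature selection, with a hypothetical helper name:

```python
from collections import Counter
from functools import reduce

def shared_features(feature_sets):
    """Keep only features that occur in every seed example's
    context, summing their counts -- a rough proxy for selecting
    the small, discriminative subset of features."""
    common = reduce(lambda a, b: a & b, (set(f) for f in feature_sets))
    merged = Counter()
    for feats in feature_sets:
        for w in common:
            merged[w] += feats[w]
    return merged
```

In practice each seed entity contributes one feature set (e.g., from `context_features`-style extraction), and the intersection discards idiosyncratic context words that appear near only one seed.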
We extended the ideas of this proposal to the domain of poetry, where classic classification algorithms almost uniformly fail because of atypical structure, punctuation, and capitalization. We showed that our approach can be used to find poems in a large collection of scanned books.
In the early stages of this work, we felt that modern neural approaches could not be applied because of a paucity of training data. Toward the end of the work we developed techniques that generated "weak" (less than 100% correct) training data at a sufficiently large scale that some neural approaches could be trained and evaluated; we found that those techniques successfully addressed this task. We note that this use of data is routine in the field but had never been successfully used for this problem. We were able to design and successfully use modern "deep learning" approaches to address the task. We were also able to employ reinforcement learning approaches, where a final success/failure reward function is sufficient to determine the parameters of a complex algorithm.
This grant supported the training of five PhD students and indirectly helped train one postdoctoral fellow. It resulted in 14 publications and contributed directly to one completed PhD thesis as well as to a handful still in progress.
Last Modified: 10/28/2020
Modified by: James M Allan