Award Abstract # 1617408
III: Small: Interactive Construction of Complex Query Models

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: UNIVERSITY OF MASSACHUSETTS
Initial Amendment Date: July 21, 2016
Latest Amendment Date: July 21, 2016
Award Number: 1617408
Award Instrument: Standard Grant
Program Manager: Wei-Shinn Ku
IIS (Division of Information & Intelligent Systems)
CSE (Directorate for Computer and Information Science and Engineering)
Start Date: July 15, 2016
End Date: June 30, 2020 (Estimated)
Total Intended Award Amount: $515,994.00
Total Awarded Amount to Date: $515,994.00
Funds Obligated to Date: FY 2016 = $515,994.00
History of Investigator:
  • James Allan (Principal Investigator)
    allan@cs.umass.edu
Recipient Sponsored Research Office: University of Massachusetts Amherst
101 COMMONWEALTH AVE, AMHERST, MA 01003-9252, US
(413) 545-0698
Sponsor Congressional District: 02
Primary Place of Performance: University of Massachusetts Amherst, MA 01003-9264, US
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): VGJHK59NMPK9
Parent UEI: VGJHK59NMPK9
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This research program will investigate and implement SearchIE, a search-based approach to information "extraction." SearchIE will allow rapid, personalized, situational identification of types of objects or actions in text, where those types are likely to be useful for a complex search task. Modern search engines often provide some mechanism to indicate that a query keyword matches a document only if it occurs in the name of a person or in a location. To make that possible, annotators first find and mark a large number of person names (for example) in text, a machine learning algorithm is applied to learn which low-level features are indicative of the name type, and the resulting classifier for that type is then run across the collection of documents. It is then possible to write a query that means "paris used as a person's name rather than a location." Unfortunately, the existing approaches do not serve searchers interested in novel, unanticipated types - for example, names of whaling ships, officers in Queen Victoria's navy, or local watering holes. Such examples cannot be handled currently because the classifiers need to be trained and run ahead of time, an expensive data labeling process that is too daunting for many search tasks. Since on-line information gathering almost always starts with search and frequently involves identifying items of interest in the found text, bringing these two together has the potential to change both substantially. The SearchIE approach makes it possible for someone to build personalized extractors contextualized by their topical interests. The result is that the technology can radically improve online searching for lay persons as well as professionals by significantly reducing the time needed to turn queries into relevant information.
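As an illustration of the traditional "annotate then detect" pipeline described above, the following minimal sketch (in Python; the names documents and ner_model are placeholders, not part of the project) shows how a collection might be pre-annotated with entity types so that a later query can require a type such as "person":

    from collections import defaultdict

    def annotate_collection(documents, ner_model):
        """Run a (hypothetical) pre-trained tagger over every document ahead of time."""
        typed_index = defaultdict(set)           # (term, type) -> ids of documents containing that typed term
        for doc_id, text in documents.items():
            for term, etype in ner_model(text):  # e.g. ("paris", "PERSON")
                typed_index[(term.lower(), etype)].add(doc_id)
        return typed_index

    def typed_search(typed_index, term, required_type):
        """Answer a query like 'paris used as a person name, not a location'."""
        return typed_index.get((term.lower(), required_type), set())

    # Hypothetical usage: documents matching "paris" only where it was tagged as a person's name.
    # matches = typed_search(annotate_collection(documents, ner_model), "paris", "PERSON")

SearchIE aims to avoid exactly this offline annotation step when the type of interest was not anticipated.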

It does not appear that the information extraction task has ever been approached directly as a search task. SearchIE is unique in bringing an information retrieval (search) mindset to the extraction problem, providing new capabilities that are either impossible or extremely difficult in the traditional "annotate then detect" model of the problem. This project will investigate the fundamental issues raised by the SearchIE approach. What models can best integrate extraction and search in new settings where they can truly happen simultaneously? How can a searcher describe and edit a model for the types of interest? Can an interactively developed model be a springboard into a machine-learned model, and when is there enough information to do that? Does using topical context to limit the scope of extraction provide the expected accuracy gains using SearchIE's approach? What data structure modifications are needed to fully implement SearchIE so that it is efficient as well as effective? How well does this approach fare on additional standard test collections? The systems and algorithmic issues addressed here are fundamental problems that have the potential to greatly impact both search and extraction. For further information, see the project's web site at http://ciir.cs.umass.edu/research/searchie.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cohen, D., Foley, J., Zamani, H., Allan, J. and Croft, W. B. "Universal Approximation Functions for Fast Learning to Rank." Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), 2018, p. 1017. doi:10.1145/3209978.3210137
Dalton, J., Naseri, S., Dietz, L. and Allan, J. "Local and Global Query Expansion for Hierarchical Complex Topics." Proceedings of the European Conference on Information Retrieval (ECIR 2019), 2019, p. 290. doi:10.1007/978-3-030-15712-8_19
Foley, J., O'Connor, B. and Allan, J. "Improving Entity Ranking for Keyword Queries." Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), 2016, p. 2061. doi:10.1145/2983323.2983909
Foley, J., Sarwar, S. M. and Allan, J. "Named Entity Recognition with Extremely Limited Data." Proceedings of the ACM SIGIR 2018 Workshop on Learning from Limited or Noisy Data (LND4IR '18), Ann Arbor, Michigan, USA, July 12, 2018.
Montazeralghaem, A., Rahimi, R. and Allan, J. "Term Discrimination Value for Cross-Language Information Retrieval." Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR 2019), 2019, p. 137. doi:10.1145/3341981.3344252
Montazeralghaem, A., Zamani, H. and Allan, J. "A Reinforcement Learning Framework for Relevance Feedback." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020, p. 59. doi:10.1145/3397271.3401099
Montazeralghaem, A., Rahimi, N. and Allan, J. "Relevance Ranking based on Query-Aware Context Analysis." Proceedings of the 42nd European Conference on Information Retrieval (ECIR 2020), 2020, p. 446. doi:10.1007/978-3-030-45439-5_30
Sarwar, S. and Allan, J. "SearchIE: A Retrieval Approach for Information Extraction." Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR 2019), 2019, p. 249. doi:10.1145/3341981.3344248

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of the SearchIE project was to explore the use of search to solve problems where an annotation-based approach is overly resource- and time-consuming. For example, consider someone reading the news and coming across an article listing a few winners of the Iditarod (dog sled) race, then wondering who else had won that race previously. A classic information extraction approach to this problem would be to mark a collection of documents with known winners, train a word classification system, and then run that classifier on an entire collection of documents. The time spent on all of those steps is not worthwhile for a passing curiosity or something with so few instances. In the SearchIE project we explored methods that allow the reader to mark a few interesting passages (the list of winners), perhaps highlight the actual names to make the need clear, and then have the system immediately find matches by searching based on underlying features rather than exhaustively scanning the entire collection.
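As a rough illustration of this search-rather-than-scan idea (a simplified sketch, not the project's released code; the tokenization, stop list, and overlap scoring below are placeholder choices), a handful of highlighted passages can be turned into a query whose best-matching passages become the candidates for further inspection:

    import re
    from collections import Counter

    def build_query(highlighted_passages, top_k=10):
        """Pick the most frequent content words from the passages the reader marked."""
        stop = {"the", "a", "an", "of", "in", "and", "to", "was", "who"}
        counts = Counter(
            w for p in highlighted_passages
            for w in re.findall(r"[a-z]+", p.lower())
            if w not in stop
        )
        return [w for w, _ in counts.most_common(top_k)]

    def score_passage(query_terms, passage):
        """Simple term-overlap score; a real system would use a tuned retrieval model."""
        words = set(re.findall(r"[a-z]+", passage.lower()))
        return sum(1 for t in query_terms if t in words)

    def retrieve(query_terms, collection, n=20):
        """Return the n passages most similar to the highlighted examples."""
        return sorted(collection, key=lambda p: score_passage(query_terms, p), reverse=True)[:n]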

The primary work in this vein considered how to add to a known set of entities, either by providing a handful of sample entities (e.g., winners of the Iditarod, presidents of the U.S.) or one or two sentences including an entity and its context (e.g., a couple of sentences describing actions of a U.S. president). As highlighted above, our approach is to extract important features from entities or their context and use those to search a collection of documents for the additional entities -- rather than scanning the entire collection word-by-word to find instances that match. We considered this both in a completely automatic setting -- given some examples, find more -- and in an interactive setting -- allowing the system to request additional clarifying information from the user. Over the life of this grant, we demonstrated that the types of features used for searching are not the same as those an exhaustive classifier might use and that a strict subset is equally effective, that some useful features are very hard to describe to a user and so cannot be used in an interactive setting, and that a person is much more adept than an automatic system at identifying important features and their relative importance for this task.
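The following sketch gives one simple reading of that feature-based expansion step (illustrative only; the window size, tokenization, and overlap measure are assumptions, not the project's implementation): seed entities are characterized by the words surrounding their mentions, and candidate entities are ranked by how similar their contexts are to the seeds':

    import re
    from collections import Counter

    def context_features(text, entity, window=4):
        """Words appearing within `window` tokens of any mention of `entity`."""
        tokens = re.findall(r"\w+", text.lower())
        target = entity.lower().split()
        feats = Counter()
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                feats.update(tokens[max(0, i - window):i])
                feats.update(tokens[i + len(target):i + len(target) + window])
        return feats

    def rank_candidates(texts, seed_entities, candidate_entities):
        """Score candidates by how much their contexts overlap with the seeds' contexts."""
        seed_profile = Counter()
        for t in texts:
            for s in seed_entities:
                seed_profile.update(context_features(t, s))
        scores = {}
        for c in candidate_entities:
            cand_profile = Counter()
            for t in texts:
                cand_profile.update(context_features(t, c))
            scores[c] = sum(min(seed_profile[w], cand_profile[w]) for w in cand_profile)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)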

We extended the ideas of this proposal to the domain of poetry, where classic classification algorithms almost uniformly fail because of atypical structure, punctuation, and capitalization. We showed that our approach can be used to find poems in a large collection of scanned books.

In the early stages of this work, we felt that modern neural approaches could not be applied because of a paucity of training data. Toward the end of the work we developed techniques that generated "weak" (less than 100% correct) training data at a sufficiently large scale that some neural approaches could be trained and evaluated; we found that those techniques successfully addressed this task. We note that this use of data is routine in the field but had never been successfully applied to this problem. We were able to design and successfully use modern "deep learning" approaches to address the task. We were also able to employ reinforcement learning approaches, where a final success/failure reward function is sufficient to determine the parameters of a complex algorithm.
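A minimal sketch of the weak-labeling idea (the heuristic, the seed list, and the downstream classifier choice are assumptions for illustration, not the techniques actually used in the project) is:

    def weak_label(sentence, seed_entities):
        """Noisy label: 1 if the sentence mentions any known seed entity, else 0."""
        s = sentence.lower()
        return int(any(e.lower() in s for e in seed_entities))

    def build_weak_training_set(sentences, seed_entities):
        """Turn a large pool of unlabeled sentences into (text, noisy_label) pairs."""
        return [(s, weak_label(s, seed_entities)) for s in sentences]

    # Any standard text classifier (for example, a bag-of-words logistic regression)
    # can then be trained on these pairs; the point is only that large, cheap,
    # imperfect labels can stand in for scarce hand annotation.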

This grant supported the training of five PhD students and indirectly helped train one postdoctoral fellow. It resulted in 14 publications and contributed directly to one completed PhD thesis, as well as to a handful still being written.


Last Modified: 10/28/2020
Modified by: James M Allan

