Award Abstract # 1420667
RI: Small: Improving Crowd-Sourced Annotation by Autonomous Intelligent Agents

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF WASHINGTON
Initial Amendment Date: July 22, 2014
Latest Amendment Date: July 22, 2014
Award Number: 1420667
Award Instrument: Standard Grant
Program Manager: Weng-keen Wong
IIS, Division of Information & Intelligent Systems
CSE, Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2014
End Date: July 31, 2018 (Estimated)
Total Intended Award Amount: $460,000.00
Total Awarded Amount to Date: $460,000.00
Funds Obligated to Date: FY 2014 = $460,000.00
History of Investigator:
  • Daniel Weld (Principal Investigator)
    danw@allenai.org
Recipient Sponsored Research Office: University of Washington
4333 BROOKLYN AVE NE
SEATTLE
WA  US  98195-1016
(206)543-4043
Sponsor Congressional District: 07
Primary Place of Performance: University of Washington
Box 352350
Seattle
WA  US  98195-2350
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): HD1WMN6945W6
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7495, 7923
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Supervised machine learning methods are arguably the greatest success story of Artificial Intelligence, with a deep underlying theory and applications ranging from medical diagnosis and scientific data analysis to e-commerce recommender systems and credit-card fraud detection. Unfortunately, all of these methods require labeled training data that has been annotated by a human, a time-consuming and extremely expensive process. This project will use automated decision theory to control the annotation process, saving significant amounts of human labor and extending the practical use of machine learning to a much broader array of societal problems.

Specifically, the methods address the case where labeled data is crowd-sourced by a large number of human annotators whose skill and error rates are variable. The project develops new control algorithms that let the learner efficiently ask specific workers to label (or redundantly re-label) specific examples. To test the practicality of their methods, the PIs will build and conduct studies with the Information Omnivore, a fully autonomous agent that optimizes the annotation of natural language processing (NLP) training data. By continuously posing questions to paid workers and volunteer citizen-scientists, the Omnivore will 1) learn which problems are hard and which are easy, 2) learn about the skills of the various workers, and 3) decide which questions to ask which workers in order to maximize the accuracy of the learned model given scarce human help. Besides contributing to the science of automated control, the Omnivore will generate labeled training data for two important NLP problems, named entity linking (NEL) and information extraction (IE), greatly helping the community of NLP researchers. Furthermore, the researchers plan a number of outreach efforts, including curriculum development, participation in the K-12 Paws on Science program at the Pacific Science Center, and interaction with the diverse students comprising the Washington State Academic RedShirt (STARS) in Engineering program.

The specific algorithms proposed by the PIs are notable in several respects. Their decision-theoretic optimization framework operationalizes intuitions like (1) one should assign more or better workers to hard problems and (2) one should redirect effort away from easy questions or from tasks that are too hard to solve. Automating this reasoning is hard because problem difficulty and worker skill are latent variables; the agent must therefore confront an exploration/exploitation tradeoff as it balances actions that let it learn about the capabilities of workers against the ultimate goal of producing quality annotations.

The PIs consider two cases. Task Allocation for Annotation Accuracy tries to maximize the overall annotation accuracy of a fixed-size data set through batch assignment of workers to tasks. Re-Active Learning instead seeks to directly construct an accurate ML classifier through a balanced mix of annotator requests to re-label old examples or label new ones. In both cases the PIs propose models based on decision-theoretic methods, e.g., partially observable Markov decision processes (POMDPs) and multi-armed bandits. The PIs propose to integrate these methods in the Information Omnivore, a long-lived software agent that combines planning and execution, acts in the real world, and learns a model of its environment. The Omnivore will allow large-scale longitudinal studies of the algorithms and, as a byproduct, will generate NLP training data that will greatly assist a large community of other researchers.
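
To make the allocation problem concrete, the sketch below (Python) shows one simple controller of the general kind described above: a Thompson-sampling bandit that keeps a Beta posterior over each worker's unknown accuracy and routes the next question to the worker whose sampled accuracy is highest, which naturally balances exploring unproven workers against exploiting reliable ones. This is an illustrative assumption, not the project's actual POMDP-based model; the worker identifiers and the feedback signal are hypothetical.

import random

class WorkerBandit:
    """Thompson-sampling question allocator (illustrative sketch only)."""

    def __init__(self, worker_ids):
        # Beta(1, 1) prior over each worker's unknown labeling accuracy.
        self.posterior = {w: [1, 1] for w in worker_ids}

    def pick_worker(self):
        # Sample an accuracy from each posterior and route the next question
        # to the best draw: uncertain workers occasionally win (exploration),
        # workers with a strong track record usually win (exploitation).
        draws = {w: random.betavariate(a, b)
                 for w, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def record(self, worker, correct):
        # Bayesian update after checking the answer, e.g., against a gold question.
        a, b = self.posterior[worker]
        self.posterior[worker] = [a + int(correct), b + int(not correct)]

# Hypothetical usage: route 100 questions among three workers.
bandit = WorkerBandit(["w1", "w2", "w3"])
for _ in range(100):
    worker = bandit.pick_worker()
    correct = random.random() < 0.8   # placeholder for real gold-question feedback
    bandit.record(worker, correct)

A fuller treatment along the lines the abstract describes would also model per-question difficulty as a latent variable rather than treating every question as interchangeable.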

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 13)
A. Liu, S. Soderland, J. Bragg, C.H. Lin, X. Ling, D.S. Weld, "Effective Crowd Annotation for Relation Extraction," 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), June 2016.
J. Bragg, "Self-Improving Crowdsourcing: Near-Effortless Design of Adaptive Distributed Work," PhD thesis, 2018.
C.H. Lin, Mausam, D.S. Weld, "Re-Active Learning: Active Learning with Relabeling," AAAI Conference on Artificial Intelligence, February 2016.
C. Lin, Mausam, "Active Learning with Unbalanced Classes & Example-Generated Queries," AAAI Conference on Human Computation, 2018.
D.S. Weld, G. Bansal, "Intelligible Artificial Intelligence," arXiv e-prints, March 2018.
D. Weld, G. Bansal, "The Challenge of Crafting Intelligible Intelligence," Communications of the ACM, 2018.
G. Bansal, D.S. Weld, "A Coverage-Based Utility Model for Identifying Unknown Unknowns," AAAI, 2018.
J. Bragg, Mausam, "Sprout: Crowd-Powered Task Design for Crowdsourcing," ACM Symposium on User Interface Software and Technology (UIST '18), 2018.
J. Bragg, Mausam, D.S. Weld, "Optimal Testing for Crowd Workers," 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), May 2016.
J. Ferguson, C. Lockard, "Semi-Supervised Event Extraction with Paraphrase Clusters," NAACL-HLT, June 2018.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Machine learning (ML) algorithms are transforming many areas of science, business, and society. But before ML can be used, it must be trained by a data scientist using labeled training data. For example, to use ML to recognize stop signs in a self-driving car, one must first have humans label thousands or millions of images, manually determining whether a stop sign is present and drawing a boundary around the sign. Because this training data must typically be created by humans, the labeling process, called data annotation, has become a formidable bottleneck, slowing the adoption of machine learning in many areas.

Funded by this grant, we have developed a suite of methods that improve the annotation process, dramatically reducing its cost. Many of the methods are enhancements of a technique called active learning, which allows the computer to choose which examples should be labeled (rather than labeling a random selection) and how many times each should be labeled. Other methods optimize the training process by which annotators are taught how to annotate and are tested to make sure their annotations are of sufficient quality. Taken together, our methods can reduce the cost of annotation by a factor of four to five while improving the quality of the resulting ML classifier.
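
As a concrete illustration of the "how many times should an example be labeled" decision, the following minimal Python sketch (an assumption-laden simplification, not the algorithm from our papers) uses an entropy heuristic to decide whether the next annotation request should re-label an already-annotated example whose crowd votes are in dispute, or label a new example the current classifier is most unsure about. The threshold and data structures are hypothetical.

import math

def entropy(p):
    """Binary entropy; higher values mean the current estimate is less certain."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def choose_action(labeled_votes, unlabeled_scores, relabel_threshold=0.9):
    """Decide whether to re-label an old example or label a new one.

    labeled_votes: dict mapping example id -> list of crowd votes (0/1).
    unlabeled_scores: dict mapping example id -> classifier P(label = 1).
    Returns ("relabel", id) or ("label_new", id).
    """
    # Most disputed labeled example: votes closest to a 50/50 split.
    def disagreement(votes):
        return entropy(sum(votes) / len(votes))
    noisiest = max(labeled_votes, key=lambda i: disagreement(labeled_votes[i]))

    # Most uncertain unlabeled example under the current classifier.
    most_uncertain = max(unlabeled_scores, key=lambda i: entropy(unlabeled_scores[i]))

    # Spend the next question on re-labeling only when an existing label set is
    # very noisy; otherwise acquire a fresh label, as in classic active learning.
    if disagreement(labeled_votes[noisiest]) > relabel_threshold:
        return ("relabel", noisiest)
    return ("label_new", most_uncertain)

# Hypothetical usage: one disputed item and two fresh candidates.
print(choose_action({"e1": [1, 0, 1, 0]}, {"e2": 0.55, "e3": 0.95}))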


Last Modified: 08/07/2018
Modified by: Daniel S Weld
