
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 14, 2020 |
Latest Amendment Date: | June 23, 2021 |
Award Number: | 2007836 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703) 292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2020 |
End Date: | September 30, 2025 (Estimated) |
Total Intended Award Amount: | $398,942.00 |
Total Awarded Amount to Date: | $414,942.00 |
Funds Obligated to Date: |
FY 2021 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
1500 SW JEFFERSON AVE CORVALLIS OR US 97331-8655 (541)737-4933 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Corvallis OR US 97331-2140 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
With the rise of deep learning, demand for reliably labeled data has reached unprecedented levels. Label acquisition is a highly nontrivial task: data labeling is tedious, labor-intensive, and error-prone. Crowdsourcing techniques that integrate annotations from multiple annotators to improve accuracy have been essential for labeling large-scale data. However, existing crowdsourcing techniques face pressing challenges, such as heavy annotator workload, high computational cost, and a lack of strong theoretical guarantees. This project will develop a series of analytical and computational tools, with provable guarantees, for accurately labeling massive datasets from noisy, incomplete, crowdsourced annotations. Leveraging advanced nonnegative matrix factorization theory, this project will offer solutions that are efficient and effective under critical conditions. The outcomes are expected to have broad and substantial positive impacts on the currently label-hungry artificial intelligence industry and the data annotation workforce. For example, the algorithms designed for handling structured data (e.g., speech) will benefit timely applications such as the intelligent assistants Alexa and Siri. The ability to work reliably with largely incomplete data will help design new data dispatch schemes that significantly reduce annotator workload. The project will also offer many training opportunities for undergraduate students, with an emphasis on engaging those from underrepresented groups.
In terms of theory and methods, many aspects of crowdsourced data labeling (e.g., sample complexity, noise robustness, and identifiability of the underlying statistical model) are still poorly understood. This project will provide a suite of theoretical and computational tools that advance these aspects. Specifically, the first thrust will build a coupled nonnegative matrix factorization (CNMF) framework that bridges the classic Dawid-Skene model for crowdsourcing and advanced nonnegative factor analysis theories. This will establish firm theoretical foundations for crowdsourcing under critical conditions, and lead to theory-backed algorithms that attain substantially improved sample complexity and robustness to noisy and incomplete data. The second thrust exploits domain-dependent knowledge, e.g., data structure and annotator dependence, to develop situation-aware crowdsourcing techniques with enhanced performance. The third thrust designs stochastic optimization strategies to provide scalable implementations of the CNMF framework, and evaluates the proposed methods on a variety of real-world applications. The analytical and computational tools developed in this project will provide strong provable guarantees and new algorithmic solutions for long-standing challenges in crowdsourced data labeling. In addition, the CNMF theory and algorithms are exciting new directions for computational linear algebra, whose impact can extend well beyond this project.
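To illustrate why the Dawid-Skene model lends itself to a coupled nonnegative factorization, the sketch below simulates the model and checks a standard consequence of its conditional-independence assumption: the joint distribution of two annotators' responses factors as A_m diag(d) A_l^T, where A_m and A_l are the annotators' confusion matrices and d is the class prior. This is an illustrative sketch, not the project's implementation. The symbols A_m, A_l, and d, the function names, and all numerical settings are assumptions introduced here.

```python
import numpy as np


def random_confusion(n_classes, rng, strength=3.0):
    """Hypothetical annotator: a column-stochastic, diagonally dominant matrix.

    Entry [k, c] is Pr(annotator answers k | true label is c).
    """
    A = rng.random((n_classes, n_classes)) + strength * np.eye(n_classes)
    return A / A.sum(axis=0, keepdims=True)


def simulate_dawid_skene(prior, confusions, n_items, rng):
    """Draw true labels, then each annotator's response independently given the truth."""
    n_classes = len(prior)
    truth = rng.choice(n_classes, size=n_items, p=prior)
    responses = np.empty((len(confusions), n_items), dtype=int)
    for m, A in enumerate(confusions):
        # Inverse-CDF sampling: column A[:, c] is the response distribution for class c.
        cdf = np.cumsum(A[:, truth], axis=0)          # shape (n_classes, n_items)
        u = rng.random(n_items)
        drawn = (u[None, :] > cdf).sum(axis=0)
        responses[m] = np.minimum(drawn, n_classes - 1)  # guard against rounding
    return truth, responses


def cooccurrence(resp_m, resp_l, n_classes):
    """Empirical joint distribution of two annotators' responses."""
    R = np.zeros((n_classes, n_classes))
    np.add.at(R, (resp_m, resp_l), 1.0)
    return R / len(resp_m)


rng = np.random.default_rng(0)
K = 3
prior = np.array([0.5, 0.3, 0.2])                      # assumed class prior d
confusions = [random_confusion(K, rng) for _ in range(3)]
truth, responses = simulate_dawid_skene(prior, confusions, 200_000, rng)

# Empirical pairwise statistics versus the model-implied nonnegative factorization.
R_emp = cooccurrence(responses[0], responses[1], K)
R_model = confusions[0] @ np.diag(prior) @ confusions[1].T
max_gap = np.abs(R_emp - R_model).max()
print("largest entrywise gap:", max_gap)
```

Because every annotator pair shares the same prior d and each confusion matrix appears in several pairwise products, the pairwise statistics are coupled through common nonnegative factors. Note that such statistics can be estimated from items that only a pair of annotators has labeled, which is one reason pairwise formulations are attractive when annotations are incomplete.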
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.