
Award Abstract # 2007836
III: Small: Labeling Massive Data from Noisy, Incomplete and Crowdsourced Annotations

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: OREGON STATE UNIVERSITY
Initial Amendment Date: August 14, 2020
Latest Amendment Date: June 23, 2021
Award Number: 2007836
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
(703)292-7347
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2025 (Estimated)
Total Intended Award Amount: $398,942.00
Total Awarded Amount to Date: $414,942.00
Funds Obligated to Date: FY 2020 = $398,942.00
FY 2021 = $16,000.00
History of Investigator:
  • Xiao Fu (Principal Investigator)
    xiao.fu@oregonstate.edu
Recipient Sponsored Research Office: Oregon State University
1500 SW JEFFERSON AVE
CORVALLIS
OR  US  97331-8655
(541)737-4933
Sponsor Congressional District: 04
Primary Place of Performance: Oregon State University
Corvallis
OR  US  97331-2140
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): MZ4DYXE1SL98
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923, 9251
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Alongside the rise of deep learning, the demand for reliably labeled data is higher than ever. Label acquisition is a highly nontrivial task: data labeling is tedious, labor-intensive, and prone to mistakes. Crowdsourcing techniques that integrate annotations from multiple annotators to improve accuracy have been essential for labeling large-scale data. However, existing crowdsourcing techniques face pressing challenges such as heavy annotator workload, high computational cost, and a lack of strong theoretical guarantees. This project will develop a series of analytical and computational tools, with provable guarantees, for accurately labeling massive datasets from noisy, incomplete, and crowdsourced annotations. Leveraging advanced nonnegative matrix factorization theory, this project will offer solutions that are efficient and effective under critical conditions. The outcomes are expected to have broad and substantial positive impacts on the currently label-hungry artificial intelligence industry and the data annotation workforce. For example, the algorithms designed for handling structured data (e.g., speech) will greatly benefit timely applications such as the intelligent assistants Alexa and Siri. The ability to work reliably with largely incomplete data will help design new data dispatch schemes that significantly reduce annotator workload. The project will also offer many training opportunities for undergraduate students, with an emphasis on engaging those from underrepresented groups.

In terms of theory and methods, many aspects of crowdsourced data labeling (e.g., sample complexity, noise robustness, and identifiability of the underlying statistical model) are still poorly understood. This project will provide a suite of theoretical and computational tools that advance these aspects. Specifically, the first thrust will build a coupled nonnegative matrix factorization (CNMF) framework that bridges the classic Dawid-Skene model for crowdsourcing and advanced nonnegative factor analysis theories. This will establish firm theoretical foundations for crowdsourcing under critical conditions, and lead to theory-backed algorithms that attain substantially improved sample complexity and robustness to noisy and incomplete data. The second thrust exploits domain-dependent knowledge, e.g., data structure and annotator dependence, to develop situation-aware crowdsourcing techniques for enhanced performance. The third thrust designs stochastic optimization strategies to provide scalable implementations of the CNMF framework, and evaluates the proposed methods on a variety of real-world applications. The analytical and computational tools developed in this project will provide strong provable guarantees and new algorithmic solutions for long-standing challenges in crowdsourced data labeling. In addition, the CNMF theory and algorithms open new directions for computational linear algebra, whose impact can go well beyond this project.
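
As a rough illustration of why the Dawid-Skene model connects to nonnegative factorization, the sketch below simulates annotators under that model and checks a standard consequence of it: because annotators respond independently given the true label, the joint distribution of two annotators' responses factors as A_m diag(d) A_l^T, where A_m is annotator m's confusion matrix and d is the class prior. This is not the project's CNMF algorithm, which the abstract does not specify. All function names, the three-class setup, the annotator accuracies, and the 30% observation rate are illustrative assumptions.

```python
import numpy as np


def make_confusion(accuracy, n_classes):
    """Hypothetical annotator: A[r, k] = P(response r | true label k).

    Columns sum to 1; the diagonal holds the chance of a correct response.
    """
    off = (1.0 - accuracy) / (n_classes - 1)
    return np.full((n_classes, n_classes), off) + (accuracy - off) * np.eye(n_classes)


def simulate_dawid_skene(n_items, confusions, prior, observe_prob, rng):
    """Draw true labels, then each annotator's response independently given the label.

    Each annotator labels an item only with probability observe_prob;
    unlabeled entries are stored as -1 (incomplete annotations).
    """
    n_classes = len(prior)
    truth = rng.choice(n_classes, size=n_items, p=prior)
    responses = np.empty((len(confusions), n_items), dtype=int)
    for m, A in enumerate(confusions):
        # Inverse-CDF sampling from the column of A selected by each true label.
        cdf = np.cumsum(A[:, truth], axis=0)          # shape (n_classes, n_items)
        drawn = (rng.random(n_items)[None, :] > cdf).sum(axis=0)
        responses[m] = np.minimum(drawn, n_classes - 1)  # guard against round-off
        responses[m, rng.random(n_items) >= observe_prob] = -1
    return truth, responses


def empirical_cooccurrence(resp_m, resp_l, n_classes):
    """Estimate P(annotator m says i, annotator l says j) from co-labeled items."""
    both = (resp_m >= 0) & (resp_l >= 0)
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (resp_m[both], resp_l[both]), 1.0)
    return counts / both.sum()


rng = np.random.default_rng(0)
prior = np.array([0.5, 0.3, 0.2])                       # assumed class prior d
confusions = [make_confusion(a, 3) for a in (0.9, 0.75, 0.6)]
truth, responses = simulate_dawid_skene(200_000, confusions, prior, 0.3, rng)

# Empirical pairwise statistics versus the model's nonnegative factorization.
R_hat = empirical_cooccurrence(responses[0], responses[1], 3)
R_model = confusions[0] @ np.diag(prior) @ confusions[1].T
print("max abs deviation:", np.abs(R_hat - R_model).max())
```

The point of the sketch is only that second-order annotator statistics, which can be estimated even when each annotator labels a small fraction of the items, carry the nonnegative factors (confusion matrices and prior) that a factorization-based method would try to recover.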

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Ibrahim, Shahana and Fu, Xiao "Crowdsourcing via Annotator Co-occurrence Imputation and Provable Symmetric Nonnegative Matrix Factorization" Proceedings of the 38th International Conference on Machine Learning, v.139, 2021
Ibrahim, Shahana and Fu, Xiao "Learning Mixed Membership from Adjacency Graph Via Systematic Edge Query: Identifiability and Algorithm" ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021 https://doi.org/10.1109/ICASSP39728.2021.9413541
Ibrahim, Shahana and Fu, Xiao "Mixed Membership Graph Clustering via Systematic Edge Query" IEEE Transactions on Signal Processing, v.69, 2021 https://doi.org/10.1109/TSP.2021.3109380
Ibrahim, Shahana and Nguyen, Tri and Fu, Xiao "Deep Learning From Crowdsourced Labels: Coupled Cross-Entropy Minimization, Identifiability, and Regularization" International Conference on Learning Representations, 2023
Nguyen, Tri and Ibrahim, Shahana and Fu, Xiao "Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach" Proceedings of the 40th International Conference on Machine Learning, 2023
Wolnick, Daniel Grey and Ibrahim, Shahana and Marrinan, Tim and Fu, Xiao "Deep Learning from Noisy Labels via Robust Nonnegative Matrix Factorization-Based Design" 2023 https://doi.org/10.1109/CAMSAP58249.2023.10403492

Please report errors in award information by writing to: awardsearch@nsf.gov.
