Award Abstract # 2007941
III: Small: Collaborative Research: Algorithms, systems, and theories for exploiting data dependencies in crowdsourcing

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: IOWA STATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
Initial Amendment Date: September 4, 2020
Latest Amendment Date: October 19, 2020
Award Number: 2007941
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2024 (Estimated)
Total Intended Award Amount: $249,998.00
Total Awarded Amount to Date: $249,998.00
Funds Obligated to Date: FY 2020 = $249,998.00
History of Investigator:
  • Qi Li (Principal Investigator)
    qli@iastate.edu
Recipient Sponsored Research Office: Iowa State University
1350 BEARDSHEAR HALL
AMES
IA  US  50011-2103
(515)294-5225
Sponsor Congressional District: 04
Primary Place of Performance: Iowa State University
1138 Pearson
Ames
IA  US  50011-2207
Primary Place of Performance
Congressional District:
Unique Entity Identifier (UEI): DQDBM7FGJPC5
Parent UEI: DQDBM7FGJPC5
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Data are abundantly available to encode knowledge in many domains, such as biomedical research, online commerce, open government, education, and public health. Machine learning is a powerful tool to discover novel knowledge from data and to help individuals and organizations make informed decisions. However, machine learning needs to be bootstrapped by human-annotated knowledge, which can be expensive to obtain and may also contain human errors. The team of researchers discovers and exploits the dependencies in the data through novel methodologies that significantly reduce the cost and noise of providing critical knowledge for machine learning. The research outputs, including algorithms, systems, and theories, are sufficiently generic to benefit many domains where machine learning is applicable. By conducting this fundamental research, the team will train undergraduate and graduate students for the nation's STEM workforce.


The researchers will collaborate to develop algorithms, systems, and theories for reducing cost and noise when annotating dependent data, termed "structured annotations", to provide supervision knowledge for machine learning. While the dependencies can make data annotations costly and error-prone, the researchers view the dependencies as a useful inductive bias for selective and accurate annotations. In particular, the research team proposes a human-in-the-loop system to aid the construction of proper probabilistic graphical models to encode the dependencies. The project team combines contextual and multi-armed bandits with scalable graph inference algorithms to reduce labeling costs. Based on the graphical bandits, the team addresses budget allocation when querying labels of the same data point repeatedly for robustness. With noisy human annotations, the team formulates optimization problems and algorithms to jointly infer the annotator competences and the ground truth labels of the data. From the theoretical perspective, the project will advance active learning in crowdsourcing settings with more realistic noise distributions and will analyze regret in structured annotations. The project will result in datasets, algorithms, and a testbed system that benefit not only the core machine learning research community but also many domains that use machine learning.
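
To make the idea of jointly inferring annotator competences and ground truth labels concrete, the following is a minimal sketch of a generic iterative weighted-voting scheme. It is an illustration only, not the project's method: the function name `infer_truth`, the smoothed-accuracy weighting, and the toy data are all assumptions, and the sketch ignores the data dependencies that the project exploits.

```python
from collections import defaultdict


def infer_truth(annotations, n_iters=20):
    """Alternate between estimating labels and annotator weights.

    annotations: list of (annotator, item, label) triples.
    Returns (truths, weights). Hypothetical helper for illustration.
    """
    annotators = {a for a, _, _ in annotations}
    weights = {a: 1.0 for a in annotators}  # start with equal trust
    truths = {}
    for _ in range(n_iters):
        # Step 1: weighted vote per item using the current weights.
        votes = defaultdict(lambda: defaultdict(float))
        for a, item, label in annotations:
            votes[item][label] += weights[a]
        # Ties are broken alphabetically so the result is deterministic.
        new_truths = {
            item: max(sorted(lv), key=lv.get) for item, lv in votes.items()
        }
        # Step 2: re-estimate each annotator's competence as the
        # Laplace-smoothed fraction of labels that agree with the estimate.
        correct = defaultdict(int)
        total = defaultdict(int)
        for a, item, label in annotations:
            total[a] += 1
            correct[a] += int(label == new_truths[item])
        for a in annotators:
            weights[a] = (correct[a] + 1) / (total[a] + 2)
        if new_truths == truths:  # estimated labels stopped changing
            break
        truths = new_truths
    return truths, weights


# Toy data (assumed): annotators "a" and "b" mostly agree, "c" dissents.
toy = [
    ("a", "i1", "cat"), ("b", "i1", "cat"), ("c", "i1", "dog"),
    ("a", "i2", "dog"), ("b", "i2", "dog"), ("c", "i2", "cat"),
    ("a", "i3", "dog"), ("c", "i3", "cat"),
]
truths, weights = infer_truth(toy)
print(truths)
print(weights)
```

On the toy data, item `i3` starts as a tie between annotators "a" and "c"; once the weights reflect that "a" agrees with the consensus on the other items, the tie resolves in favor of the label from "a". Methods for structured annotations would additionally share information across dependent items, which this sketch does not attempt.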

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 11)
Chakraborty, Mohna and Kulkarni, Adithya and Li, Qi "Open-Domain Aspect-Opinion Co-Mining with Double-Layer Span Extraction" Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022 https://doi.org/10.1145/3534678.3539386
Fang, Minghong and Sun, Minghao and Li, Qi and Gong, Neil Zhenqiang and Tian, Jin and Liu, Jia "Data Poisoning Attacks and Defenses to Crowdsourcing Systems" Proceedings of the Web Conference 2021, 2021 https://doi.org/10.1145/3442381.3450066
Guo, C and Yu, H and Liu, J and Chen, C and Li, Q and Xie, S and Zhang, X "Linear Uncertainty Quantification of Graphical Model Inference", 2024
Kulkarni, Adithya and Chakraborty, Mohna and Xie, Sihong and Li, Qi "Optimal Budget Allocation for Crowdsourcing Labels for Graphs" Uncertainty in artificial intelligence, 2023
Kulkarni, Adithya and Sabetpour, Nasim and Markin, Alexey and Eulenstein, Oliver and Li, Qi "CPTAM: Constituency Parse Tree Aggregation Method" Proceedings of the SIAM International Conference on Data Mining, 2022 https://doi.org/10.1137/1.9781611977172.71
Qiao, Qiao and Li, Yuepei and Zhou, Kang and Li, Qi "Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion" 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2023
Sabetpour, Nasim and Kulkarni, Adithya and Li, Qi "OptSLA: an Optimization-Based Approach for Sequential Label Aggregation" Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Sabetpour, Nasim and Kulkarni, Adithya and Xie, Sihong and Li, Qi "Truth Discovery in Sequence Labels from Crowds" IEEE International Conference on Data Mining (ICDM), 2021 https://doi.org/10.1109/ICDM51629.2021.00065
Sium, Yonas and Kollias, Georgios and Idé, Tsuyoshi and Das, Payel and Abe, Naoki and Lozano, Aurélie and Li, Qi "Direction Aware Positional and Structural Encoding for Directed Graph Neural Networks" ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023 https://doi.org/10.1109/ICASSP49357.2023.10094964
Sium, Yonas and Li, Qi and Varshney, Kush R "Individual Fairness in Graphs Using Local and Global Structural Information" Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, v.7, 2024 https://doi.org/10.1609/aies.v7i1.31731
Wei, Ying and Li, Qi "SagDRE: Sequence-Aware Graph-Based Document-Level Relation Extraction with Adaptive Margin Loss" Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022 https://doi.org/10.1145/3534678.3539304

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Supervised learning models, such as deep networks, are trained on large amounts of high-quality labeled data. Obtaining such training data is a bottleneck to the accuracy of supervised learning. Structured predictions with complex (multi-layered, multi-typed, cross-modal, hierarchical, etc.) dependencies are frequently found helpful for natural language processing, computer vision, fraud detection, and graph data mining, while current crowdsourcing research lags behind because it only deals with much simpler dependencies. It is unknown to what extent the dependencies will hinder or help data labeling, how to design systems, methodologies, and algorithms to crowdsource dependent labels and infer the true labels from the crowdsourced labels, and what fundamental theoretical limits and guarantees hold for crowdsourcing efficacy.

In this project, the PI investigated the “structured annotation” problem, which addresses the complex dependencies in crowdsourced labeling tasks, and developed new methodologies and algorithms with theoretical guarantees for (1) selectively finding and querying data items with a fixed budget to improve annotation accuracy and (2) robust crowdsourcing ground truth inference with limited annotations, leveraging the complex data dependencies. The project led to a set of efficient and effective methods, technologies, and software systems for building reliable and sufficient training data for machine learning tasks in various domains. This project included training four PhD students. The project has resulted in over 10 publications at top conferences. Research results have been integrated into the PI’s courses and presented at the Midwest Big Data Summer School, the Data Science Workshop, and other outreach activities.


Last Modified: 01/29/2025
Modified by: Qi Li

Please report errors in award information by writing to: awardsearch@nsf.gov.
