Award Abstract # 2239622
CAREER: Data Valuation in the Wild: Theories, Algorithms, and Applications

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: VIRGINIA POLYTECHNIC INSTITUTE & STATE UNIVERSITY
Initial Amendment Date: February 8, 2023
Latest Amendment Date: February 8, 2023
Award Number: 2239622
Award Instrument: Standard Grant
Program Manager: Victor Piotrowski
vpiotrow@nsf.gov
(703)292-5141
OAC Office of Advanced Cyberinfrastructure (OAC)
CSE Directorate for Computer and Information Science and Engineering
Start Date: February 1, 2023
End Date: January 31, 2028 (Estimated)
Total Intended Award Amount: $499,999.00
Total Awarded Amount to Date: $499,999.00
Funds Obligated to Date: FY 2023 = $499,999.00
History of Investigator:
  • Ruoxi Jia (Principal Investigator)
    ruoxijia@vt.edu
Recipient Sponsored Research Office: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
(540)231-5281
Sponsor Congressional District: 09
Primary Place of Performance: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): QDE5UHE5XD16
Parent UEI: X6KEFGLHSJX7
NSF Program(s): CAREER: FACULTY EARLY CAR DEV, Info Integration & Informatics
Primary Program Source: 01002324DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045, 7364
Program Element Code(s): 104500, 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Data are essential ingredients for building machine learning (ML) applications. The ability to quantify and measure the value of data is critical to the entire ML lifecycle: from identifying useful data sources, to setting priorities over samples during training, to interpreting why certain model behaviors emerge during deployment. The potential of data valuation has been demonstrated in many applications over the past few years. However, intermixed with these positive results is a vast array of applications for which existing data valuation techniques are not yet applicable, are too expensive to execute, or produce valuation results with substantial uncertainty. This project aims to enable data valuation to overcome applicability, scalability, and reproducibility challenges and to become a practical and reliable tool for a data-centric future. This work will have a broad impact on society by facilitating automated data quality management, designing incentives for data sharing, and improving the robustness of ML applications. This project will train undergraduate students to solve ML problems from both an algorithmic and a data quality perspective, while also creating school-age learning modules implemented at local, regional, and global scales.

The project consists of four research tasks that advance data valuation along different dimensions: 1) designing data valuation techniques that are robust to the randomness in modern ML training algorithms; 2) developing new frameworks to determine the value of data samples given limited information about downstream learning tasks; 3) investigating principled methods to value heterogeneous and streaming data; and 4) creating and open-sourcing a unified, multi-faceted evaluation platform to spur future advances in more complex data valuation. The proposed techniques are implemented and validated on a variety of high-impact real-world applications, including autonomous driving, energy-efficient buildings, and conversational artificial intelligence.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Chen, Si and Kang, Feiyang and Yu, Ning and Jia, Ruoxi "FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation", 2024
Jahagirdar, Himanshu and Wang, Jiachen T and Jia, Ruoxi "Data Valuation in the Absence of a Reliable Validation Set" Transactions on Machine Learning Research, 2024
Just, Hoang Anh and Kang, Feiyang and Wang, Tianhao and Zeng, Yi and Ko, Myeongseob and Jin, Ming and Jia, Ruoxi "LAVA: Data Valuation without Pre-Specified Learning Algorithms" The Eleventh International Conference on Learning Representations, 2023
Wang, Jiachen T and Mittal, Prateek and Jia, Ruoxi "Efficient Data Shapley for Weighted Nearest Neighbor Algorithms", 2024
Wang, Jiachen T and Yang, Tianji and Zou, James and Kwon, Yongchan and Jia, Ruoxi "Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits", 2025
Wang, Jiachen T and Zhu, Yuqing and Wang, Yu-Xiang and Jia, Ruoxi and Mittal, Prateek "A Privacy-Friendly Approach to Data Valuation", 2023
