
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: | University of Utah |
Initial Amendment Date: | June 29, 2020 |
Latest Amendment Date: | June 29, 2020 |
Award Number: | 2008688 |
Award Instrument: | Standard Grant |
Program Manager: | Karl Wimmer, kwimmer@nsf.gov, (703) 292-2095, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2020 |
End Date: | June 30, 2024 (Estimated) |
Total Intended Award Amount: | $349,999.00 |
Total Awarded Amount to Date: | $349,999.00 |
Funds Obligated to Date: | |
History of Investigator: | Aditya Bhaskara (Principal Investigator) |
Recipient Sponsored Research Office: | University of Utah, 201 PRESIDENTS CIR, SALT LAKE CITY, UT, US 84112-9049, (801) 581-6903 |
Sponsor Congressional District: | |
Primary Place of Performance: | School of Computing, 50 S Central Campus Dr, Salt Lake City, UT, US 84112-9205 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Special Projects - CCF, Algorithmic Foundations |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Modern machine-learning applications aim to solve difficult computational problems accurately, quickly, and at scale. This has led to significant algorithmic challenges, compounded by practical considerations such as robustness to noise and the distributed nature of data. The goal of the project is to develop a formal understanding of when efficient learning is possible and to develop novel algorithmic insights. The techniques developed in the project will lead to progress in approximation algorithms, optimization, sublinear algorithms, and computational complexity. The project includes a plan to develop courses that teach undergraduate and graduate students to formally reason about machine-learning systems and to understand their power and limitations. The courses will help train the next generation of the workforce and will be offered at the University of Utah, with much of the material publicly accessible.
The project aims to address two core questions in reasoning about machine learning. The first concerns the computational hardness of learning problems. For problems such as sparse coding and learning low-depth neural networks, all known algorithms require strong structural assumptions in order to obtain learning guarantees. The project will study methods (for these and other problems) that weaken these assumptions while obtaining weaker yet practically relevant guarantees. The second question concerns learning in online arrival models, motivated by recommender systems and signal processing. Here, many known theoretical results fall short when data is noisy or only partially observed, and the project will develop formal models and algorithmic results for these settings.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Clustering, dimensionality reduction, and "data reduction" are all fundamental tools in data analysis and machine learning, each serving as a cornerstone for understanding and organizing complex datasets. Clustering aims to group a collection of points so that points within the same group are "closer" to one another (as measured by some chosen metric) than to points in different groups. This is critical in applications including market segmentation, social network analysis, and image recognition. Dimensionality reduction (often achieved via low-rank approximation) seeks to identify the underlying low-dimensional structure in a dataset, assuming such structure exists; it is central to applications such as data compression, visualization, and noise reduction. Data reduction, often achieved via "coresets", shrinks a dataset while preserving its essential structure, allowing ML and other models to be built more efficiently.
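For concreteness, here is a minimal sketch of the three primitives on synthetic data, using numpy and scikit-learn. Everything here (the data, the parameter choices, and the uniform-sampling "coreset") is an illustrative assumption rather than the project's methods; in particular, coresets with provable guarantees typically require careful importance sampling rather than the uniform sample shown.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # 1,000 synthetic points in 20 dimensions

# Clustering: group points so that within-cluster distances are small.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("k-means cost (within-cluster sum of squares):", km.inertia_)

# Dimensionality reduction via low-rank approximation: the best rank-r
# approximation in Frobenius norm comes from the truncated SVD.
r = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_r = (U[:, :r] * s[:r]) @ Vt[:r]
print("rank-3 reconstruction error:", np.linalg.norm(X - X_r))

# Data reduction: cluster a small weighted sample instead of the full set.
# (A uniform sample is NOT a provable coreset; it only illustrates the idea.)
idx = rng.choice(len(X), size=100, replace=False)
km_small = KMeans(n_clusters=5, n_init=10, random_state=0).fit(
    X[idx], sample_weight=np.full(100, len(X) / 100))
cost_full = ((X - km_small.cluster_centers_[km_small.predict(X)]) ** 2).sum()
print("cost of full data on sample-trained centers:", cost_full)
```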
All three problems have been rigorously explored from the perspective of "approximation algorithms" in several settings. The goal is to design computationally efficient methods that produce solutions whose "cost" (defined via metrics such as within-cluster variance or reconstruction error) is comparable to that of an optimal solution. However, practical constraints often render classical algorithms less suitable for real-world scenarios. This project focused on advancing the state of the art by addressing several practical challenges:
1. Online Arrival Models: In many real-world settings, data points arrive sequentially, necessitating algorithms that make a decision as each point arrives. This contrasts with offline models, where the entire dataset is available upfront. We developed algorithms tailored to this online model, ensuring robust and efficient performance under streaming conditions (a minimal sketch of the setting appears after this list).
2. Generalized Low-Rank Approximation: Extending beyond classical settings, we addressed significant generalizations (a sketch for the matrix-completion case appears after this list), including:
- Weighted versions, where points or features carry varying levels of importance.
- Lp-norm approximations, which provide flexibility by using alternative norms (e.g., the L1 or L-infinity norm) to measure reconstruction error, accommodating a broader range of applications.
- Matrix completion, which tackles scenarios with missing data, recovering the underlying low-rank structure despite incomplete observations.
3. Fairness Constraints: Real-world datasets often contain subpopulations or demographic groups that require equitable treatment. We formulated and solved versions of clustering and low-rank approximation that explicitly incorporate fairness constraints, aiming to ensure that the "costs" (e.g., clustering or approximation errors) are distributed equitably across groups and mitigating biases inherent in traditional methods (a sketch of one such objective appears below).
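To illustrate the online-arrival setting of item 1: each point is processed exactly once, on arrival, and an irrevocable decision is made. The sketch below uses a simple distance-threshold rule; the rule, the `radius` parameter, and the synthetic stream are illustrative assumptions, not the algorithms developed in the project.

```python
import numpy as np

def online_cluster(stream, radius):
    """Assign each arriving point to a nearby center, or open a new one."""
    centers, assignments = [], []
    for x in stream:
        if centers:
            d = np.linalg.norm(np.asarray(centers) - x, axis=1)
            j = int(d.argmin())
            if d[j] <= radius:        # close enough to an existing center
                assignments.append(j)
                continue
        centers.append(x)             # far from all centers: open a new one
        assignments.append(len(centers) - 1)
    return np.asarray(centers), assignments

rng = np.random.default_rng(1)
stream = rng.normal(size=(500, 2))    # points revealed one at a time
centers, labels = online_cluster(stream, radius=1.5)
print(len(centers), "centers opened over the stream")
```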
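For item 2, a generic alternating-least-squares heuristic handles weighted low-rank approximation; with 0/1 weights (an observation mask) it becomes matrix completion. The sketch below is a standard textbook heuristic under these simplifications, not the project's algorithms or guarantees; all names and parameters are illustrative.

```python
import numpy as np

def als_complete(M, mask, r, iters=50, reg=1e-3):
    """Weighted rank-r fit of M: minimize sum of mask * (M - U V^T)^2."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, r))
    V = rng.normal(scale=0.1, size=(n, r))
    I = reg * np.eye(r)               # small ridge term for stability
    for _ in range(iters):
        # Fix V; each row of U is a small weighted least-squares solve.
        for i in range(m):
            w = mask[i]
            U[i] = np.linalg.solve((V * w[:, None]).T @ V + I,
                                   V.T @ (w * M[i]))
        # Fix U; symmetric update for each row of V.
        for j in range(n):
            w = mask[:, j]
            V[j] = np.linalg.solve((U * w[:, None]).T @ U + I,
                                   U.T @ (w * M[:, j]))
    return U @ V.T

# Usage: recover a random rank-2 matrix from ~50% of its entries.
rng = np.random.default_rng(2)
M_true = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
mask = (rng.random(M_true.shape) < 0.5).astype(float)
M_hat = als_complete(M_true * mask, mask, r=2)
print("error on unobserved entries:",
      np.linalg.norm((1 - mask) * (M_hat - M_true)))
```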
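For item 3, one common formulation is "socially fair" clustering, which replaces the total clustering cost with the maximum average cost over demographic groups. The sketch below only evaluates the two objectives on synthetic data with an assumed group labeling; it is not the project's exact formulation or algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (900, 5)),     # majority group
               rng.normal(3, 1, (100, 5))])    # minority group
group = np.array([0] * 900 + [1] * 100)

centers = KMeans(n_clusters=4, n_init=10,
                 random_state=0).fit(X).cluster_centers_
d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(1)  # cost per point

per_group = [d2[group == g].mean() for g in (0, 1)]
print("average cost overall:", d2.mean())
print("average cost per group:", per_group)    # minority often pays more
print("fair (min-max over groups) objective:", max(per_group))
```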
Beyond these central goals, the project yielded additional notable results:
- We established precise bounds on the size of "spanners": compact graph representations of datasets that preserve pairwise distances up to a small factor. Spanners serve as an efficient and interpretable data summary, enabling faster downstream analysis (a textbook construction is sketched after this list).
- A significant offshoot of this work involved constructing "coresets" (reduced, representative subsets of the data) that satisfy both performance guarantees and fairness constraints. This aspect is particularly valuable for resource-constrained environments where processing the full dataset is infeasible.
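As referenced above, a standard way to build a t-spanner is the greedy construction: scan candidate edges in increasing order of length, and keep an edge only if its endpoints are not already connected within t times its length. The sketch below is this textbook construction on synthetic points; it is illustrative only, since the project's results concern bounds on spanner size rather than this particular algorithm.

```python
import heapq
import itertools
import numpy as np

def greedy_spanner(points, t):
    """Greedy t-spanner over the complete Euclidean graph on `points`."""
    n = len(points)
    dist = lambda i, j: float(np.linalg.norm(points[i] - points[j]))
    adj = {i: [] for i in range(n)}

    def spanner_dist(src, dst, cutoff):
        # Dijkstra in the current spanner, pruned once paths exceed cutoff.
        best, heap = {src: 0.0}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > best.get(u, np.inf) or d > cutoff:
                continue
            for v, w in adj[u]:
                nd = d + w
                if nd < best.get(v, np.inf):
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
        return np.inf

    spanner = []
    for i, j in sorted(itertools.combinations(range(n), 2),
                       key=lambda e: dist(*e)):
        w = dist(i, j)
        if spanner_dist(i, j, t * w) > t * w:   # not yet t-approximated
            adj[i].append((j, w)); adj[j].append((i, w))
            spanner.append((i, j))
    return spanner

pts = np.random.default_rng(4).random((60, 2))
edges = greedy_spanner(pts, t=1.5)
print(len(edges), "edges kept vs", 60 * 59 // 2, "in the complete graph")
```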
The impact of this project is evident through its dissemination at prestigious academic conferences, including NeurIPS, ICML, AISTATS, and FAccT; these venues reflect the relevance of the research to both theoretical foundations and practical applications.
Additionally, the project supported the academic growth of two PhD students, who played pivotal roles in advancing these research directions; they led the development of several publications, gaining expertise and contributing to the broader scientific community. The project also inspired ongoing collaborations aimed at deepening our understanding of clustering, matrix approximation, and fairness in data analysis. Finally, it supported curriculum development at the University of Utah for two courses: Theory of Machine Learning (Spring 2022) and Algorithms, Randomness, and Optimization (Spring 2024). Creating more advanced yet accessible courses has been a priority of the PI (and the department) to help improve student training, and both courses helped achieve this goal.
Last Modified: 01/21/2025
Modified by: Aditya Bhaskara