
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | September 7, 2018 |
Latest Amendment Date: | September 7, 2018 |
Award Number: | 1838071 |
Award Instrument: | Standard Grant |
Program Manager: |
Sylvia Spengler
sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | December 1, 2018 |
End Date: | November 30, 2023 (Estimated) |
Total Intended Award Amount: | $1,000,000.00 |
Total Awarded Amount to Date: | $1,000,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
77 MASSACHUSETTS AVE CAMBRIDGE MA US 02139-4301 (617)253-1000 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
77 Massachusetts Cambridge MA US 02139-4307 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Big Data Science &Engineering |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Current approaches to big data and accompanying computational methods have left behind critical applications where the data is not a collection of individual points, but rather whole geometric objects. Such applications include medical imaging, LiDAR for self-driving cars, and single-cell RNA sequencing, to name a few. Transferring the overwhelming success of simpler data processing and statistical techniques to this regime requires not only large datasets, but also suitable models and algorithms for analysis of this more general type of data. The theory of optimal transport has proven valuable to address these limitations thanks to recent advances on the computational front. Yet, understanding optimal transport as a statistical tool is still in its infancy. This project aims at developing a "geometric data analysis" toolbox based on optimal transport to tackle these new datasets. This proposal will help create a common language to interact and collaborate across disciplines. Much of this research will be integrated in this curriculum and made available through MIT OpenCourseWare. This proposal will also enable rich interdisciplinary training of PhD and undergraduate students.
The proposed methods are built around the rich mathematical theory of optimal transport (OT). This theory provides a framework for the development of new methods for geometric data analysis in addition to their rigorous statistical and computational analysis. The nascent theory of computational optimal transport is still largely dissociated from statistics, and many methods do not account properly for sampling and measurement noise. To avoid the pitfalls of overfitting, this proposal singularly and systematically takes a statistical approach to geometric data analysis. With an understanding of the theoretical advantages and drawbacks of OT for statistical modeling, it will lead to scalable OT algorithms with strong statistical guarantees. A tangible outcome of this proposal is a cohesive toolbox extending not only averaging but also regression, classification, clustering, and other notions from classical statistics in a fashion that captures global geometric features of data. It will have a direct impact on various applications in analysis of not only medical images but also point clouds gathered by LiDAR for self-driving cars, sequences of gene expressions produced by single-cell RNA sequencing, and other diverse yet large-scale sources of data. These datasets contain millions of entities but resist application of standard statistical procedures; current state-of-the-art techniques for their analysis are ad-hoc, not generalizable, and fail to reach the quality achieved by "big data" tools in other domains. Educational impact will be made by incorporating this work in new degree programs in statistics at MIT.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The past decade has witnessed a fundamental transition in our approach to data-driven decision-making. This paradigm shift is fueled by easy access to vast computational, data collection, and storage resources. Standing in for hand-crafted and expensive statistical experiments, a deluge of data is available at a scale on which even straightforward statistical techniques yield remarkably nontrivial insight. But, current approaches to "big data" and accompanying computational resources have left behind critical applications in which the data is not expressed as vectors of integer values or points in Euclidean space. Transferring the overwhelming success of simpler data processing and statistical techniques to this regime requires not only large datasets but also suitable models and algorithms for the analysis of this more general type of data.
A theoretically well-posed and empirically strong alternative relevant to the applications above uses optimal transport and Wasserstein distances to construct notions of averages, variances, and other statistical constructions in a fashion that is compatible with high-dimensional and curved data. Yet, understanding optimal transport as a statistical tool is still in its infancy. Hence, this project aimed to extend the basic building blocks of optimal transport to a complete statistical theory of geometric data analysis, which can be used for rigorous and computationally efficient analysis of geometrically-structured but potentially heterogeneous data.
This project led to countless breakthroughs in statistical and computational optimal transport, as detailed in the annual reports. Some selected highlights include the following:
-
Efficient, scalable algorithms and optimization techniques for computing Wasserstein distances and solving derived problems (e.g., computing barycenters and splines in optimal transport spaces), which can be applied to high-dimensional settings in part thanks to new neural network machinery.
-
Fundamental statistical analysis of models for computing optimal transport distances/matchings, Wasserstein barycenters, Gromov-Wasserstein distances/matchings, variational approximations, and other related objects from samples.
-
Extensions of optimal transport to include latent structural assumptions like smoothness, low-rank, low-dimensionality, sparsity, robustness, and fairness, as well as new approaches to inverse problems in Wasserstein space.
-
Applications in computational biology, medical imaging, computer animation, Bayesian inference, computer vision, model fusion, sampling, machine learning, and many other target domains.
Results from these studies were published in numerous peer-reviewed journals and presented at internationally-recognized conferences and colloquia.
Beyond the research aims of this project, funding from this grant catalyzed collaboration between the PIs’ research groups, which interacted through joint research discussions and other means. The PIs and their team members also produced, presented, and distributed tutorials and other introductions to optimal transport and related topics, increasing the amount of approachable materials available in this growing field. Additional broader impacts included mentorship of students and interns and founding of a new summer program targeted at bringing students from underrepresented/underserved groups in computing and mathematics into the pool of graduate school applicants.
Last Modified: 01/03/2024
Modified by: Justin M Solomon
Please report errors in award information by writing to: awardsearch@nsf.gov.