NSF Award Search: Award # 1838071 - BIGDATA:F: Statistical and Computational Optimal Transport for Geometric Data Analysis

Award Abstract # 1838071

BIGDATA:F: Statistical and Computational Optimal Transport for Geometric Data Analysis

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date:	September 7, 2018
Latest Amendment Date:	September 7, 2018
Award Number:	1838071
Award Instrument:	Standard Grant
Program Manager:	Sylvia Spengler sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	December 1, 2018
End Date:	November 30, 2023 (Estimated)
Total Intended Award Amount:	$1,000,000.00
Total Awarded Amount to Date:	$1,000,000.00
Funds Obligated to Date:	FY 2018 = $1,000,000.00
History of Investigator:	Justin Solomon (Principal Investigator) jsolomon@mit.edu Philippe Rigollet (Co-Principal Investigator)
Recipient Sponsored Research Office:	Massachusetts Institute of Technology 77 MASSACHUSETTS AVE CAMBRIDGE MA US 02139-4301 (617)253-1000
Sponsor Congressional District:	07
Primary Place of Performance:	Massachusetts Institute of Technology 77 Massachusetts Cambridge MA US 02139-4307
Primary Place of Performance Congressional District:	07
Unique Entity Identifier (UEI):	E2NYLCDML6V1
Parent UEI:	E2NYLCDML6V1
NSF Program(s):	Big Data Science &Engineering
Primary Program Source:	01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	062Z, 8083
Program Element Code(s):	808300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Current approaches to big data and accompanying computational methods have left behind critical applications where the data is not a collection of individual points, but rather whole geometric objects. Such applications include medical imaging, LiDAR for self-driving cars, and single-cell RNA sequencing, to name a few. Transferring the overwhelming success of simpler data processing and statistical techniques to this regime requires not only large datasets, but also suitable models and algorithms for analysis of this more general type of data. The theory of optimal transport has proven valuable to address these limitations thanks to recent advances on the computational front. Yet, understanding optimal transport as a statistical tool is still in its infancy. This project aims at developing a "geometric data analysis" toolbox based on optimal transport to tackle these new datasets. This proposal will help create a common language to interact and collaborate across disciplines. Much of this research will be integrated in this curriculum and made available through MIT OpenCourseWare. This proposal will also enable rich interdisciplinary training of PhD and undergraduate students.

The proposed methods are built around the rich mathematical theory of optimal transport (OT). This theory provides a framework for the development of new methods for geometric data analysis in addition to their rigorous statistical and computational analysis. The nascent theory of computational optimal transport is still largely dissociated from statistics, and many methods do not account properly for sampling and measurement noise. To avoid the pitfalls of overfitting, this proposal singularly and systematically takes a statistical approach to geometric data analysis. With an understanding of the theoretical advantages and drawbacks of OT for statistical modeling, it will lead to scalable OT algorithms with strong statistical guarantees. A tangible outcome of this proposal is a cohesive toolbox extending not only averaging but also regression, classification, clustering, and other notions from classical statistics in a fashion that captures global geometric features of data. It will have a direct impact on various applications in analysis of not only medical images but also point clouds gathered by LiDAR for self-driving cars, sequences of gene expressions produced by single-cell RNA sequencing, and other diverse yet large-scale sources of data. These datasets contain millions of entities but resist application of standard statistical procedures; current state-of-the-art techniques for their analysis are ad-hoc, not generalizable, and fail to reach the quality achieved by "big data" tools in other domains. Educational impact will be made by incorporating this work in new degree programs in statistics at MIT.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 69)

Show All

Abulnaga, S. Mazdak and Stein, Oded and Golland, Polina and Solomon, Justin "Symmetric Volume Maps: Order-invariant Volumetric Mesh Correspondence with Free Boundary" ACM Transactions on Graphics , v.42 , 2023 https://doi.org/10.1145/3572897 Citation Details

Beugnot, Gaspard and Genevay, Aude and Greenewald, Kristjan and Solomon, Justin "Improving Approximate Optimal Transport Distances using Quantization" Uncertainty in Artificial Intelligence , 2021 Citation Details

Boix-Adsera, Enric and Lawrence, Hannah and Stepaniants, George and Rigollet, Philippe "GULP: a prediction-based metric between representation" Advances in neural information processing systems , v.35 , 2022 Citation Details

Breeur, M and Stepaniants, G and Rigollet, P and Viallon, V "Optimal transport for automatic alignment of untargeted metabolomic data" eLife , 2024 Citation Details

Chewi, Sinho and "Optimal dimension dependence of the Metropolis-Adjusted Langevin Algorithm" Proceedings of Thirty Fourth Conference on Learning Theory , v.134 , 2021 Citation Details

Chewi, Sinho and Clancy, Julien and Le Gouic, Thibaut and Rigollet, Philippe and Stepaniants, George and Stromme, Austin "Fast and Smooth Interpolation on Wasserstein Space" Proceedings of Machine Learning Research , v.130 , 2021 https://doi.org/ Citation Details

Chewi, Sinho and Gerber, Patrik and Rigollet, Philippe and Turner, Paxton "Gaussian discrepancy: A probabilistic relaxation of vector balancing" Discrete Applied Mathematics , v.322 , 2022 https://doi.org/10.1016/j.dam.2022.08.007 Citation Details

Chewi, Sinho and Gerber, Patrik R. and Lu, Chen and Le Gouic, Thibaut and Rigollet, Philippe "The query complexity of sampling from strongly log-concave distributions in one dimension" Proceedings of Thirty Fifth Conference on Learning Theory , v.178 , 2022 Citation Details

Claici, Sebastian and Yurochkin, Mikhail and Ghosh, Soumya and Solomon, Justin "Model Fusion with Kullback-Leibler Divergence" International Conference on Machine Learning , 2020 https://doi.org/ Citation Details

DeFord, Daryl and Duchin, Moon and Solomon, Justin "A Computational Approach to Measuring Vote Elasticity and Competitiveness" Statistics and Public Policy , v.7 , 2020 https://doi.org/10.1080/2330443X.2020.1777915 Citation Details

DeFord, Daryl and Duchin, Moon and Solomon, Justin "Recombination: A Family of Markov Chains for Redistricting" Harvard Data Science Review , 2020 https://doi.org/10.1162/99608f92.eb30390f Citation Details

(Showing: 1 - 10 of 69)

Show All

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The past decade has witnessed a fundamental transition in our approach to data-driven decision-making. This paradigm shift is fueled by easy access to vast computational, data collection, and storage resources. Standing in for hand-crafted and expensive statistical experiments, a deluge of data is available at a scale on which even straightforward statistical techniques yield remarkably nontrivial insight. But, current approaches to "big data" and accompanying computational resources have left behind critical applications in which the data is not expressed as vectors of integer values or points in Euclidean space. Transferring the overwhelming success of simpler data processing and statistical techniques to this regime requires not only large datasets but also suitable models and algorithms for the analysis of this more general type of data.

A theoretically well-posed and empirically strong alternative relevant to the applications above uses optimal transport and Wasserstein distances to construct notions of averages, variances, and other statistical constructions in a fashion that is compatible with high-dimensional and curved data. Yet, understanding optimal transport as a statistical tool is still in its infancy. Hence, this project aimed to extend the basic building blocks of optimal transport to a complete statistical theory of geometric data analysis, which can be used for rigorous and computationally efficient analysis of geometrically-structured but potentially heterogeneous data.

This project led to countless breakthroughs in statistical and computational optimal transport, as detailed in the annual reports. Some selected highlights include the following:

Efficient, scalable algorithms and optimization techniques for computing Wasserstein distances and solving derived problems (e.g., computing barycenters and splines in optimal transport spaces), which can be applied to high-dimensional settings in part thanks to new neural network machinery.
Fundamental statistical analysis of models for computing optimal transport distances/matchings, Wasserstein barycenters, Gromov-Wasserstein distances/matchings, variational approximations, and other related objects from samples.
Extensions of optimal transport to include latent structural assumptions like smoothness, low-rank, low-dimensionality, sparsity, robustness, and fairness, as well as new approaches to inverse problems in Wasserstein space.
Applications in computational biology, medical imaging, computer animation, Bayesian inference, computer vision, model fusion, sampling, machine learning, and many other target domains.

Results from these studies were published in numerous peer-reviewed journals and presented at internationally-recognized conferences and colloquia.

Beyond the research aims of this project, funding from this grant catalyzed collaboration between the PIs’ research groups, which interacted through joint research discussions and other means. The PIs and their team members also produced, presented, and distributed tutorials and other introductions to optimal transport and related topics, increasing the amount of approachable materials available in this growing field. Additional broader impacts included mentorship of students and interns and founding of a new summer program targeted at bringing students from underrepresented/underserved groups in computing and mathematics into the pool of graduate school applicants.

Last Modified: 01/03/2024
Modified by: Justin M Solomon

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error