
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | January 30, 2019 |
Latest Amendment Date: | January 30, 2019 |
Award Number: | 1850052 |
Award Instrument: | Standard Grant |
Program Manager: |
Tracy Kimbrel
CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2019 |
End Date: | June 30, 2021 (Estimated) |
Total Intended Award Amount: | $174,074.00 |
Total Awarded Amount to Date: | $174,074.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
1400 WASHINGTON AVE ALBANY NY US 12222-0100 (518)437-4974 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
NY US 12222-1000 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Algorithmic Foundations |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
With Big Data comes the need for succinct data summaries that are easy to visualize and understand. Topological Data Analysis (TDA) offers a particular suite of computational tools for summarizing and visualizing shape in high-dimensional data sets. These tools have been used in recent years to guide new discoveries in cancer research, neuroscience, materials science, image analysis, and many areas where shape, broadly construed, is of importance. However, by summarizing data in a succinct way certain important differences between data sets can often go undetected. This project provides a new mathematical framework for precisely measuring how lossy the most popular methods in TDA are. By developing new ways for quantifying how large-scale differences go undetected, the project offers enrichments of current topological methods to obtain novel data science tools with greater distinguishing power. The awarded funds will go primarily to fund a graduate student to aid the PI in basic research and development of these enriched topological summaries. Computational experiments on data sets of public interest, e.g. time-varying socio-economic indicators and weather data, will be carried out to test performance of these tools against current state of the art methods. Additionally, the PI will incorporate these novel methods into the data science curriculum at the PI's host institution and recruit students from historically under-represented groups to be trained as the nation's next generation of data scientists.
This project is an ambitious extension of earlier work undertaken by the PI and provides a targeted attack on the inverse problem for the main objects of study in topological data analysis: the merge tree, which tracks how connected components of the sub-level set of a function evolves; Reeb graphs, which tracks connected components of the fiber of a function; and the barcode/persistence diagram, which is a collection of intervals/points in the plane that represent an algorithmic pairing of critical points of a function on a space. The PI showed in earlier work how merge trees determine the associated barcode and provided a precise enumeration of how many distinct merge trees have the same barcode. By identifying functions on the real line that are related by an orientation preserving coordinate transformation, the PI showed how a novel enrichment of the merge tree---the chiral merge tree---faithfully captures these equivalence classes of functions and offers exponential distinguishing power over the barcode for time-series analysis. This project aims to carry out similar analysis for functions on surfaces, with an eye toward improving the classification performance of persistent homology in image analysis, and for Reeb graphs, to better integrate with level-set persistent homology. By carrying out a careful study of inverse problems in each of these settings, novel enrichments analogous to the chiral merge tree will be developed. Additionally, to make these enriched topological summaries useful for data classification tasks, novel metrics will be defined for comparing each of these summaries and algorithms will be developed for the efficient computation of these metrics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Topological Data Analysis (TDA) has grown tremendously as a rich toolkit for analyzing data sets where shape is an important feature. The success of TDA comes largely from its ability to take a complicated, potentially high-dimensional, data set and summarize it using topological descriptors known as merge trees, Reeb graphs, or persistence barcodes (also known as persistence diagrams). These descriptors are compact, lightweight summaries of a data set that are stable with respect to perturbations and certain types of noise. The passage from data to these summaries is sometimes called the ?TDA pipeline?. This project emerged from the observation that qualitatively different data sets can produce identical merge trees and barcodes, when passed through the TDA pipeline; see Figure 1 for a cartoon example. The goals of the project were to produce new enriched topological summaries (ETS) that stand upstream of traditional topological summaries (TS) that can distinguish these data sets as well data sets of practical interest, all without sacrificing stability or computability; see Figure 2.
This project was successful in several regards. First, the PI and their collaborators were able to identify several enrichments of traditional TDA by first counting how many different data sets produced the same merge tree or barcode. The notion of ?data set? and ?different? was varied across a broad class of examples, including Morse functions and embeddings of shapes in three dimensions. The project advanced knowledge of the theory underlying TDA methods and produced new connections between TDA, (co)sheaf theory, combinatorics and geometric group theory. Additionally, one of the first combinatorial results that distinguishes merge trees from phylogenetic trees was proved as a result of this work, thereby opening the door to further avenues of exploration in the geometry of the space of all possible merge trees. Preliminary results on doing statistics on the space of merge trees and barcodes were also provided.
Perhaps most significantly, one of the main ETS developed by this project, the Decorated Merge Tree (DMT), was identified as a new lightweight and stable summary of data sets where features such as clusters and circles can be tracked in tandem. Both theoretical and computational definitions were provided and their implementation in Python is now freely available on GitHub. As an illustration of this new tool see Figure 3. In that figure, three synthetic point cloud examples are shown to the left, their persistence diagrams are shown in the middle---which are nearly identical for the three point clouds, and their decorated merge trees are shown on the right. Variations on this example are shown in Figure 4. The PI and their collaborators were able to define several distances on these DMTs of varying computability and have demonstrated their improved clustering power. Figure 5 shows how the six data sets from Figure 4 would be clustered using traditional methods, which are most clearly grouped using our new method. These methods also provide feature localization when studying time series, such as heart rate data (Figure 6), and in image data, such as in fMRI data.
Several graduate students received support from this grant to augment their mentorship as highly skilled mathematicians and programmers. At least one student has written a PhD thesis based on the work performed on this project and several postdocs at other institutions were able to augment their training through collaboration with the PI and their graduate students. The PI continues to teach data science courses at their host institution where these methods are brought into the classroom so that undergraduate and masters students are exposed to the latest methods in TDA.
Last Modified: 10/27/2021
Modified by: Justin M Curry
Please report errors in award information by writing to: awardsearch@nsf.gov.