
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | July 22, 2021 |
Latest Amendment Date: | July 22, 2021 |
Award Number: | 2136744 |
Award Instrument: | Standard Grant |
Program Manager: | Hector Munoz-Avila, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2021 |
End Date: | September 30, 2023 (Estimated) |
Total Intended Award Amount: | $180,000.00 |
Total Awarded Amount to Date: | $180,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 6823 SAINT CHARLES AVE NEW ORLEANS LA US 70118-5665 (504)865-4000 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 6823 St Charles Avenue New Orleans LA US 70118-5698 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Cutting-edge science relies on scientists' ability to sift through and access the massive amounts of data that are being produced by the latest research. Much of that data is stored in online databases and is searchable only by using specific, scientific terms, like keywords, tags, or descriptions. If someone doesn't know exactly the right terms to use, they often can't access all the data that might be useful for their research. By using mathematical approaches for information retrieval in a new way, this project will study whether a powerful search tool, called content-based search, can be modified for scientific data. If successful, this project will free data users from needing to know exactly which keywords to use, transforming how scientists are able to access and share data and creating new opportunities for scientists with vastly different expertise to work together.
One particularly promising way to describe the content of scientific data is through a dataset's topology. Therefore, this project will develop approaches to compute topological similarity that are smaller, faster, and more scalable than previously thought possible, with the goal of creating a method for cross-cutting, content-based search of scientific data. Specifically, the investigators will develop a learned-hash function to convert a dataset's persistence diagram - the common encoding of its topology - to a simple binary code. This hash will be trained such that the bitwise distance between codes will maintain a measure of topological similarity between datasets. This will convert topological comparisons from the current state of an expensive bottleneck to one with nominal processing costs that can scale to large database queries. Initially, this project will focus on binary codes that maintain clusters and neighborhoods, ultimately developing codes that are rank or semi-metric preserving. The investigators will also explore strategies for training a learned-hash function on synthetic data, with the goal of developing a fully domain-oblivious approach to content-based search.
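The idea of comparing persistence diagrams through the bitwise distance between binary codes can be illustrated with a minimal sketch. This is not the investigators' learned hash: it substitutes untrained random-hyperplane hashing, and the persistence-histogram vectorization, the function names, and all parameters (`bins`, `max_persistence`, `n_bits`, `seed`) are assumptions made only for illustration.

```python
import random


def vectorize_diagram(diagram, bins=8, max_persistence=1.0):
    """Turn a persistence diagram, given as (birth, death) pairs, into a
    fixed-length histogram of persistence values (death - birth).
    A crude stand-in for a real vectorization (assumption)."""
    hist = [0.0] * bins
    for birth, death in diagram:
        # Clamp persistence into [0, max_persistence].
        p = min(max(death - birth, 0.0), max_persistence)
        idx = min(int(p / max_persistence * bins), bins - 1)
        hist[idx] += 1.0
    return hist


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per output bit.
    A learned hash would train these weights instead (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Pack the sign of each projection into one integer binary code."""
    code = 0
    for plane in planes:
        dot = sum(w * x for w, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code


def hamming(a, b):
    """Bitwise (Hamming) distance between two integer codes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    planes = make_hyperplanes(n_bits=16, dim=8)
    # Two toy diagrams with hypothetical (birth, death) values.
    d1 = [(0.0, 0.5), (0.1, 0.2), (0.3, 0.9)]
    d2 = [(0.0, 0.55), (0.1, 0.25), (0.3, 0.8)]
    c1 = hash_code(vectorize_diagram(d1), planes)
    c2 = hash_code(vectorize_diagram(d2), planes)
    print(f"code 1: {c1:016b}")
    print(f"code 2: {c2:016b}")
    print("Hamming distance:", hamming(c1, c2))
```

Once codes are computed, each comparison is a single XOR and bit count, which is why a query over a large database of precomputed codes costs very little. Whether the bitwise distance tracks topological similarity depends entirely on how the hash is trained, which this untrained sketch does not address.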
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project gauged the feasibility of comparing scientific datasets based on their fundamental connectedness and structure, often called a dataset's topology. Previous techniques for comparing datasets by this structure are too computationally expensive to be practical. Therefore, this project developed new technologies for fast topological comparisons. Using advanced techniques in machine learning, we cast a dataset's topological features into a single number, such that comparing two of these numbers gives a measure similar to a direct comparison of the datasets' topological differences. This 'hash' of the topological structure scales to massive datasets and reduces comparison times by orders of magnitude, down to only milliseconds. Using state-of-the-art deep learning, we also improved the discriminative capabilities of topological comparisons: by learning which features are important, our approach outperformed other approaches in classification accuracy. Finally, using advanced approaches in explainable machine learning, we presented, for the first time, a measure of the importance of a topological feature. When the dataset is an image, this importance can be highlighted directly in the data, illuminating the features that drive an image's classification. The advances of this project have applications in accurate, automatic medical image classification, classification of underwater features detected with high-resolution sonar, and detection of extreme wind events in wind farm simulations.
Last Modified: 01/30/2024
Modified by: Brian Summa