Award Abstract # 2136744
EAGER: Scalable, Content-Based, Domain-Agnostic Search of Scientific Data through Concise Topological Representations

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: THE ADMINISTRATORS OF TULANE EDUCATIONAL FUND
Initial Amendment Date: July 22, 2021
Latest Amendment Date: July 22, 2021
Award Number: 2136744
Award Instrument: Standard Grant
Program Manager: Hector Munoz-Avila
IIS: Division of Information & Intelligent Systems
CSE: Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2021
End Date: September 30, 2023 (Estimated)
Total Intended Award Amount: $180,000.00
Total Awarded Amount to Date: $180,000.00
Funds Obligated to Date: FY 2021 = $180,000.00
History of Investigator:
  • Brian Summa (Principal Investigator)
    bsumma@tulane.edu
Recipient Sponsored Research Office: Tulane University
6823 SAINT CHARLES AVE
NEW ORLEANS
LA  US  70118-5665
(504)865-4000
Sponsor Congressional District: 01
Primary Place of Performance: Tulane University
6823 St Charles Avenue
New Orleans
LA  US  70118-5698
Primary Place of Performance Congressional District: 01
Unique Entity Identifier (UEI): XNY5ULPU8EN6
Parent UEI: XNY5ULPU8EN6
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7916, 9150
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Cutting-edge science relies on scientists' ability to sift through and access the massive amounts of data that are being produced by the latest research. Much of that data is stored in online databases and is searchable only by using specific, scientific terms, like keywords, tags, or descriptions. If someone doesn't know exactly the right terms to use, they often can't access all the data that might be useful for their research. By using mathematical approaches for information retrieval in a new way, this project will study whether a powerful search tool, called content-based search, can be modified for scientific data. If successful, this project will free data users from needing to know exactly which keywords to use, transforming how scientists are able to access and share data and creating new opportunities for scientists with vastly different expertise to work together.

One particularly promising way to describe the content of scientific data is through a dataset's topology. Therefore, this project will develop approaches to compute topological similarity that are smaller, faster, and more scalable than previously thought possible, with the goal of creating a method for cross-cutting, content-based search of scientific data. Specifically, the investigators will develop a learned hash function to convert a dataset's persistence diagram (the common encoding of its topology) to a simple binary code. This hash will be trained such that the bitwise distance between codes maintains a measure of topological similarity between datasets. This will turn topological comparison from an expensive bottleneck into an operation with nominal processing cost that can scale to large database queries. Initially, this project will focus on binary codes that maintain clusters and neighborhoods, ultimately developing codes that are rank or semi-metric preserving. The investigators will also explore strategies for training a learned hash function on synthetic data, with the goal of developing a fully domain-oblivious approach to content-based search.
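
As an illustration only, and not the investigators' implementation, the comparison step described above can be sketched as follows. The sketch assumes each dataset has already been hashed to a fixed-length binary code. The 8-bit codes, the dataset names, and the helper names are hypothetical, and the learned hash that would produce such codes from persistence diagrams is not shown.

```python
from typing import Dict, List, Tuple


def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    # XOR leaves a 1 exactly where the codes disagree; count those bits.
    return bin(code_a ^ code_b).count("1")


def rank_by_code(query: int, database: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort database entries by Hamming distance to the query code."""
    scored = [(name, hamming_distance(query, code)) for name, code in database.items()]
    # Ties are broken by name so the ordering is deterministic.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical 8-bit codes standing in for hashed persistence diagrams.
database = {
    "dataset_a": 0b10110010,
    "dataset_b": 0b10110011,
    "dataset_c": 0b01001101,
}
query = 0b10110110

ranking = rank_by_code(query, database)
for name, distance in ranking:
    print(name, distance)
```

Each comparison here is a single XOR followed by a bit count, which suggests why a bitwise distance could cost far less than comparing persistence diagrams directly. Whether the resulting ranking reflects topological similarity depends entirely on how the hash is trained.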

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Ashman, Kimberly and Zhuge, Huimin and Shanley, Erin and Fox, Sharon and Halat, Shams and Sholl, Andrew and Summa, Brian and Brown, J. Quincy "Whole slide image data utilization informed by digital diagnosis patterns" Journal of Pathology Informatics, v.13, 2022 https://doi.org/10.1016/j.jpi.2022.100113
Qin, Yu and Fasy, Brittany Terese and Wenk, Carola and Summa, Brian "A Domain-Oblivious Approach for Learning Concise Representations of Filtered Topological Spaces for Clustering" IEEE Transactions on Visualization and Computer Graphics, 2021 https://doi.org/10.1109/TVCG.2021.3114872
Qin, Yu and Johnson, Graham and Summa, Brian "Topological Guided Detection of Extreme Wind Phenomena: Implications for Wind Energy", 2023
Qin, Yu and Fasy, Brittany Terese and Wenk, Carola and Summa, Brian "Visualizing Topological Importance: A Class-Driven Approach", 2023
Richard, Thomas and Chastagnier, Yan and Szabo, Vivien and Chalard, Kevin and Summa, Brian and Thiery, Jean-Marc and Boubekeur, Tamy and Faraj, Noura "Eurographics Workshop on Visual Computing for Biology and Medicine", 2022 https://doi.org/10.2312/vcbm.20221191
Zhuge, Huimin and Summa, Brian and Hamm, Jihun and Brown, J. Quincy "Deep learning 2D and 3D optical sectioning microscopy using cross-modality Pix2Pix cGAN image translation" Biomedical Optics Express, v.12, 2021 https://doi.org/10.1364/BOE.439894

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project gauged the feasibility of comparing scientific datasets based on their fundamental connectedness and structure, often called a dataset's topology. Previous techniques for comparing datasets by this structure are too computationally expensive to be practical. Therefore, this project developed new technologies for fast topological comparisons. Using advanced techniques in machine learning, we cast a dataset's topological features into a single number, such that comparing two of these numbers gives a measure similar to a direct comparison of the datasets' topological differences. This 'hash' of the topological structure scales to massive datasets and reduces comparison times by orders of magnitude, down to only milliseconds. Using state-of-the-art deep learning, we also improved the discriminative capabilities of topological comparisons. By learning which features are important, our approach outperformed other approaches in classification accuracy. Finally, using advanced approaches in explainable machine learning, we presented, for the first time, a measure of the importance of a topological feature. When the dataset is an image, this importance can be highlighted directly in the data, illuminating the features that drive an image's classification. The advances of this project have applications in accurate, automatic medical image classification, classification of underwater features detected with high-resolution sonar, and detection of extreme wind events in wind farm simulations.


Last Modified: 01/30/2024
Modified by: Brian Summa

Please report errors in award information by writing to: awardsearch@nsf.gov.
