NSF Award Search: Award # 2238693 - CAREER: Statistically-Sound Knowledge Discovery from Data

Award Abstract # 2238693

CAREER: Statistically-Sound Knowledge Discovery from Data

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	AMHERST COLLEGE, TRUSTEES OF
Initial Amendment Date:	June 13, 2023
Latest Amendment Date:	July 16, 2024
Award Number:	2238693
Award Instrument:	Continuing Grant
Program Manager:	Sorin Draghici, sdraghic@nsf.gov, (703) 292-2232, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date:	October 1, 2023
End Date:	September 30, 2028 (Estimated)
Total Intended Award Amount:	$600,322.00
Total Awarded Amount to Date:	$319,654.00
Funds Obligated to Date:	FY 2023 = $205,324.00; FY 2024 = $114,330.00
History of Investigator:	Matteo Riondato (Principal Investigator) mriondato@amherst.edu
Recipient Sponsored Research Office:	Amherst College, 155 S PLEASANT ST, AMHERST, MA, US 01002-2234, (413) 542-2804
Sponsor Congressional District:	02
Primary Place of Performance:	Amherst College, 155 S PLEASANT ST, AMHERST, MA, US 01002-2234
Primary Place of Performance Congressional District:	02
Unique Entity Identifier (UEI):	KDRLUT71AFM5
Parent UEI:
NSF Program(s):	Info Integration & Informatics
Primary Program Source:	01002324DB NSF RESEARCH & RELATED ACTIVIT; 01002425DB NSF RESEARCH & RELATED ACTIVIT; 01002526DB NSF RESEARCH & RELATED ACTIVIT; 01002627DB NSF RESEARCH & RELATED ACTIVIT; 01002728DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1045, 7364
Program Element Code(s):	736400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Methods for knowledge discovery from data (e.g., for extracting patterns or finding anomalies) have found their way into research labs in the life and biological sciences and into industries such as cybersecurity. In these fields, the statistical validity of the results produced by these methods is paramount: false discoveries cannot be tolerated. Current methods do not offer such stringent statistical guarantees. This project develops algorithms for statistically-sound Knowledge Discovery from Data. It transforms the field by shifting the goal of the Knowledge Discovery process from extracting information about the available data to gaining new understanding of the noisy, random process that generates the data. The proposed methods contribute towards a faster and higher-throughput scientific pipeline by allowing scientists and practitioners to efficiently analyze large, rich datasets and to trust the results of the analysis. Researchers can then focus on their discipline-specific research tasks without worrying about computational or statistical considerations. The project includes collaborations with a local museum and a local public library to analyze data about their collections of historic materials, and with a cybersecurity company to develop methods for fast detection of network attacks with few false positives. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

Research in knowledge discovery has mostly focused on understanding the available data, rather than the process that generated it. In the few cases where hypothesis testing was used to assess the results (mostly for simple patterns), only simplistic null models were considered, and the testing employed low-statistical-power approaches (e.g., the Bonferroni correction) to control only one measure of false discovery, the Family-Wise Error Rate. This project is transformative because it will develop efficient methods for evaluating a wide variety of results (e.g., patterns, anomalies, graph/vertex/edge properties, and more) obtained from large, rich datasets (e.g., transactional datasets, graphs, and time series), using realistic null models that are more appropriate for these tasks and better encode available knowledge of the data-generating process. We will create novel efficient procedures to sample from such models, both approximate (e.g., Markov chain Monte Carlo) and exact, and combine them with modern resampling-based multiple testing methods, in a multiple-hypothesis-first approach that also controls the (marginal) False Discovery Rate.
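
The contrast drawn above between controlling the Family-Wise Error Rate and controlling the False Discovery Rate can be illustrated with a minimal sketch. It is not the project's method: the abstract describes resampling-based procedures, whereas this sketch assumes the classical Benjamini-Hochberg step-up procedure as a stand-in for FDR control, and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure; controls the FDR at level alpha for independent tests.

    Assumption: used here only as a familiar FDR procedure, not as the
    resampling-based approach named in the abstract.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, one per candidate pattern (illustrative only).
p_values = [0.001, 0.008, 0.012, 0.02, 0.2, 0.6]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

On these illustrative values the FDR-controlling procedure rejects more hypotheses than the Bonferroni correction at the same level, which is the sense in which Bonferroni is described as a low-power approach.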

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Preti, Giulia and De Francisci Morales, Gianmarco and Riondato, Matteo "Alice and the Caterpillar: A more descriptive null model for assessing data mining results" Knowledge and Information Systems, v.66, 2024 https://doi.org/10.1007/s10115-023-02001-6

Preti, Giulia and De Francisci Morales, Gianmarco and Riondato, Matteo "Impossibility result for Markov chain Monte Carlo sampling from microcanonical bipartite graph ensembles" Physical Review E, v.109, 2024 https://doi.org/10.1103/PhysRevE.109.L053301
