Award Abstract # 2109988
EAGER: High Performance Algorithms for Interactive Data Science at Scale

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: NEW JERSEY INSTITUTE OF TECHNOLOGY
Initial Amendment Date: February 23, 2021
Latest Amendment Date: March 12, 2025
Award Number: 2109988
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703)292-7498
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: March 1, 2021
End Date: December 31, 2026 (Estimated)
Total Intended Award Amount: $187,447.00
Total Awarded Amount to Date: $2,400,492.00
Funds Obligated to Date: FY 2021 = $257,444.00
FY 2022 = $933,297.00
FY 2023 = $932,401.00
FY 2025 = $277,350.00
History of Investigator:
  • David Bader (Principal Investigator)
    bader@njit.edu
Recipient Sponsored Research Office: New Jersey Institute of Technology
323 DR MARTIN LUTHER KING JR BLVD
NEWARK
NJ  US  07102-1824
(973)596-5275
Sponsor Congressional District: 10
Primary Place of Performance: New Jersey Institute of Technology
University Heights
Newark
NJ  US  07102-1982
Primary Place of Performance Congressional District: 10
Unique Entity Identifier (UEI): SGBMHQ7VXNH5
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01002223DB NSF RESEARCH & RELATED ACTIVIT
01002223RB NSF RESEARCH & RELATED ACTIVIT
01002324RB NSF RESEARCH & RELATED ACTIVIT
01002526RB NSF RESEARCH & RELATED ACTIVIT
01002122RB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7916, 7942, 8237, 9237
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

A real-world challenge in data science is developing interactive methods for quickly analyzing new data sets that are potentially of massive scale. This award will design and implement fundamental algorithms for high performance computing solutions that enable interactive analysis of massive data sets. Based on the widely used data types and structures of strings, sets, matrices and graphs, this methodology will produce efficient and scalable software for three classes of fundamental algorithms that will drastically improve performance on a wide range of real-world queries or directly implement frequent queries. These innovations will allow the broad community to move massive-scale data exploration from time-consuming batch processing to interactive analyses that give a data analyst the ability to comprehensively, deeply and efficiently explore the insights and science in real-world data sets. By enabling a growing number of developers to easily manipulate large data sets, this work will greatly enlarge the data science community and find much broader use in new communities. Materials from this project will be included in graduate and undergraduate course curricula. Women, high school students and other groups underrepresented in STEM will be especially encouraged to participate in this research activity.

This project focuses on three important classes of algorithms for data analytics: 1) suffix array construction, 2) 'treap' construction and 3) distributed memory join algorithms, useful for analyzing large-scale strings, implementing random search in large string data sets, and generating new relations, respectively. These fundamental algorithms serve as the cornerstone for supporting interactive data science at scale. Building on theoretical results and systematic algorithm design, the project will develop a novel symbiotic optimization methodology that combines theoretical analysis, data structure features, and typical data distribution features to significantly improve the practical performance of the proposed algorithms. To evaluate and demonstrate their effectiveness, the algorithms will be implemented in, and contributed to, an open source NumPy-like software framework that aims to provide productive data discovery tools on massive, dozens-of-terabytes data sets by bringing together the productivity of Python with world-class high performance computing.
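
As an illustration of the first algorithm class named above, the following is a minimal single-machine sketch of a suffix array and a pattern query over it. It is not the project's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is the naive sort-based one, chosen for readability rather than the high performance parallel construction the award targets.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix they begin.

    Hypothetical helper for illustration only; the slicing key makes this
    far slower than the linear-time or parallel constructions used in practice.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of `pattern` in `text`.

    Uses two binary searches over the suffix array: suffixes that begin with
    `pattern` occupy one contiguous range of the sorted order.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                                 # suffix start positions in sorted order
    print(find_occurrences(text, sa, "ana"))  # positions where "ana" occurs
```

Once the array is built, each query costs a logarithmic number of prefix comparisons, which is the property that makes suffix arrays attractive for repeated interactive queries over large strings.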

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 35)
Bader, David "Fast Triangle Counting", 2023
Bader, David A "Fast Triangle Counting", 2023 https://doi.org/10.1109/HPEC58863.2023.10363539
Bader, David A. and Burkhardt, Paul "A Simple and Efficient Algorithm for Finding Minimum Spanning Tree Replacement Edges" Journal of Graph Algorithms and Applications, v.26, 2022 https://doi.org/10.7155/jgaa.00609
Bader, David A and Li, Fuhuan and Ganeshan, Anya and Gundogdu, Ahmet and Lew, Jason and Rodriguez, Oliver Alvarado and Du, Zhihui "Triangle Counting Through Cover-Edges", 2023 https://doi.org/10.1109/HPEC58863.2023.10363465
Bader, David and Li, Fuhuan and Ganeshan, Anya and Gundogdu, Ahmet and Lew, Jason and Alvarado Rodriguez, Oliver and Du, Zhihui "Triangle Counting Through Cover-Edges", 2023
Buschmann, Fernando Vera and Du, Zhihui and Bader, David A "Enhanced Knowledge Graph Attention Networks for Efficient Graph Learning", 2024
Buschmann, Fernando Vera and Pauliuchenka, Palina and Oh, Ethan and Kao, Bai Chien and DiValentin, Louis and Bader, David A "Graph-Based Profiling of Dependency Vulnerability Remediation", 2025
Cappelletti, Luca and Fontana, Tommaso and Green, Oded and Bader, David "Parallel Triangles and Squares Count for Multigraphs Using Vertex Covers" Lecture Notes in Computer Science, v.10476, 2023
Cappelletti, Luca and Fontana, Tommaso and Reese, Justin and Bader, David A. "Billion-scale Detection of Isomorphic Nodes", 2023 https://doi.org/10.1109/IPDPSW59300.2023.00046
Dindoost, Mohammad and Rodriguez, Oliver Alvarado and Bagchi, Sounak and Pauliuchenka, Palina and Du, Zhihui and Bader, David A "VF2-PS: Parallel and Scalable Subgraph Monomorphism in Arachne", 2024
Du, Zhihui and Alvarado Rodriguez, Oliver and Li, Fuhuan and Dindoost, Mohammad and Bader, David "Contour Algorithm for Connectivity", 2023

Please report errors in award information by writing to: awardsearch@nsf.gov.
