Award Abstract # 2109988
EAGER: High Performance Algorithms for Interactive Data Science at Scale

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: NEW JERSEY INSTITUTE OF TECHNOLOGY
Initial Amendment Date: February 23, 2021
Latest Amendment Date: March 12, 2025
Award Number: 2109988
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703)292-7498
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: March 1, 2021
End Date: December 31, 2026 (Estimated)
Total Intended Award Amount: $187,447.00
Total Awarded Amount to Date: $2,400,492.00
Funds Obligated to Date: FY 2021 = $257,444.00
FY 2022 = $933,297.00
FY 2023 = $932,401.00
FY 2025 = $277,350.00
History of Investigator:
  • David Bader (Principal Investigator)
    bader@njit.edu
Recipient Sponsored Research Office: New Jersey Institute of Technology
323 DR MARTIN LUTHER KING JR BLVD
NEWARK
NJ  US  07102-1824
(973)596-5275
Sponsor Congressional District: 10
Primary Place of Performance: New Jersey Institute of Technology
University Heights
Newark
NJ  US  07102-1982
Primary Place of Performance Congressional District: 10
Unique Entity Identifier (UEI): SGBMHQ7VXNH5
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01002122RB NSF RESEARCH & RELATED ACTIVIT
01002223DB NSF RESEARCH & RELATED ACTIVIT
01002324RB NSF RESEARCH & RELATED ACTIVIT
01002223RB NSF RESEARCH & RELATED ACTIVIT
01002526RB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 9237, 7942, 8237, 7916
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

A real-world challenge in data science is to develop interactive methods for quickly analyzing new data sets that are potentially of massive scale. This award will design and implement fundamental high performance computing algorithms that enable interactive analysis of massive data sets. Building on the widely used data types and structures of strings, sets, matrices and graphs, the methodology will produce efficient and scalable software for three classes of fundamental algorithms that will drastically improve performance on a wide range of real-world queries or directly implement frequent queries. These innovations will allow the broad community to move massive-scale data exploration from time-consuming batch processing to interactive analyses that let a data analyst explore real-world data sets comprehensively, deeply and efficiently. By making it easy for a growing number of developers to manipulate large data sets, the project aims to enlarge the data science community and extend its tools to new communities. Materials from this project will be included in graduate and undergraduate course curricula. Women, high school students and members of other groups underrepresented in STEM will be especially encouraged to participate in this research.

This project focuses on three important building blocks for data analytics: 1) suffix array construction, 2) 'treap' construction and 3) distributed-memory join algorithms, which are useful for analyzing large-scale strings, implementing random search in large string data sets, and generating new relations, respectively. These fundamental algorithms are the cornerstone of interactive data science at scale. Building on theoretical results and systematic algorithm design, the project will develop a novel symbiotic optimization methodology that combines theoretical analysis, data structure features, and typical data distribution features to significantly improve the practical performance of the proposed algorithms. To evaluate and demonstrate their effectiveness, the algorithms will be implemented in, and contributed to, an open source NumPy-like software framework that aims to provide productive data discovery tools for massive data sets of dozens of terabytes by combining the productivity of Python with world-class high performance computing.
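As a rough illustration of the first of the three items listed above, the sketch below builds a suffix array for a small string and uses it to locate every occurrence of a pattern. It is a minimal, sequential example using the textbook prefix-doubling method and is not the project's parallel algorithm; the function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen only for this example.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Illustrative prefix-doubling construction: suffixes are ranked by their
    first k characters, and k doubles each round until all ranks are distinct.
    """
    n = len(text)
    sa = list(range(n))
    rank = [ord(c) for c in text]  # initial rank = first character
    k = 1
    while True:
        # Sort key: rank of the first k chars, then rank of the next k chars.
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:  # every suffix has a unique rank
            return sa
        k *= 2


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Binary-search the suffix array for all positions where `pattern` starts."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Once the array is built, each query costs a logarithmic number of string comparisons rather than a scan of the whole text, which is one reason this kind of index suits interactive exploration of large string data.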

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 35)
Green, Oded and Du, Zhihui and Patel, Sanyamee and Xie, Zehui and Liu, Hang and Bader, David A. "Anti-Section Transitive Closure" The 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2021
Joseph Patchett and Zhihui Du and Fuhuan Li and David A. Bader "Triangle Centrality in Arkouda" The 26th Annual IEEE High Performance Extreme Computing Conference (HPEC), 2022
Joseph Patchett and Zhihui Du and Oliver Alvarado Rodriguez and David A. Bader "Scalable K-Truss Implementation in Arkouda" New Jersey Big Data Alliance (NJBDA) Symposium, 2022
Li, Fuhuan and Bader, David A. "A GraphBLAS Implementation of Triangle Centrality" The 25th Annual IEEE High Performance Extreme Computing Conference (HPEC), 2021 https://doi.org/10.1109/HPEC49654.2021.9622806
Oliver Alvarado Rodriguez and Zhihui Du and Joseph Patchett and Fuhuan Li and David A. Bader "Arachne: An Arkouda Package for Large-Scale Graph Analytics" The 26th Annual IEEE High Performance Extreme Computing Conference (HPEC), 2022
Patchett, Joseph T. and Du, Zhihui and Bader, David A. "K-Truss Implementation in Arkouda (Extended Abstract)" The 25th Annual IEEE High Performance Extreme Computing Conference (HPEC), 2021
Rodriguez, Oliver Alvarado and Buschmann, Fernando Vera and Du, Zhihui and Bader, David A "Property Graphs in Arachne", 2023 https://doi.org/10.1109/HPEC58863.2023.10363498
Rodriguez, Oliver Alvarado and Du, Zhihui and Bader, David "Property Graphs in Arachne", 2023
Szarnyas, Gabor and Bader, David A. and Davis, Timothy A. and Kitchen, James and Mattson, Timothy G. and McMillan, Scott and Welch, Erik "LAGraph: Linear Algebra, Network Analysis Libraries, and the Study of Graph Algorithms" 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021 https://doi.org/10.1109/IPDPSW52791.2021.00046
Vahidi, Soroush and Schieber, Baruch and Du, Zhihui and Bader, David "Parallel Longest Common SubSequence Analysis In Chapel", 2023
Vahidi, Soroush and Schieber, Baruch and Du, Zhihui and Bader, David A "Parallel Longest Common SubSequence Analysis In Chapel", 2023 https://doi.org/10.1109/HPEC58863.2023.10363472

Please report errors in award information by writing to: awardsearch@nsf.gov.
