Award Abstract # 1443054
CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: TRUSTEES OF INDIANA UNIVERSITY
Initial Amendment Date: September 5, 2014
Latest Amendment Date: June 2, 2020
Award Number: 1443054
Award Instrument: Standard Grant
Program Manager: Amy Walton
awalton@nsf.gov
 (703)292-4538
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2014
End Date: September 30, 2021 (Estimated)
Total Intended Award Amount: $5,000,000.00
Total Awarded Amount to Date: $5,283,170.00
Funds Obligated to Date: FY 2014 = $5,000,000.00
FY 2015 = $51,969.00
FY 2016 = $49,894.00
FY 2017 = $52,395.00
FY 2018 = $53,785.00
FY 2019 = $55,127.00
FY 2020 = $20,000.00
History of Investigator:
  • Geoffrey Fox (Principal Investigator)
    vxj6mb@virginia.edu
  • Madhav Marathe (Co-Principal Investigator)
  • Shantenu Jha (Co-Principal Investigator)
  • Judy Fox (Co-Principal Investigator)
  • Fusheng Wang (Co-Principal Investigator)
Recipient Sponsored Research Office: Indiana University
107 S INDIANA AVE
BLOOMINGTON
IN  US  47405-7000
(317)278-3473
Sponsor Congressional District: 09
Primary Place of Performance: Indiana University
901 E. 10th Street
Bloomington
IN  US  47408-3912
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): YH86RTW2YVJ4
Parent UEI:
NSF Program(s): Tribal College & Univers Prog,
EDUCATION AND WORKFORCE,
Data Cyberinfrastructure
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
01001718DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
04002021DB NSF Education & Human Resource
Program Reference Code(s): 7433, 8048, 9251
Program Element Code(s): 174400, 736100, 772600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Many scientific problems depend on the ability to analyze and compute on large amounts of data. This analysis often does not scale well; its effectiveness is hampered by the increasing volume, variety and rate of change (velocity) of big data. This project will design, develop and implement building blocks that enable a fundamental improvement in the ability to support data intensive analysis on a broad range of cyberinfrastructure, including that supported by NSF for the scientific community. The project will integrate features of traditional high-performance computing, such as scientific libraries, communication and resource management middleware, with the rich set of capabilities found in the commercial Big Data ecosystem. The latter includes many important software systems such as Hadoop, available from the Apache open source community. A collaboration between university teams at Arizona, Emory, Indiana (lead), Kansas, Rutgers, Virginia Tech, and Utah provides the broad expertise needed to design and successfully execute the project. The project will engage scientists and educators with annual workshops and activities at discipline-specific meetings, both to gather requirements for and feedback on its software. It will include under-represented communities with summer experiences, and will develop curriculum modules that include demonstrations built as 'Data Analytics as a Service.'

The project will design and implement a software Middleware for Data-Intensive Analytics and Science (MIDAS) that will enable scalable applications with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Further, this project will design and implement a set of cross-cutting high-performance data-analysis libraries; SPIDAL (Scalable Parallel Interoperable Data Analytics Library) will support new programming and execution models for data-intensive analysis in a wide range of science and engineering applications. The project addresses major data challenges in seven different communities: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics. The project libraries will have the same beneficial impact on data analytics that scientific libraries such as PETSc, MPI and ScaLAPACK have had for supercomputer simulations. These libraries will be implemented to be scalable and interoperable across a range of computing systems including clouds, clusters and supercomputers.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 92)
Abeykoon, Vibhatha and Fox, Geoffrey and Kim, Minje and Ekanayake, Saliya and Kamburugamuve, Supun and Govindarajan, Kannan and Wickramasinghe, Pulasthi and Perera, Niranda and Widanage, Chathura and Uyar, Ahmet, et al. "Stochastic Gradient Descent Based Support Vector Machines Training Optimization on Big Data and HPC Frameworks" Concurrency & Computation: Practice & Experience , 2019
Adiga, A and Venkatramanan, S and Schlitt, J and Peddireddy, A and Dickerman, A and Bura, A and Warren, A and Klahn, B and Mao, C and Xie, D and Machi, D and Raymond, E and Meng, F and Barrow, G and Mortveit, H and Chen, J and Walke, J and Goldstein, J, et al. "Evaluating the impact of international airline suspensions on the early global spread of COVID-19" medRxiv , 2020 https://doi.org/10.1101/2020.02.20.20025882
Adiga, A and Wang, L and Sadilek, A and Tendulkar, A and Venkatramanan, S and Vullikanti, A and Aggarwal, G and Talekar, A and Ben, X and Chen, J and Lewis, B and Swarup, S and Tambe, M and Marathe, M "Interplay of global multi-scale human mobility, social distancing, government interventions, and COVID-19 dynamics" medRxiv , 2020
Adiga, Abhijin and Kuhlman, Chris and Marathe, Madhav and Ravi, S. and Rosenkranz, Daniel and Stearns, Richard and Vullikanti, Anil "Bounds and Complexity Results for Learning Coalition-Based Interaction Functions in Networked Social Systems" Proceedings of the AAAI Conference on Artificial Intelligence , v.34 , 2020 https://doi.org/10.1609/aaai.v34i04.5710
Adiga, Abhijin and Kuhlman, Chris J and Marathe, Madhav V and Mortveit, Henning S and Ravi, S S and Vullikanti, Anil "Graphical dynamical systems and their applications to bio-social systems" International Journal of Advances in Engineering Sciences and Applied Mathematics , v.11 , 2019 , p.153--171 10.1007/s12572-018-0237-6
Adiga, Aniruddha and Chen, Jiangzhuo and Marathe, Madhav and Mortveit, Henning and Venkatramanan, Srinivasan and Vullikanti, Anil "Data-Driven Modeling for Different Stages of Pandemic Response" Journal of the Indian Institute of Science , v.100 , 2020 https://doi.org/10.1007/s41745-020-00206-0
Adiga, Aniruddha and Dubhashi, Devdatt and Lewis, Bryan and Marathe, Madhav and Venkatramanan, Srinivasan and Vullikanti, Anil "Mathematical Models for COVID-19 Pandemic: A Comparative Analysis" Journal of the Indian Institute of Science , v.100 , 2020 https://doi.org/10.1007/s41745-020-00200-6
Luckow, A and Mantha, P and Jha, S "Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?" IPDPS , 2015
Basak, Arinjoy and Cadena, Jose and Marathe, Achla and Vullikanti, Anil "Detection of Spatiotemporal Prescription Opioid Hot Spots With Network Scan Statistics: Multistate Analysis" JMIR Public Health and Surveillance , 2019
Asch, M and Moore, T and Badia, R and Beck, M and Beckman, P and Bidot, T and Bodin, F and Cappello, F and Choudhary, A and de Supinski, B and others "Big data and extreme-scale computing: Pathways to Convergence-Toward a shaping strategy for a future software and data ecosystem for scientific inquiry" The International Journal of High Performance Computing Applications , v.32 , 2018 , p.435-479
Baig, Furqan and Gao, Chao and Teng, Dejun and Kong, Jun and Wang, Fusheng "Accelerating Spatial Cross-Matching on CPU-GPU Hybrid Platform With CUDA and OpenACC" Frontiers in Big Data , v.3 , 2020 10.3389/fdata.2020.00014

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

 

The SPIDAL (Scalable Parallel Interoperable Data Analytics Library) project began in Fall 2014 and completed in Fall 2020, with outreach activities continuing in 2021. The perspectives summary [1] describes the 2020 status; previous work through September 2018 is summarized in a book chapter [3] with extensive references, which in turn builds on our 21-month report [4]. Our project-wide workshop paper [2] was an early identification of the importance of AI surrogates for simulations. Institutions and key people involved were Arizona State (Beckstein), Indiana (Fox, Qiu, von Laszewski), Kansas (Paden), Rutgers (Jha), Stony Brook (Wang), Virginia (Marathe, Vullikanti), and Utah (Cheatham).

Architecture

The project was built around community-driven High Performance Big Data applications using HPC, distributed systems, network science, GIS, and machine/deep learning. It involved cyberinfrastructure, algorithms, and applications across seven participating organizations. The project has an overall architecture built around the twin concepts of HPC-ABDS (High-Performance Computing Enhanced Apache Big Data Stack) software and a classification of Big Data applications, the Ogres, that defined the key qualities exhibited by applications and required to be supported in software. These ideas led to a sophisticated discussion of Big Data and Big Simulation convergence and of HPC-Cloud convergence. The original Big Data Ogres work was a collaboration between Indiana University and the NIST Public Big Data Working Group, which collected 54 use cases, each with 26 properties. The Ogres are a set of 50 features that categorize applications and allow one to identify common classes such as Global Machine Learning (GML) and Local Machine Learning (LML). GML is highly suitable for HPC systems, while the very common LML and MapReduce categories also perform well on more commodity systems. Notably, the "Streaming" feature appeared in 80% of the NIST applications.
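The GML/LML distinction above is essentially one of communication pattern. The following is an illustrative sketch (not SPIDAL code), using toy 2-D points and a single k-means step: a GML update must exchange summary statistics between all nodes, while an LML computation needs no communication at all.

```python
# Illustrative sketch (not SPIDAL code): the communication pattern that
# separates Global Machine Learning (GML) from Local Machine Learning (LML),
# shown with one k-means step over toy 2-D points split into partitions,
# as the data would be distributed across cluster nodes.

def local_stats(partition, centers):
    """Per-partition sums and counts for each center -- the only thing a
    GML step needs to exchange between nodes (an allreduce), not raw data."""
    sums = [[0.0, 0.0] for _ in centers]
    counts = [0] * len(centers)
    for x, y in partition:
        j = min(range(len(centers)),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
        sums[j][0] += x
        sums[j][1] += y
        counts[j] += 1
    return sums, counts

def gml_kmeans_step(partitions, centers):
    """GML: combine per-partition statistics (a simulated allreduce) so
    every node ends up with the same updated global centers."""
    total_sums = [[0.0, 0.0] for _ in centers]
    total_counts = [0] * len(centers)
    for part in partitions:
        sums, counts = local_stats(part, centers)
        for j in range(len(centers)):
            total_sums[j][0] += sums[j][0]
            total_sums[j][1] += sums[j][1]
            total_counts[j] += counts[j]
    return [[s[0] / c, s[1] / c] if c else list(centers[j])
            for j, (s, c) in enumerate(zip(total_sums, total_counts))]

def lml_means(partitions):
    """LML: each partition fits its own model (here just a mean) with no
    inter-node communication at all -- embarrassingly parallel."""
    return [[sum(p[0] for p in part) / len(part),
             sum(p[1] for p in part) / len(part)] for part in partitions]

partitions = [[(0.0, 0.0), (10.0, 10.0)], [(1.0, 1.0), (11.0, 11.0)]]
print(gml_kmeans_step(partitions, [[0.0, 0.0], [10.0, 10.0]]))
# one shared global model: [[0.5, 0.5], [10.5, 10.5]]
print(lml_means(partitions))
# one local model per partition: [[5.0, 5.0], [6.0, 6.0]]
```

The allreduce of small per-partition summaries, rather than raw data, is what makes GML algorithms a good fit for HPC interconnects.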

Cyberinfrastructure

Our approach to data-intensive applications relies on the Apache Big Data Stack (ABDS) for the core software building blocks, adding an interface layer, MIDAS (the Middleware for Data-Intensive Analytics and Science), that enables scalable applications with the performance of HPC (High-Performance Computing) and the rich functionality of the commodity ABDS. Here we developed major HPC enhancements to the ABDS software, including Harp, based on Hadoop, and Cylon/Twister2, based on Heron, Spark, and Flink, for both batch and streaming scenarios. Pilot jobs from Rutgers were very successful in resource management and scheduling for high-throughput parallel computing on NSF and DoE systems. We contributed new techniques for achieving high performance across systems coded in C++, Java, and Python. MIDAS allows our libraries to be scalable and interoperable across a range of computing systems, including clouds, clusters, and supercomputers. We also recognized [2] and contributed to two important broad categories: HPCforML (CIforAI) and MLforHPC (AIforCI).
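The pilot-job idea mentioned above can be sketched framework-free. This is a conceptual illustration, not the actual Rutgers pilot-job (RADICAL-Pilot) API: one batch-queue wait acquires an allocation, and the pilot's own scheduler then back-fills it with many small tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_pilot(tasks, slots):
    """Pilot-job pattern: a single batch-queue wait buys `slots` cores;
    the pilot's internal scheduler then dispatches many small tasks onto
    them without going back to the machine's batch system."""
    with ThreadPoolExecutor(max_workers=slots) as pilot:  # the "pilot" allocation
        return list(pilot.map(lambda task: task(), tasks))

# Without a pilot, each of these 100 tasks would be a separate batch job,
# paying the queue wait 100 times; here that cost is paid once.
tasks = [(lambda i=i: i * i) for i in range(100)]
results = run_with_pilot(tasks, slots=8)
print(results[:5])  # [0, 1, 4, 9, 16]
```

Decoupling resource acquisition from task scheduling in this way is what makes the pattern effective for high-throughput workloads on batch-scheduled NSF and DoE systems.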

Community Applications and Algorithms

Another major project product was a cross-cutting high-performance data-analysis library, SPIDAL (Scalable Parallel Interoperable Data Analytics Library). The library has four components: a) a core library covering well-established functionality such as optimization and clustering; b) parallel graph and network algorithms; c) analysis of biomolecular simulations (high-performance versions of existing libraries from Utah and Arizona State); and d) image processing in both Polar Science and Pathology.

The project also led to significant algorithmic advances in machine learning methods for networks, including motif detection, anomaly detection, explainability of clustering, deep learning for epidemic forecasting (TDEFSI, in the MLforHPC category), and the foundations of dynamical systems on networks. We supported the mitigation of the coronavirus outbreak by simulating different spreading scenarios and possible interventions. For Polar Science, we developed operational ML/DL to locate ice sheet boundaries and snow layers from radar data. In Public Health GIS, we researched and implemented spatial big-data queries for opioid epidemic prevention and intervention, while for Pathology, we developed DL-based image analysis tools for image segmentation, 3D registration, reconstruction, and spatial analysis. For the major Biomolecular Simulation community, SPIDAL developed PMDA, which parallelizes the widely used MDAnalysis Python package for MD (Molecular Dynamics) trajectory analysis. In this area, our recent MLforHPC research has shown surrogates that improve molecular dynamics simulation performance by very large factors for both short time scales (using recurrent neural nets) and long time scales (with fully connected networks). This broad impact was enhanced by the over 50 REU undergraduate students mentored by our project over its full duration.
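PMDA parallelizes trajectory analysis with a split-apply-combine pattern over frames. The sketch below illustrates that pattern only; the 2-D coordinates and the toy RMSD helper are made up for illustration, whereas real analyses operate on MDAnalysis atom groups and trajectory readers.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def rmsd(frame, reference):
    """Toy per-frame root-mean-square deviation from a reference structure
    (2-D points stand in for atomic coordinates)."""
    n = len(reference)
    return math.sqrt(sum((fx - rx) ** 2 + (fy - ry) ** 2
                         for (fx, fy), (rx, ry) in zip(frame, reference)) / n)

def split_apply_combine(trajectory, reference, n_workers=4):
    """Split frames into contiguous blocks, apply the per-frame analysis
    to each block in parallel, then combine the block results in frame
    order -- the pattern PMDA uses to parallelize MDAnalysis analyses."""
    size = math.ceil(len(trajectory) / n_workers)
    blocks = [trajectory[i:i + size] for i in range(0, len(trajectory), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_block = pool.map(lambda b: [rmsd(f, reference) for f in b], blocks)
    return [v for block in per_block for v in block]

# A reference structure and frames that drift away from it by t units.
reference = [(0.0, 0.0), (1.0, 0.0)]
trajectory = [[(t, 0.0), (1.0 + t, 0.0)] for t in range(5)]
print(split_apply_combine(trajectory, reference))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Because frames are independent for analyses like this, the apply step scales with the number of workers and only the small per-frame results need to be gathered.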

 

Project-wide References

[1] "Summary Perspectives of the SPIDAL Project NSF #1443054 from 2014-2020," http://dx.doi.org/10.13140/RG.2.2.16245.65764

[2] "Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation," in HPDC Workshop at IPDPS 2019, Rio de Janeiro, 2019 https://arxiv.org/abs/1902.10810

[3] "Contributions to High-Performance Big Data Computing," in Future Trends of HPC in a Disruptive Scenario, Grandinetti, L., Joubert, G.R., Michielsen, K., Mirtaheri, S.L., Taufer, M., Yokota, R., Ed. IOS, 2019 http://dx.doi.org/10.13140/RG.2.2.25192.11528

[4] "Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science NSF14-43054 Progress Report. A 21 month Project Report," Sep. 2016 http://dx.doi.org/10.13140/RG.2.2.23559.47524

Last Modified: 02/02/2022
Modified by: Geoffrey C Fox
