
NSF Org: OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:
Initial Amendment Date: September 5, 2014
Latest Amendment Date: June 2, 2020
Award Number: 1443054
Award Instrument: Standard Grant
Program Manager: Amy Walton, awalton@nsf.gov, (703) 292-4538, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2014
End Date: September 30, 2021 (Estimated)
Total Intended Award Amount: $5,000,000.00
Total Awarded Amount to Date: $5,283,170.00
Funds Obligated to Date: FY 2015 = $51,969.00; FY 2016 = $49,894.00; FY 2017 = $52,395.00; FY 2018 = $53,785.00; FY 2019 = $55,127.00; FY 2020 = $20,000.00
History of Investigator:
Recipient Sponsored Research Office: 107 S Indiana Ave, Bloomington, IN, US 47405-7000, (317) 278-3473
Sponsor Congressional District:
Primary Place of Performance: 901 E. 10th Street, Bloomington, IN, US 47408-3912
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Tribal College & Univers Prog; EDUCATION AND WORKFORCE; Data Cyberinfrastructure
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT; 01001920DB NSF RESEARCH & RELATED ACTIVIT; 04002021DB NSF Education & Human Resource
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Many scientific problems depend on the ability to analyze and compute on large amounts of data. This analysis often does not scale well; its effectiveness is hampered by the increasing volume, variety, and rate of change (velocity) of big data. This project will design, develop, and implement building blocks that enable a fundamental improvement in the ability to support data-intensive analysis on a broad range of cyberinfrastructure, including that supported by NSF for the scientific community. The project will integrate features of traditional high-performance computing, such as scientific libraries and communication and resource management middleware, with the rich set of capabilities found in the commercial Big Data ecosystem. The latter includes many important software systems, such as Hadoop, available from the Apache open source community. A collaboration between university teams at Arizona, Emory, Indiana (lead), Kansas, Rutgers, Virginia Tech, and Utah provides the broad expertise needed to design and successfully execute the project. The project will engage scientists and educators through annual workshops and activities at discipline-specific meetings, both to gather requirements for and feedback on its software. It will include under-represented communities through summer research experiences and will develop curriculum modules that include demonstrations built as 'Data Analytics as a Service.'
The project will design and implement a software Middleware for Data-Intensive Analytics and Science (MIDAS) that will enable scalable applications with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Further, this project will design and implement a set of cross-cutting high-performance data-analysis libraries; SPIDAL (Scalable Parallel Interoperable Data Analytics Library) will support new programming and execution models for data-intensive analysis in a wide range of science and engineering applications. The project addresses major data challenges in seven different communities: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics. The project libraries will have the same beneficial impact on data analytics that scientific libraries such as PETSc, MPI and ScaLAPACK have had for supercomputer simulations. These libraries will be implemented to be scalable and interoperable across a range of computing systems including clouds, clusters and supercomputers.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites, whose policies may differ from this site's.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The SPIDAL (Scalable Parallel Interoperable Data Analytics Library) project began in Fall 2014 and was completed in Fall 2020, with outreach activities continuing into 2021. The perspectives paper [1] summarizes the project's status as of 2020, while earlier work through September 2018 is summarized, with extensive references, in a book chapter [3], which in turn builds on our 21-month progress report [4]. Our project-wide workshop paper [2] was an early identification of the importance of AI surrogates for simulations. The institutions and key people involved were Arizona State (Beckstein), Indiana (Fox, Qiu, von Laszewski), Kansas (Paden), Rutgers (Jha), Stony Brook (Wang), Virginia (Marathe, Vullikanti), and Utah (Cheatham).
Architecture
The project was built around community-driven High Performance Big Data biophysical applications using HPC, distributed systems, network science, GIS, and machine/deep learning. It involved cyberinfrastructure, algorithms, and applications across seven participating organizations. The overall architecture was built around the twin concepts of HPC-ABDS (High-Performance Computing Enhanced Apache Big Data Stack) software and a classification of Big Data applications, the Ogres, which defined the key qualities exhibited by applications and required to be supported in software. These ideas led to a sophisticated discussion of Big Data and Big Simulation convergence and of HPC-Cloud convergence. The original Big Data Ogres work was a collaboration between Indiana University and the NIST Public Big Data Working Group, which collected 54 use cases, each with 26 properties. The Ogres are a set of 50 features that categorize applications and allow one to identify common classes such as Global Machine Learning (GML) and Local Machine Learning (LML). GML is highly suitable for HPC systems, while the very common LML and MapReduce categories also perform well on more commodity systems. Notably, the 'Streaming' feature appeared in 80% of the NIST applications.
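To make the facet idea concrete, the sketch below tags a few applications with a handful of facet names such as GML, LML, and Streaming. The application names and facet subset are purely illustrative; this is not the official 50-facet Ogre taxonomy or any project artifact.

```python
# Illustrative sketch only: hypothetical applications tagged with a few facet
# names; the real Ogre classification uses 50 facets across several views.
from dataclasses import dataclass, field


@dataclass
class OgreProfile:
    """A Big Data application tagged with a subset of Ogre facets."""
    name: str
    facets: set = field(default_factory=set)

    def suits_hpc(self) -> bool:
        # Global Machine Learning (GML) workloads favour tightly coupled HPC;
        # Local ML (LML) and MapReduce also run well on commodity clusters.
        return "GML" in self.facets


apps = [
    OgreProfile("epidemic-forecasting", {"GML", "Streaming", "Graph"}),
    OgreProfile("pathology-image-segmentation", {"LML", "Pleasingly Parallel"}),
    OgreProfile("polar-radar-layer-tracking", {"LML", "Streaming"}),
]

streaming = [a.name for a in apps if "Streaming" in a.facets]
print(f"Streaming fraction: {len(streaming)}/{len(apps)}")
print("HPC-leaning apps:", [a.name for a in apps if a.suits_hpc()])
```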
Cyberinfrastructure
Our approach to data-intensive applications relies on the Apache Big Data Stack (ABDS) for the core software building blocks, adding an interface layer, MIDAS (the Middleware for Data-Intensive Analytics and Science), that enables scalable applications with the performance of HPC (High-Performance Computing) and the rich functionality of the commodity ABDS. Here we developed major HPC enhancements to the ABDS software, including Harp, based on Hadoop, and Cylon/Twister2, based on Heron, Spark, and Flink, for both batch and streaming scenarios. Pilot jobs from Rutgers were very successful in resource management and scheduling for high-throughput parallel computing on NSF and DoE systems. We contributed new techniques for achieving high performance across systems coded in C++, Java, and Python. MIDAS allows our libraries to be scalable and interoperable across a range of computing systems including clouds, clusters, and supercomputers. We also recognized [2] and contributed to two important broad categories: HPCforML (CIforAI) and MLforHPC (AIforCI).
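As an illustration of the pilot-job pattern mentioned above, the following minimal sketch assumes the RADICAL-Pilot Python API (radical.pilot). Class and attribute names vary across releases, and the resource label local.localhost and task payloads are placeholders, so treat this as a pattern rather than a drop-in script tied to this project's deployments.

```python
# Minimal pilot-job sketch, assuming a recent radical.pilot release; names and
# attributes differ slightly between versions.
import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # One pilot acquires resources once; many small tasks then run inside it,
    # avoiding a scheduler round-trip per task (the high-throughput use case).
    pilot = pmgr.submit_pilots(rp.PilotDescription(
        {'resource': 'local.localhost', 'runtime': 15, 'cores': 4}))
    tmgr.add_pilots(pilot)

    tasks = [rp.TaskDescription({'executable': '/bin/echo',
                                 'arguments': ['task', str(i)]})
             for i in range(16)]
    tmgr.submit_tasks(tasks)
    tmgr.wait_tasks()
finally:
    session.close()
```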
Community Applications and Algorithms
Another major project product was a cross-cutting high-performance data-analysis library, SPIDAL (the Scalable Parallel Interoperable Data Analytics Library). The library has four components: a) a core library covering well-established functionality such as optimization and clustering; b) parallel graph and network algorithms; c) analysis of biomolecular simulations (high-performance versions of existing libraries from Utah and Arizona State); and d) image processing for both Polar Science and Pathology.
The project also led to significant algorithmic advances in machine learning methods for networks, including motif detection, anomaly detection, explainability of clustering, deep learning for epidemic forecasting (TDEFSI, in the MLforHPC category), and the foundations of dynamical systems on networks. We supported the mitigation of the coronavirus outbreak with simulations of different spreading scenarios and possible interventions. For Polar Science, we developed operational ML/DL methods to locate ice-sheet boundaries and snow layers from radar data. In Public Health GIS, we researched and implemented spatial big-data queries for opioid epidemic prevention and intervention, while for Pathology, we developed DL-based image-analysis tools for image segmentation, 3D registration, reconstruction, and spatial analysis. For the major Biomolecular Simulation community, SPIDAL developed PMDA, which parallelizes the widely used MDAnalysis Python package for MD (Molecular Dynamics) trajectory analysis; a usage sketch appears below. In this area, our recent MLforHPC research has shown surrogates that improve molecular dynamics simulation performance by very large factors for both short time scales (using recurrent neural networks) and long time scales (using fully connected networks). This broad impact was enhanced by the more than 50 REU undergraduate students mentored by our project over its full duration.
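As a concrete example of the biomolecular-analysis component, the sketch below runs PMDA's parallel RMSD analysis on top of MDAnalysis. The input file names are placeholders, and the module layout and result attribute follow the PMDA 0.x documentation as best recalled, so consult the current PMDA docs before relying on it.

```python
# Hedged sketch of parallel trajectory analysis with PMDA (pmda.rms.RMSD),
# which parallelizes the corresponding MDAnalysis analysis over trajectory
# blocks. 'topol.tpr' and 'traj.xtc' are placeholder input files.
import MDAnalysis as mda
from pmda.rms import RMSD

u = mda.Universe('topol.tpr', 'traj.xtc')   # trajectory to analyse
ref = mda.Universe('topol.tpr')             # reference structure

ca = u.select_atoms('name CA')
ca_ref = ref.select_atoms('name CA')

rmsd = RMSD(ca, ca_ref)
# n_blocks trajectory slices are processed by n_jobs Dask workers in parallel.
rmsd.run(n_jobs=4, n_blocks=4)

# Result array (per PMDA docs): one row per frame with frame index, time, RMSD.
print(rmsd.rmsd.shape)
```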
Project-wide References
[1] "Summary Perspectives of the SPIDAL Project NSF #1443054 from 2014-2020," http://dx.doi.org/10.13140/RG.2.2.16245.65764
[2] "Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation," HPDC Workshop at IPDPS 2019, Rio de Janeiro, 2019. https://arxiv.org/abs/1902.10810
[3] "Contributions to High-Performance Big Data Computing," in Future Trends of HPC in a Disruptive Scenario, Grandinetti, L., Joubert, G.R., Michielsen, K., Mirtaheri, S.L., Taufer, M., Yokota, R., Eds. IOS, 2019. http://dx.doi.org/10.13140/RG.2.2.25192.11528
[4] "Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science NSF14-43054 Progress Report. A 21-month Project Report," Sep. 2016. http://dx.doi.org/10.13140/RG.2.2.23559.47524
Last Modified: 02/02/2022
Modified by: Geoffrey C Fox