NSF Award Search: Award # 1741040

Award Abstract # 1741040

BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF SOUTHERN CALIFORNIA
Initial Amendment Date:	August 30, 2017
Latest Amendment Date:	August 26, 2021
Award Number:	1741040
Award Instrument:	Standard Grant
Program Manager:	Almadena Chtchelkanova achtchel@nsf.gov (703)292-7498 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	October 1, 2017
End Date:	September 30, 2022 (Estimated)
Total Intended Award Amount:	$516,000.00
Total Awarded Amount to Date:	$616,000.00
Funds Obligated to Date:	FY 2017 = $516,000.00 FY 2021 = $100,000.00
History of Investigator:	Ewa Deelman (Principal Investigator) deelman@isi.edu Rafael Ferreira da Silva (Former Principal Investigator) Ewa Deelman (Former Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Southern California 3720 S FLOWER ST FL 3 LOS ANGELES CA US 90033 (213)740-7762
Sponsor Congressional District:	34
Primary Place of Performance:	University of Southern California 4676 Admiralty Way, Suite 1001 Marina del Rey CA US 90292-6611
Primary Place of Performance Congressional District:	36
Unique Entity Identifier (UEI):	G88KLJR3KYT5
Parent UEI:
NSF Program(s):	Software & Hardware Foundation, Big Data Science & Engineering
Primary Program Source:	01001718DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7433, 7924, 7942, 8083
Program Element Code(s):	779800, 808300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Molecular dynamics simulations studying the classical time evolution of a molecular system at atomic resolution are widely recognized in the fields of chemistry, material sciences, molecular biology and drug design; these simulations are one of the most common simulations on supercomputers. Next-generation supercomputers will have dramatically higher performance than do current systems, generating more data that needs to be analyzed (i.e., in terms of number and length of molecular dynamics trajectories). The coordination of data generation and analysis cannot rely on manual, centralized approaches as it does now. This interdisciplinary project integrates research from various areas across programs such as computer science, structural molecular biosciences, and high performance computing to transform the centralized nature of the molecular dynamics analysis into a distributed approach that is predominantly performed in situ. Specifically, this effort combines machine learning and data analytics approaches, workflow management methods, and high performance computing techniques to analyze molecular dynamics data as it is generated, save to disk only what is really needed for future analysis, and annotate molecular dynamics trajectories to drive the next steps in increasingly complex simulations' workflows.

The investigators tackle the challenge of analyzing molecular dynamics simulation data on next-generation supercomputers by (1) creating new in situ methods to trace molecular events such as conformational changes, phase transitions, or binding events in molecular dynamics simulations at runtime by locally reducing knowledge of high-dimensional molecular organization into a set of relevant structural molecular properties; (2) designing new data representations and extending unsupervised machine learning techniques to accurately and efficiently build an explicit global organization of structural and temporal molecular properties; (3) integrating simulation and analytics into complex workflows for runtime detection of changes in structural and temporal molecular properties; and (4) developing new curriculum material, online courses, and online training material targeting data analytics. The knowledge of molecular structures' transformations that the project harnesses at runtime can be used to steer simulations to more promising areas of the simulation space, identify the data that should be written to congested parallel file systems, and index generated data for retrieval and post-simulation analysis. Supported by this knowledge, molecular dynamics workflows such as replica exchange simulations, Markov state models, and the string method with swarms of trajectories can be executed "from the outside" (i.e., without reengineering the molecular dynamics code).

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Do, Tu M. and Pottier, L. and Thomas, S. and Ferreira da Silva, R. and Cuendet, M. A. and Weinstein, H. and Estrada, T. and Taufer, M. and Deelman, E. "A Novel Metric to Evaluate In Situ Workflows" Lecture Notes in Computer Science, v.12137, 2020 https://doi.org/10.1007/978-3-030-50371-0_40

Estrada, Trilce and Benson, Jeremy and Carrillo-Cabada, Hector and Razavi, Asghar M. and Cuendet, Michel A. and Weinstein, Harel and Deelman, Ewa and Taufer, Michela "Graphic Encoding of Macromolecules for Efficient High-Throughput Analysis" BCB '18 Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018 https://doi.org/10.1145/3233547.3233607

Ferreira da Silva, Rafael and Callaghan, Scott and Do, Tu Mai and Papadimitriou, George and Deelman, Ewa "Measuring the impact of burst buffers on data-intensive scientific workflows" Future Generation Computer Systems, v.101, 2019 https://doi.org/10.1016/j.future.2019.06.016

Pottier, Loic and Ferreira da Silva, Rafael and Casanova, Henri and Deelman, Ewa "Modeling the Performance of Scientific Workflow Executions on HPC Platforms with Burst Buffers" 2020 IEEE International Conference on Cluster Computing (CLUSTER), 2020 https://doi.org/10.1109/CLUSTER49012.2020.00019

Taufer, Michela and Thomas, Stephen and Wyatt, Michael and Anh Do, Tu Mai and Pottier, Loic and da Silva, Rafael Ferreira and Weinstein, Harel and Cuendet, Michel A. and Estrada, Trilce and Deelman, Ewa "Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers" 2019 15th International Conference on eScience (eScience), 2019 https://doi.org/10.1109/eScience.2019.00027

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Molecular Dynamics (MD) simulations are widely recognized in chemistry, material sciences, molecular biology, and drug design. The system sizes and time scales accessible to MD simulations have been steadily increasing. Today, MD simulations are the most common simulations running on petascale machines. For example, a survey of usage of NSF computing resources over six months of 2022 shows that biomolecular codes, predominantly MD codes, use 25.7% of these resources. The transition from petascale to exascale computing brings unprecedented computing capability to MD simulations. The greater computing power of the new generation of high-performance computing (HPC) systems translates directly into the ability to execute many more, and longer, simulations. For MD simulations, this in turn means more data that needs to be analyzed. The analysis must run concurrently with the simulations to keep up with their pace.

This project transformed the centralized nature of MD analysis into a distributed approach that is performed in situ, supports a broad range of MD codes, and can enable on-the-fly tuning of MD workflows. Traditional MD data analytics is centralized: it first generates and saves all the trajectory data to storage and then relies on post-simulation analysis. In contrast, the project calculated advanced collective variables to analyze data as they were generated and annotated MD outputs to steer the next steps in increasingly complex MD workflows.
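The idea of analyzing frames as they are generated can be illustrated with a minimal sketch. It is not the project's software: the toy frame generator, the choice of an unweighted radius of gyration as the collective variable, and the names `toy_trajectory`, `radius_of_gyration`, and `analyze_in_situ` are all assumptions made for illustration.

```python
import math
import random
from typing import Iterator, List, Tuple

Frame = List[Tuple[float, float, float]]  # one (x, y, z) position per atom


def toy_trajectory(n_frames: int, n_atoms: int, seed: int = 0) -> Iterator[Frame]:
    """Hypothetical stand-in for an MD engine: yields one frame at a time."""
    rng = random.Random(seed)
    coords = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(n_atoms)]
    for _ in range(n_frames):
        # Random displacement of every atom; not real dynamics.
        coords = [tuple(c + rng.gauss(0.0, 0.05) for c in atom) for atom in coords]
        yield coords


def radius_of_gyration(frame: Frame) -> float:
    """Unweighted radius of gyration (all atoms treated as equal mass)."""
    n = len(frame)
    center = [sum(atom[k] for atom in frame) / n for k in range(3)]
    mean_sq = sum(
        sum((atom[k] - center[k]) ** 2 for k in range(3)) for atom in frame
    ) / n
    return math.sqrt(mean_sq)


def analyze_in_situ(frames: Iterator[Frame], threshold: float):
    """Annotate every frame in memory; mark for storage only the frames whose
    collective variable moved more than `threshold` since the last kept frame."""
    annotations = []  # (step, collective variable) for every frame
    kept_steps = []   # steps that would actually be written to disk
    last_kept = None
    for step, frame in enumerate(frames):
        cv = radius_of_gyration(frame)
        annotations.append((step, cv))
        if last_kept is None or abs(cv - last_kept) > threshold:
            kept_steps.append(step)
            last_kept = cv
    return annotations, kept_steps


annotations, kept_steps = analyze_in_situ(toy_trajectory(50, 10), threshold=0.01)
print(f"annotated {len(annotations)} frames, kept {len(kept_steps)} for storage")
```

In this sketch each frame is consumed from the generator and discarded after its collective variable is computed, so only the per-frame annotations and the selected step indices remain in memory.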

The project designed an in situ data analytics approach for the most commonly used MD codes. The targeted workflows required neither recompiling any MD code nor redesigning any MD script. Instead, the new solutions captured outputs in memory at runtime as they were generated. The project demonstrated these new capabilities in the context of enhanced adaptive sampling, where they enabled exploring the conformational space of simple peptides and complex molecular systems such as ribosomes. The proposed solution modeled the execution of an ensemble of trajectories starting from random unfolded states and analyzed the overall throughput obtained using in situ methods and the MD framework on supercomputers. Using annotation-based early termination, scientists can now obtain more extensive coverage of the studied reference conformational space with fewer MD steps than a traditional execution of the MD simulation (i.e., without any early termination or steering) would use.
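Annotation-based early termination of an ensemble can be sketched on a toy model. The sketch assumes a one-dimensional random walk over discretized states in place of real MD, and a simple "no new state for `patience` steps" stopping rule; the function `run_ensemble` and all of its parameters are hypothetical and do not reproduce the project's criteria.

```python
import random
from typing import Optional, Set, Tuple


def run_ensemble(
    n_traj: int,
    max_steps: int,
    patience: Optional[int],
    n_bins: int = 50,
    seed: int = 0,
) -> Tuple[Set[int], int]:
    """Run `n_traj` toy trajectories over `n_bins` discretized states.

    Each step is annotated as "new" or "already visited by the ensemble".
    If `patience` is set, a trajectory stops once it has produced that many
    consecutive steps without reaching a new state. Returns the set of
    visited states and the total number of steps executed.
    """
    rng = random.Random(seed)
    visited: Set[int] = set()
    total_steps = 0
    for _ in range(n_traj):
        pos = rng.randrange(n_bins)  # random starting state
        visited.add(pos)
        stale = 0  # consecutive steps that found nothing new
        for _ in range(max_steps):
            # One step of a bounded random walk stands in for MD integration.
            pos = min(n_bins - 1, max(0, pos + rng.choice((-1, 1))))
            total_steps += 1
            if pos in visited:
                stale += 1
            else:
                visited.add(pos)
                stale = 0
            if patience is not None and stale >= patience:
                break  # early termination driven by the annotations
    return visited, total_steps


full_cov, full_steps = run_ensemble(20, 200, patience=None)
early_cov, early_steps = run_ensemble(20, 200, patience=10)
print(f"no early termination: {len(full_cov)} states in {full_steps} steps")
print(f"early termination:    {len(early_cov)} states in {early_steps} steps")
```

Comparing states covered per step between the two runs gives a rough feel for the trade-off described above; the toy model says nothing about the gains achievable on real molecular systems.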

Last Modified: 01/30/2023
Modified by: Ewa Deelman

Please report errors in award information by writing to: awardsearch@nsf.gov.
