Award Abstract # 1740990
BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: WEILL MEDICAL COLLEGE OF CORNELL UNIVERSITY
Initial Amendment Date: August 30, 2017
Latest Amendment Date: August 27, 2021
Award Number: 1740990
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2017
End Date: September 30, 2023 (Estimated)
Total Intended Award Amount: $497,056.00
Total Awarded Amount to Date: $547,056.00
Funds Obligated to Date: FY 2017 = $497,056.00
FY 2021 = $50,000.00
History of Investigator:
  • Harel Weinstein (Principal Investigator)
    haw2002@med.cornell.edu
  • Michel Cuendet (Co-Principal Investigator)
  • Harel Weinstein (Former Principal Investigator)
  • Michel Cuendet (Former Principal Investigator)
  • Harel Weinstein (Former Co-Principal Investigator)
Recipient Sponsored Research Office: Joan and Sanford I. Weill Medical College of Cornell University
575 LEXINGTON AVE FL 9
NEW YORK
NY  US  10022-6145
(646)962-8290
Sponsor Congressional District: 12
Primary Place of Performance: Joan and Sanford I. Weill Medical College of Cornell University
1300 York Avenue
New York
NY  US  10065-4896
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): YNT8TCJH8FQ8
Parent UEI: QV1RJ11H58C4
NSF Program(s): Software & Hardware Foundation,
Big Data Science & Engineering
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7924, 7942, 8083
Program Element Code(s): 779800, 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Molecular dynamics simulations studying the classical time evolution of a molecular system at atomic resolution are widely recognized in the fields of chemistry, material sciences, molecular biology and drug design; these simulations are one of the most common simulations on supercomputers. Next-generation supercomputers will have dramatically higher performance than do current systems, generating more data that needs to be analyzed (i.e., in terms of number and length of molecular dynamics trajectories). The coordination of data generation and analysis cannot rely on manual, centralized approaches as it does now. This interdisciplinary project integrates research from various areas across programs such as computer science, structural molecular biosciences, and high performance computing to transform the centralized nature of the molecular dynamics analysis into a distributed approach that is predominantly performed in situ. Specifically, this effort combines machine learning and data analytics approaches, workflow management methods, and high performance computing techniques to analyze molecular dynamics data as it is generated, save to disk only what is really needed for future analysis, and annotate molecular dynamics trajectories to drive the next steps in increasingly complex simulations' workflows.

The investigators tackle the challenge of analyzing molecular dynamics simulation data on next-generation supercomputers by (1) creating new in situ methods to trace molecular events such as conformational changes, phase transitions, or binding events in molecular dynamics simulations at runtime by locally reducing knowledge on high-dimensional molecular organization into a set of relevant structural molecular properties; (2) designing new data representations and extending unsupervised machine learning techniques to accurately and efficiently build an explicit global organization of structural and temporal molecular properties; (3) integrating simulation and analytics into complex workflows for runtime detection of changes in structural and temporal molecular properties; and (4) developing new curriculum material, online courses, and online training material targeting data analytics. The knowledge of molecular structures' transformations that the project captures at runtime can be used to steer simulations to more promising areas of the simulation space, identify the data that should be written to congested parallel file systems, and index generated data for retrieval and post-simulation analysis. Supported by this knowledge, molecular dynamics workflows such as replica exchange simulations, Markov state models, and the string method with swarms of trajectories can be executed "from the outside" (i.e., without reengineering the molecular dynamics code).

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Choi, Annette and Kots, Ekaterina D. and Singleton, Deanndria T. and Weinstein, Harel and Whittaker, Gary R. "Analysis of the molecular determinants for furin cleavage of the spike protein S1/S2 site in defined strains of the prototype coronavirus murine hepatitis virus (MHV)" Virus Research, v.340, 2024. https://doi.org/10.1016/j.virusres.2023.199283
Estrada, Trilce and Benson, Jeremy and Carrillo-Cabada, Hector and Razavi, Asghar M. and Cuendet, Michel A. and Weinstein, Harel and Deelman, Ewa and Taufer, Michela "Graphic Encoding of Macromolecules for Efficient High-Throughput Analysis" BCB '18 Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018. https://doi.org/10.1145/3233547.3233607
Plante, Ambrose and Shore, Derek M. and Morra, Giulia and Khelashvili, George and Weinstein, Harel "A Machine Learning Approach for the Discovery of Ligand-Specific Functional Mechanisms of GPCRs" Molecules, v.24, 2019. https://doi.org/10.3390/molecules24112097

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Molecular Dynamics (MD) simulations are widely recognized in various fields, such as chemistry, materials science, molecular biology, and drug design. The scope and duration of MD simulations have consistently expanded, making them the most common simulations on petascale computers. For instance, a six-month study in 2022 of the National Science Foundation's computing resources indicated that biomolecular codes, mainly MD codes, accounted for 25.7% of their usage. The shift from petascale to exascale computing has brought unparalleled computational power to MD simulations, allowing the new high-performance computing systems to perform more extensive and longer simulations. This increased capability results in larger datasets from MD simulations, necessitating analysis that keeps pace with the simulations.

This project transforms MD analysis from a centralized to a distributed in situ approach, accommodating a wide array of MD codes and enabling real-time adjustments of MD workflows. Unlike conventional centralized data analytics, which save all trajectory data for post-simulation analysis, this project implements advanced collective variable calculation and annotates MD outputs to guide subsequent stages in complex MD workflows.
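As a rough illustration of the in situ idea described above, the following Python sketch reduces each frame to a single collective variable as it is produced and keeps only that annotation rather than the full coordinates. It is a minimal sketch under stated assumptions, not the project's implementation: the radius of gyration is chosen here only as a familiar example of a collective variable, and `toy_trajectory`, `radius_of_gyration`, and `annotate_in_situ` are hypothetical names introduced for this example.

```python
import math
import random


def radius_of_gyration(frame):
    """Example collective variable: RMS distance of atoms from their centroid.

    Assumes equal atomic masses; `frame` is a list of (x, y, z) tuples.
    """
    n = len(frame)
    cx = sum(p[0] for p in frame) / n
    cy = sum(p[1] for p in frame) / n
    cz = sum(p[2] for p in frame) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in frame)
    return math.sqrt(sq / n)


def toy_trajectory(n_frames, n_atoms, seed=0):
    """Hypothetical stand-in for an MD engine: yields one frame at a time."""
    rng = random.Random(seed)
    frame = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(n_atoms)]
    for _ in range(n_frames):
        # Perturb every coordinate slightly to mimic a time step.
        frame = [tuple(c + rng.gauss(0.0, 0.05) for c in atom) for atom in frame]
        yield frame


def annotate_in_situ(frames):
    """Consume frames as they are generated; retain only (step, CV) annotations."""
    return [(step, radius_of_gyration(frame)) for step, frame in enumerate(frames)]


annotations = annotate_in_situ(toy_trajectory(n_frames=100, n_atoms=50))
```

Because the frames are consumed from a generator, no more than one frame is held in memory at a time; only the list of annotations grows with trajectory length.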

The project delivers an in situ data analytics method compatible with popular MD codes without requiring recompilation or script redesign. It captures output in memory in real-time, enhancing adaptive sampling. This allows for exploring conformational spaces in simple peptides and complex systems like ribosomes. The solution assesses the efficiency of ensemble trajectories and in situ methods on supercomputers. Annotation-based early termination enables scientists to cover more conformational space with fewer MD steps than traditional methods without such termination or steering.
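The annotation-based early termination mentioned above can be sketched in Python as follows. This is an illustrative sketch only: the stopping rule (no previously unvisited collective-variable bin for `patience` consecutive steps), the one-dimensional binning, and the names `run_with_early_termination` and `random_walk_cv` are assumptions made for this example, not details taken from the project.

```python
import math
import random


def run_with_early_termination(cv_stream, visited_bins, bin_width=0.5, patience=20):
    """Consume collective-variable values until the trajectory stops finding new bins.

    `visited_bins` is a set shared across trajectories and updated in place.
    Returns the number of steps consumed before termination.
    """
    steps = 0
    stale = 0  # consecutive steps that landed in an already-visited bin
    for cv in cv_stream:
        steps += 1
        bin_index = math.floor(cv / bin_width)
        if bin_index in visited_bins:
            stale += 1
        else:
            visited_bins.add(bin_index)
            stale = 0
        if stale >= patience:
            break  # assumed rule: terminate once no new region has been seen recently
    return steps


def random_walk_cv(n_steps, start, seed):
    """Hypothetical stand-in for a trajectory's CV time series."""
    rng = random.Random(seed)
    cv = start
    for _ in range(n_steps):
        cv += rng.gauss(0.0, 0.1)
        yield cv


visited = set()
steps_used = [
    run_with_early_termination(random_walk_cv(1000, 0.0, seed), visited)
    for seed in range(5)
]
```

Sharing `visited` across the ensemble means a later trajectory is stopped quickly if it only revisits regions that earlier trajectories already covered, which is one plausible way to spend fewer MD steps per newly explored region.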

Last Modified: 01/29/2024
Modified by: Harel Weinstein

Please report errors in award information by writing to: awardsearch@nsf.gov.
