Award Abstract # 2212465
OAC Core: Interpretable Resilience Analysis Platform for Scientific Workflow Applications

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: KENT STATE UNIVERSITY
Initial Amendment Date: June 29, 2022
Latest Amendment Date: May 19, 2024
Award Number: 2212465
Award Instrument: Standard Grant
Program Manager: Varun Chandola
OAC Office of Advanced Cyberinfrastructure (OAC)
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2022
End Date: August 31, 2025 (Estimated)
Total Intended Award Amount: $574,640.00
Total Awarded Amount to Date: $584,640.00
Funds Obligated to Date: FY 2022 = $574,640.00
FY 2024 = $10,000.00
History of Investigator:
  • Qiang Guan (Principal Investigator)
    qguan@kent.edu
Recipient Sponsored Research Office: Kent State University
1500 HORNING RD
KENT
OH  US  44242-0001
(330)672-2070
Sponsor Congressional District: 14
Primary Place of Performance: Kent State University
OFFICE OF THE COMPTROLLER
KENT
OH  US  44242-0001
Primary Place of Performance Congressional District: 14
Unique Entity Identifier (UEI): KXNVA7JCC5K6
Parent UEI:
NSF Program(s): OAC-Advanced Cyberinfrast Core
Primary Program Source: 01002223DB NSF RESEARCH & RELATED ACTIVIT
01002425DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 7923, 9251
Program Element Code(s): 090Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

For years, scientists have continued to improve the performance of simulations while neglecting resilience. This approach was driven by a lack of understanding of "cause and effect" in resilience analysis. While current resilience analysis tools continue to lack transparency and interpretability, it is critical to promote resilience analysis and to educate scientists on its importance. This project's novelty lies in redefining resilience analysis in terms of interpretability and explainability. The approach differs significantly from existing endeavors: it can explain or identify the logic behind resilience predictions and differentiate the functions and usages of existing tools built on different theories. The project's impacts include designing a new resilience assessment system that uses visualization and DevOps to enable transparent resilience analysis, vulnerability positioning, and automated continuous integration of resilience checks. The project will work with NSF- and DoE-sponsored supercomputing centers to adopt the system with proven success. Graduate and undergraduate students, especially those from underrepresented groups, will be trained in multiple disciplines, preparing them for successful careers in computing and scientific research areas that are becoming increasingly interdisciplinary.

This project builds upon existing knowledge to create a new approach for assessing the resilience of scientific applications, given the inevitable surge of soft errors in next-generation high-performance computing systems. It will bring further clarity, insight, and understanding into how systems behave while running high-performance computing scientific workloads composed of parallel simulations for data generation, big data analytics, and machine learning to extract data insights in scientific research. The project proposes (1) the design and implementation of an error propagation analysis platform that creates interpretable visualizations of the critical paths and critical sections of the code; (2) analytics that allow domain scientists to compare and contrast different resilience models on the simulation codes; (3) a continuous resilience assessment (Resilience CI) that can be integrated into standard continuous integration to automate the procedure, so that changes in the resilience property between committed versions are delivered to developers as a standard report, supporting the DevOps of exascale scientific applications; and (4) an evaluation in which quantum chemistry workflows serve as the driver applications. The project's outcomes, such as tutorials, collected data, and the visualization software system, can encourage application developers to incorporate cost-effective fault tolerance strategies. In addition, the investigators will incorporate research outcomes into new courses and tutorials for workforce training. The project will engage and advance partnerships with industry for commercialization.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hood, Kendric and Shen, Hao and Mao, Hanbin and Guan, Qiang "Machine Learning Applied to Single-Molecule Activity Prediction" SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023 https://doi.org/10.1145/3624062.3626083
Jiang, Hailong and Ruan, Shaolun and Fang, Bo and Wang, Yong and Guan, Qiang "Visilience: An Interactive Visualization Framework for Resilience Analysis using Control-Flow Graph" 2023 https://doi.org/10.1109/prdc59308.2023.00041
Yang, Yuxin and Wang, Zixu and Ahadian, Pegah and Jerger, Abby and Zucker, Jeremy and Feng, Song and Cheng, Feixiong and Guan, Qiang "A Deep Multimodal Representation Learning Framework for Accurate Molecular Properties Prediction" 2024 https://doi.org/10.1145/3649476.3660377
