Award Abstract # 2135309
Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: VANDERBILT UNIVERSITY
Initial Amendment Date: August 31, 2021
Latest Amendment Date: August 31, 2021
Award Number: 2135309
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov, (703) 292-7498
CCF, Division of Computing and Communication Foundations
CSE, Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2022
End Date: December 31, 2024 (Estimated)
Total Intended Award Amount: $300,000.00
Total Awarded Amount to Date: $300,000.00
Funds Obligated to Date: FY 2021 = $300,000.00
History of Investigator:
  • Padma Raghavan (Principal Investigator)
    padma.raghavan@vanderbilt.edu
  • Hongyang Sun (Co-Principal Investigator)
Recipient Sponsored Research Office: Vanderbilt University
110 21ST AVE S
NASHVILLE
TN  US  37203-2416
(615)322-2631
Sponsor Congressional District: 05
Primary Place of Performance: Vanderbilt University
Sponsored Programs Administration
Nashville
TN  US  37235-0002
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): GTNBNWXJ12D5
Parent UEI: K9AHBDTKCB55
NSF Program(s): Software & Hardware Foundations
Primary Program Source: 01002122DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7942, 7923, 9102
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood of such errors and their negative impacts are further increased because these simulations are typically long-running, and the corruption of a single data field or variable may require weeks to months of re-computation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant to hardware faults; these applications are widely used not only across multiple industrial sectors but also to increase the predictive power of climate and weather models that aid critical decision making.

Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are stored redundantly at periodic intervals to allow for recovery, which limits scalability due to significant memory and processing overheads. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains. The project is based on the premise that, in such physics-based applications, the rate of change of the solution-vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding technique to apply. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency-gradient metrics; (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient, with the goal of reducing computational overheads and increasing scalability; and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime-system developers to seek broader use of this fault-tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students.
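
To illustrate the premise above, here is a minimal Python sketch of how such a rate-of-change (resiliency-gradient) metric might be computed and used to flag critical solution-vector components. The function names, the normalization, and the top-fraction selection rule are illustrative assumptions for exposition, not the project's actual design.

import numpy as np

def resiliency_gradient(x_prev, x_curr, eps=1e-12):
    # Per-component relative rate of change of the solution vector
    # between consecutive iterations; under this heuristic, fast-changing
    # components are assumed more sensitive to corruption.
    return np.abs(x_curr - x_prev) / (np.abs(x_prev) + eps)

def select_critical_components(x_prev, x_curr, fraction=0.05):
    # Flag the top `fraction` of components by resiliency gradient as
    # candidates for stronger protection (e.g., replication or checksums);
    # the rest receive cheaper or no safeguarding.
    g = resiliency_gradient(x_prev, x_curr)
    k = max(1, int(fraction * g.size))
    return np.argsort(g)[-k:]  # indices of the k fastest-changing components

# Toy usage with two consecutive iterates of a simulated solution vector.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)
x1 = x0 + 0.01 * rng.standard_normal(1000)
critical = select_critical_components(x0, x1)
print(f"protecting {critical.size} of {x0.size} components")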

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Chen, Zizhao and Verrecchia, Thomas and Sun, Hongyang and Booth, Joshua and Raghavan, Padma. "Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts," 2023. https://doi.org/10.1145/3624062.3624117
Johnson, Daniel Ryley and Sun, Hongyang and Booth, Joshua Dennis and Raghavan, Padma. "To Protect or Not To Protect: Probability-Aware Selective Protection for Sparse Iterative Solvers," 2024. https://doi.org/10.1109/SBAC-PAD63648.2024.00028
Perotin, Lucas and Kandaswamy, Sandhya and Sun, Hongyang and Raghavan, Padma. "Multi-resource scheduling of moldable workflows," Journal of Parallel and Distributed Computing, v.184, 2024. https://doi.org/10.1016/j.jpdc.2023.104792
Perotin, Lucas and Sun, Hongyang. "Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models," ACM Transactions on Parallel Computing, v.11, 2024. https://doi.org/10.1145/3630052
Bautista-Gomez, Leonardo and Benoit, Anne and Di, Sheng and Herault, Thomas and Robert, Yves and Sun, Hongyang. "A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?" Future Generation Computer Systems, v.161, 2024. https://doi.org/10.1016/j.future.2024.07.022
Benoit, Anne and Perotin, Lucas and Robert, Yves and Sun, Hongyang. "Online Scheduling of Moldable Task Graphs under Common Speedup Models," 2022. https://doi.org/10.1145/3545008.3545049

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The major goal of this project is to make large-scale scientific applications on high-performance computers (supercomputers) tolerant to hard faults (e.g., hardware malfunctions, crashes) and silent errors (e.g., undetectable double bit flips). As today's supercomputers reach unprecedented scales, these faults and errors are becoming increasingly common. Ensuring fault tolerance is essential for maintaining the reliability, accuracy, and efficiency of scientific simulations and computations. By developing robust fault-tolerant techniques, we can enable applications to continue running despite failures, minimizing data loss and reducing the need for costly recomputations.

Traditional fault-tolerance techniques often rely on replicating the entire computation to detect and/or correct faults, which incurs significant overhead. In this project, we seek intelligent solutions that leverage application-specific characteristics to decide what needs to be safeguarded in the presence of faults, thereby substantially reducing that overhead. The research has established a resilience framework through which scientific applications can more efficiently tolerate and recover from hard faults and silent errors. Some of our important findings include the following:

  • We identified new metrics, particularly for sparse linear solvers (an important routine in scientific applications), that capture the impacts of faults on application performance. Leveraging these metrics to selectively protect the critical components of iterative solvers can significantly reduce the resilience overhead; a minimal sketch of this selective-protection idea appears after this list.

  • We provided strong evidence that effective fault tolerance at the system level can be achieved by combining algorithmic approaches with machine-learning predictions. This opens promising research directions for other types of scientific simulations and for machine-learning workloads.

  • Our results on multi-precision computation and online resource management demonstrate that application runtime can be reduced, either by using lower-precision calculations for replication or through more effective task-scheduling mechanisms, while maintaining a high level of fault tolerance and competitive performance.
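
As a concrete, deliberately simplified illustration of the selective-protection idea in the first finding, the Python sketch below recomputes and cross-checks only the rows of a matrix-vector product that a predictor has flagged as critical. A dense NumPy matrix stands in for a real sparse kernel, and `critical_idx` stands in for the output of an impact-prediction model; these names and the recovery rule are assumptions for illustration, not the project's code.

import numpy as np

def selectively_protected_matvec(A, x, critical_idx):
    # Matrix-vector product in which only the rows listed in
    # `critical_idx` are computed a second time and compared, trading
    # full duplication for cheaper, targeted detection of silent errors.
    y = A @ x
    y_check = A[critical_idx] @ x  # redundant pass over critical rows only
    mismatched = ~np.isclose(y[critical_idx], y_check)
    if mismatched.any():
        # A silent error was detected in a protected entry; recover it
        # here by trusting the recomputed value (a real solver might
        # instead roll back or trigger a third, tie-breaking pass).
        y[critical_idx[mismatched]] = y_check[mismatched]
    return y

# Toy usage: protect the 5% of rows an (assumed) ML model deems critical.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
x = rng.standard_normal(200)
critical_idx = rng.choice(200, size=10, replace=False)
y = selectively_protected_matvec(A, x, critical_idx)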

Our research findings will significantly influence the design and implementation of future fault tolerance and resource management algorithms. They will also contribute to increasing the energy efficiency of supercomputers, enhancing system resilience, and lowering the costs associated with running large-scale scientific simulations or training large AI models.

The project has led to multiple peer-reviewed publications in top venues related to high-performance computing and has provided research opportunities for several graduate and undergraduate students. The results have been disseminated through seminars and conference presentations, as well as integrated into the courses we teach at our institutions.

 


Last Modified: 04/03/2025
Modified by: Hongyang Sun
