
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | August 31, 2021 |
Latest Amendment Date: | August 31, 2021 |
Award Number: | 2135309 |
Award Instrument: | Standard Grant |
Program Manager: | Almadena Chtchelkanova, achtchel@nsf.gov, (703)292-7498; CCF Division of Computing and Communication Foundations; CSE Directorate for Computer and Information Science and Engineering |
Start Date: | January 1, 2022 |
End Date: | December 31, 2024 (Estimated) |
Total Intended Award Amount: | $300,000.00 |
Total Awarded Amount to Date: | $300,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 110 21ST AVE S NASHVILLE TN US 37203-2416 (615)322-2631 |
Sponsor Congressional District: |
|
Primary Place of Performance: | Sponsored Programs Administratio Nashville TN US 37235-0002 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood and negative impact of such errors are further increased because these simulations are typically long running, and the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant of hardware faults. These applications are widely used not only across multiple industrial sectors but also to increase the predictive power of climate and weather models that aid critical decision making.
Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored at regular intervals to allow for recovery; the significant memory and processing overheads of the latter limit their scalability. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains. The project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient, with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students.
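The premise above, that the rate of change of solution components can guide what to safeguard, can be illustrated with a minimal sketch. The abstract does not define its resiliency gradient metrics, so everything here is an assumption made for illustration: the per-component relative change used as a stand-in metric, the `threshold` value, the choice to flag fast-changing components, and the helper names `relative_change`, `select_for_protection`, and `diffuse`.

```python
# Hypothetical sketch: the award text does not define its "resiliency gradient"
# metric. A per-component relative change between time steps stands in for it.

def relative_change(prev, curr, eps=1e-12):
    """Per-component relative change |curr - prev| / max(|prev|, eps)."""
    return [abs(c - p) / max(abs(p), eps) for p, c in zip(prev, curr)]

def select_for_protection(prev, curr, threshold=0.05):
    """Return indices whose relative change exceeds the assumed threshold."""
    rates = relative_change(prev, curr)
    return [i for i, r in enumerate(rates) if r > threshold]

def diffuse(u, alpha=0.25):
    """One explicit time step of a toy 1D heat equation with fixed boundaries."""
    nxt = u[:]
    for i in range(1, len(u) - 1):
        nxt[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return nxt

# A hot spot in the middle of an otherwise uniform grid.
u0 = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
u1 = diffuse(u0)

# Only the cells near the hot spot change between steps, so only they are flagged.
critical = select_for_protection(u0, u1)
print(critical)  # [2, 3, 4]
```

In this toy run, only three of the seven grid cells would receive protection (for example, checkpointing or replication) at this step, which is the kind of overhead reduction the abstract is aiming for. Whether fast-changing or slow-changing components deserve more protection is a design decision that the abstract leaves open.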
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The major goal of this project is to make large-scale scientific applications on high-performance computers (supercomputers) tolerant of hard faults (e.g., hardware malfunctions, crashes) and silent errors (e.g., undetectable double-bit flips). As today’s supercomputers reach unprecedented scales, these faults and errors are becoming increasingly common. Ensuring fault tolerance is essential for maintaining the reliability, accuracy, and efficiency of scientific simulations and computations. By developing robust fault-tolerant techniques, we can enable applications to continue running despite failures, minimizing data loss and reducing the need for costly recomputations.
Traditional fault tolerance techniques often rely on replicating the entire computation to detect and/or correct faults, which incurs significant overhead. In this project, we sought intelligent solutions that leverage application-specific characteristics to decide what needs to be safeguarded in the presence of faults, thereby substantially reducing the overhead. The research has established a resilience framework by which scientific applications can more efficiently tolerate and recover from hard faults and silent errors. Some of our important findings include the following:
- We identified new metrics, particularly for sparse linear solvers (an important routine in scientific applications), that capture the impact of faults on application performance. Using these metrics to selectively protect the critical components of iterative solvers can significantly reduce the resilience overhead.
- We provided strong evidence that effective system-level fault tolerance can be achieved by combining algorithmic approaches with machine learning predictions. This opens new research opportunities for other types of scientific simulations and machine-learning domains.
- Our results on multi-precision computing and online resource management demonstrate that application runtime can be reduced by using lower-precision calculations for replication or by using more effective task scheduling mechanisms, achieving a high level of fault tolerance with competitive runtime performance.
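The idea of using lower-precision calculations for replication can be sketched as follows. This is an illustration under stated assumptions, not the project's method: the dot-product kernel, the emulation of single precision with the standard `struct` module, the tolerance `rel_tol`, and the helper names `to_f32`, `dot_f32`, `flip_bit`, and `agrees` are all hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b):
    """Primary computation in double precision."""
    return sum(x * y for x, y in zip(a, b))

def dot_f32(a, b):
    """Replica of the same computation in emulated single precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc

def flip_bit(x, bit):
    """Flip one bit of the float64 representation of x (fault injection)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

def agrees(primary, replica, rel_tol=1e-4):
    """Accept the primary result if it matches the replica within rel_tol."""
    scale = max(abs(primary), abs(replica), 1.0)
    return abs(primary - replica) <= rel_tol * scale

a = [0.1 * i for i in range(1, 101)]
b = [1.0 / i for i in range(1, 101)]

clean = dot(a, b)                 # fault-free double-precision result
replica = dot_f32(a, b)           # cheaper low-precision replica
corrupted = flip_bit(clean, 61)   # simulated silent error in an exponent bit

print(agrees(clean, replica))      # True
print(agrees(corrupted, replica))  # False
```

The trade-off in this sketch is that the replica can only catch errors larger than its own rounding error: a flip in a low-order mantissa bit of the primary result stays within the tolerance and goes undetected, while flips in high-order bits are caught.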
Our research findings will significantly influence the design and implementation of future fault tolerance and resource management algorithms. They will also contribute to increasing the energy efficiency of supercomputers, enhancing system resilience, and lowering the costs associated with running large-scale scientific simulations or training large AI models.
The project has led to multiple peer-reviewed publications in top venues related to high-performance computing and has provided research opportunities for several graduate and undergraduate students. The results have been disseminated through seminars and conference presentations, as well as integrated into the courses we teach at our institutions.
Last Modified: 04/03/2025
Modified by: Hongyang Sun