Award Abstract # 1058779
SHF: Small: RESYST: Resilience via Synergistic Redundancy and Fault Tolerance for High-End Computing

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: NORTH CAROLINA STATE UNIVERSITY
Initial Amendment Date: September 22, 2010
Latest Amendment Date: August 31, 2015
Award Number: 1058779
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703) 292-7498
CCF, Division of Computing and Communication Foundations
CSE, Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2010
End Date: September 30, 2016 (Estimated)
Total Intended Award Amount: $376,219.00
Total Awarded Amount to Date: $376,219.00
Funds Obligated to Date: FY 2010 = $376,219.00
History of Investigator:
  • Frank Mueller (Principal Investigator)
    fmuelle@ncsu.edu
Recipient Sponsored Research Office: North Carolina State University
2601 WOLF VILLAGE WAY
RALEIGH
NC  US  27695-0001
(919)515-2444
Sponsor Congressional District: 02
Primary Place of Performance: North Carolina State University
2601 WOLF VILLAGE WAY
RALEIGH
NC  US  27695-0001
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): U3NVH931QJJ3
Parent UEI: U3NVH931QJJ3
NSF Program(s): HIGH-PERFORMANCE COMPUTING
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 9218, HPCC
Program Element Code(s): 794200
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with 10s/100s of thousands of cores. As the core count increases, so does the overhead for fault-tolerant techniques relying on checkpoint/restart (C/R) mechanisms. At 50% overheads, redundancy is a viable alternative to fault recovery and actually scales, which makes the approach attractive for HEC.

The objective of this work is to develop a synergistic approach by combining C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.

This work alleviates scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, significantly advancing existing techniques by controlling levels of redundancy and checkpointing intervals in the presence of faults. It is transformative in providing a model where users select a target failure probability at the price of using additional resources.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


C. Wang, F. Mueller, C. Engelmann and S. Scott "Proactive Process-Level Live Migration and Back Migration in HPC Environments" Journal of Parallel and Distributed Computing , v.72 , 2012 , p.254-267
David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" poster at Supercomputing , 2011
David Fiala, Kurt Ferreira, Frank Mueller, Christian Engelmann "A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" poster at Supercomputing , 2011
D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, R. Riesen "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" TR 2012-5, Dept. of Computer Science, North Carolina State University , 2012
D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, R. Brightwell "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) , 2012
D. Fiala, K. Ferreira, F. Mueller, C. Engelmann "A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids , 2011
J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, C. Engelmann "Combining Partial Redundancy and Checkpointing for HPC" International Conference on Distributed Computing Systems , 2012
K. Kharbas, D. Kim, K. KC, T. Hoefler and F. Mueller "Failure Detection within MPI Jobs: Periodic Outperforms Sporadic" TR 2011-13, Dept. of Computer Science, North Carolina State University , 2011
K. Kharbas, D. Kim, T. Hoefler and F. Mueller "Assessing HPC Failure Detectors for MPI Jobs" Euromicro International Conference on Parallel, Distributed and Network-Based Computing , 2012
M. Vasavada, F. Mueller, P. Hargrove "Comparing different approaches for Incremental Checkpointing: The Showdown" Linux Symposium , 2011

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In High-End Computing (HEC), faults have become the norm rather than
the exception for parallel computation on clusters with 10s/100s of
thousands of cores. As the core count increases, so does the overhead
for fault-tolerant techniques relying on checkpoint/restart (C/R)
mechanisms. At 50% overheads, redundancy is a viable alternative to
fault recovery and actually scales, which makes the approach
attractive for HEC.

The objective of this work is to develop a synergistic approach by
combining C/R-based fault tolerance with redundancy in HEC
installations to achieve high levels of resilience.

This work alleviates scalability limitations of current fault-tolerant
practices. It contributes to fault modeling as well as fault detection
and recovery, significantly advancing existing techniques by
controlling levels of redundancy and checkpointing intervals in the
presence of faults. It is transformative in providing a model where
users select a target failure probability at the price of using
additional resources.

Our work shows that redundancy-based fault tolerance can be used in
synergy with checkpoint/restart-based fault tolerance to achieve
better application performance for large-scale HPC applications than
can be achieved by any of the two techniques alone, which has been
analytically modeled and experimentally confirmed.
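
The intuition can be illustrated with a back-of-the-envelope sketch
(toy machine parameters, not the project's calibrated model): compare
application efficiency under C/R alone against dual redundancy
combined with C/R, using Daly's first-order approximation of the
optimal checkpoint interval and a birthday-problem estimate of how
long a dual-redundant job survives. The node MTBF, checkpoint cost,
and node counts below are illustrative assumptions.

```python
import math

def cr_efficiency(system_mtbf_h, ckpt_h):
    """Useful-work fraction under C/R with Daly's first-order optimal interval."""
    tau = math.sqrt(2.0 * ckpt_h * system_mtbf_h)   # optimal checkpoint interval
    waste = ckpt_h / tau + (tau / 2.0 + ckpt_h) / system_mtbf_h
    return max(0.0, 1.0 - waste)

def compare(nodes, node_mtbf_h=5 * 365 * 24, ckpt_h=5 / 60):
    """Illustrative parameters: 5-year node MTBF, 5-minute checkpoints."""
    lam = 1.0 / node_mtbf_h
    # C/R alone: every node does useful work, but system MTBF shrinks as 1/nodes.
    plain = cr_efficiency(1.0 / (nodes * lam), ckpt_h)
    # Dual redundancy: half the nodes are replicas; the job fails only once BOTH
    # replicas of some rank are lost, which takes roughly sqrt(pi * pairs / 2)
    # individual node failures (birthday-problem argument).
    pairs = nodes // 2
    failures_to_kill = math.sqrt(math.pi * pairs / 2.0)
    plus_redundancy = 0.5 * cr_efficiency(failures_to_kill / (nodes * lam), ckpt_h)
    return plain, plus_redundancy

for n in (10_000, 50_000, 200_000):
    plain, dual = compare(n)
    print(f"{n:>7} nodes: C/R alone {plain:4.2f}   dual redundancy + C/R {dual:4.2f}")
```

With these toy parameters, C/R alone wins at modest scale, the two
roughly break even around 50,000 nodes, and beyond that the combined
approach retains useful efficiency while pure C/R collapses.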

We further assessed the feasibility and effectiveness of SDC detection
and correction at the MPI layer via redundancy. We developed two
consistency protocols, explored the unique challenges in creating a
deterministic MPI environment for replication purposes, investigated
the effects of fault injection into our framework, analyzed the
costs and showed the benefits of SDC protection via redundancy.
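
The core idea behind redundancy-based SDC detection can be sketched
in a few lines (a toy illustration, not the project's actual
MPI-level consistency protocols): every logical message is produced
by each replica of the sending rank, the receiver compares the
per-replica payloads (here via a hash), a mismatch exposes the
corruption, and with triple redundancy a majority vote also corrects
it.

```python
import hashlib

def detect_and_correct(replica_payloads):
    """Compare the payloads produced by each replica of a sending rank.
    With 2 replicas a mismatch can only be detected; with 3 or more,
    majority voting also corrects it and flags the corrupted replica."""
    digests = [hashlib.sha256(p).hexdigest() for p in replica_payloads]
    counts = {}
    for d in digests:
        counts[d] = counts.get(d, 0) + 1
    winner, votes = max(counts.items(), key=lambda kv: kv[1])
    if len(counts) == 1:
        return replica_payloads[0], []              # all replicas agree
    if votes > len(digests) // 2:
        corrupted = [i for i, d in enumerate(digests) if d != winner]
        return replica_payloads[digests.index(winner)], corrupted
    raise RuntimeError("silent data corruption detected, but no majority to correct it")

# Replica 1's copy of the message suffers a single bit flip in memory or in transit.
clean = bytes(range(16))
corrupted = bytearray(clean); corrupted[3] ^= 0x10; corrupted = bytes(corrupted)

print(detect_and_correct([clean, corrupted, clean]))   # triple redundancy: corrected
try:
    detect_and_correct([clean, corrupted])             # dual redundancy: detect only
except RuntimeError as err:
    print(err)
```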

We also studied Single Event Upsets (SEUs) in floating-point data. We
show that SEUs produce predictable, non-uniform errors that can be
bounded using analytical modeling of perturbed dot-products for
elementary linear algebra constructs, and by analyzing convergence
theory of first-order (stationary) iterative linear solvers.
Convergence for stationary iterative methods is provable, and the
performance impact (increased iteration count) of an SEU in data is
predictable with low error.
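
The convergence argument can be made concrete with a small experiment
(an illustrative sketch, not the project's analysis code): flip a
single bit in the iterate of a Jacobi solve on a strictly diagonally
dominant system and observe that the solver still converges, only
with additional iterations.

```python
import struct
import numpy as np

def flip_bit(x, bit):
    """Flip one bit (0..63) in the IEEE-754 representation of a float64."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", i ^ (1 << bit)))
    return y

def jacobi(A, b, inject_at=None, bit=55, tol=1e-10, max_iter=10_000):
    """Stationary Jacobi iteration; optionally inject one SEU into x[0]."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        x = (b - R @ x) / D
        if k == inject_at:
            x[0] = flip_bit(x[0], bit)      # single event upset in the iterate
        if np.linalg.norm(A @ x - b) < tol:
            return k
    return max_iter

# Strictly diagonally dominant test system, so the Jacobi iteration contracts.
rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)
A += np.diag(np.abs(A).sum(axis=1) * 2.0 + 1.0)
b = rng.standard_normal(n)

print("clean solve:         ", jacobi(A, b), "iterations")
print("SEU at iteration 20: ", jacobi(A, b, inject_at=20), "iterations")
```

With this setup the perturbed run still reaches the same residual
tolerance, just with a noticeably larger iteration count, mirroring
the predictability result described above.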


Last Modified: 11/10/2016
Modified by: Frank Mueller
