
NSF Org: | CCF Division of Computing and Communication Foundations |
Initial Amendment Date: | September 22, 2010 |
Latest Amendment Date: | August 31, 2015 |
Award Number: | 1058779 |
Award Instrument: | Standard Grant |
Program Manager: | Almadena Chtchelkanova, achtchel@nsf.gov, (703)292-7498, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2010 |
End Date: | September 30, 2016 (Estimated) |
Total Intended Award Amount: | $376,219.00 |
Total Awarded Amount to Date: | $376,219.00 |
Recipient Sponsored Research Office: | 2601 WOLF VILLAGE WAY RALEIGH NC US 27695-0001 (919)515-2444 |
Primary Place of Performance: | 2601 WOLF VILLAGE WAY RALEIGH NC US 27695-0001 |
NSF Program(s): | HIGH-PERFORMANCE COMPUTING |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens to hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques that rely on checkpoint/restart (C/R) mechanisms. At C/R overheads of 50%, redundancy is a viable alternative to fault recovery and actually scales, which makes the approach attractive for HEC.
The objective of this work is to develop a synergistic approach by combining C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.
This work alleviates scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, and it significantly advances existing techniques by controlling levels of redundancy and checkpointing intervals in the presence of faults. It is transformative in providing a model where users select a target failure probability at the price of using additional resources.
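The trade-off between a target failure probability and additional resources can be illustrated with a minimal sketch. It is not the project's model: it assumes that nodes fail independently with a hypothetical per-node probability `p_node` over the job's runtime, and that a rank is lost only when all of its replicas fail. The function names and input values are illustrative.

```python
import math


def job_failure_probability(n_ranks, p_node, replicas):
    """Probability that at least one rank loses all of its replicas.

    Assumption (not from the award text): nodes fail independently with
    probability p_node during the job.
    """
    rank_loss = p_node ** replicas  # all replicas of one rank fail
    if rank_loss >= 1.0:
        return 1.0
    # 1 - (1 - rank_loss)^n_ranks, computed stably for tiny rank_loss.
    return -math.expm1(n_ranks * math.log1p(-rank_loss))


def min_replicas(n_ranks, p_node, target, max_replicas=8):
    """Smallest replication degree that meets the target, or None."""
    for r in range(1, max_replicas + 1):
        if job_failure_probability(n_ranks, p_node, r) <= target:
            return r
    return None


# Hypothetical inputs, chosen only for illustration.
print(min_replicas(n_ranks=100_000, p_node=1e-3, target=1e-2))
```

Under these assumptions, tightening `target` or growing `n_ranks` raises the replication degree that is needed, which is the resource cost the user pays for the selected failure probability.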
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Our work shows that redundancy-based fault tolerance can be used in
synergy with checkpoint/restart-based fault tolerance to achieve
better application performance for large-scale HPC applications than
either technique achieves alone. This result has been analytically
modeled and experimentally confirmed.
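To make the interaction between checkpointing intervals and failure rates concrete, the sketch below uses Young's classic first-order approximation for the checkpoint interval. This is a swapped-in textbook formula, not the analytical model developed in this project. How much redundancy raises the effective mean time between failures (MTBF) is not modeled here: `mtbf` is simply an input, and all numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimal time between checkpoints (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(checkpoint_cost, mtbf, interval):
    """First-order fraction of time lost to checkpoints plus rework.

    Only meaningful when interval is much smaller than mtbf; the cost
    of the restart itself is ignored.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical values: a 10-minute checkpoint, and an effective MTBF
# that is assumed to be longer when redundancy masks some failures.
for label, mtbf in (("shorter MTBF", 7_200.0), ("longer MTBF", 72_000.0)):
    tau = young_interval(600.0, mtbf)
    print(label, round(tau), round(waste_fraction(600.0, mtbf, tau), 3))
```

In this approximation, a longer effective MTBF lengthens the checkpoint interval and shrinks the C/R overhead, which is one way to see why the two techniques can complement each other.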
We further assessed the feasibility and effectiveness of detecting
and correcting silent data corruption (SDC) at the MPI layer via
redundancy. We developed two consistency protocols, explored the
unique challenges in creating a deterministic MPI environment for
replication purposes, investigated the effects of fault injection
into our framework, analyzed the costs, and showed the benefits of
SDC protection via redundancy.
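The general idea of detecting and correcting SDC through redundancy can be sketched as a vote over the copies of a message produced by replicated senders. This is an illustration only and does not reproduce the project's two consistency protocols or use any MPI API; the `vote` function and the byte-string payloads are hypothetical.

```python
import hashlib
from collections import Counter


def vote(replica_payloads):
    """Compare message copies from replicas by digest.

    Returns (status, payload):
      "ok"        - all copies agree
      "corrected" - a strict majority agrees; its payload is returned
      "detected"  - copies disagree without a majority; payload is None
    """
    if not replica_payloads:
        raise ValueError("need at least one replica payload")
    digests = [hashlib.sha256(p).digest() for p in replica_payloads]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes == len(digests):
        return "ok", replica_payloads[0]
    if votes > len(digests) / 2:
        return "corrected", replica_payloads[digests.index(winner)]
    return "detected", None


# Two replicas can only detect a mismatch; three can outvote one bad copy.
print(vote([b"result", b"resulT"]))
print(vote([b"result", b"result", b"resulT"]))
```

Comparing digests instead of full payloads is a design choice that trades a small chance of hash collision for less data exchanged between replicas.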
We also studied Single Event Upsets (SEUs) in floating-point data. We
show that SEUs produce predictable, non-uniform errors that can be
bounded using analytical modeling of perturbed dot-products for
elementary linear algebra constructs, and by analyzing convergence
theory of first-order (stationary) iterative linear solvers.
Convergence for stationary iterative methods is provable, and the
performance impact (increased iteration count) of an SEU in data is
predictable with low error.
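A small experiment in the spirit of this study can be sketched as follows: flip one bit of an IEEE 754 double inside a Jacobi iteration (a stationary method) and compare iteration counts. The matrix, the flipped bit, and the injection point are hypothetical, and the code is a sketch rather than the project's fault-injection framework. The flip is restricted to a mantissa bit so the perturbed value stays finite.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit (0 = least significant) of its 64-bit
    IEEE 754 representation inverted."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def jacobi(A, b, tol=1e-10, max_iter=10_000, flip=None):
    """Jacobi iteration for Ax = b, starting from the zero vector.

    flip, if given, is (iteration, index, bit): a single simulated SEU
    applied to one entry of the iterate.
    Returns (solution, iterations used).
    """
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if flip is not None and flip[0] == k:
            _, idx, bit = flip
            x_new[idx] = flip_bit(x_new[idx], bit)
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("Jacobi did not converge")


# Hypothetical, strictly diagonally dominant system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(jacobi(A, b))
print(jacobi(A, b, flip=(3, 0, 51)))  # flip the top mantissa bit
```

Because Jacobi converges from any starting vector for a strictly diagonally dominant matrix, a finite perturbation of the iterate acts like a restart from a different point: the method still converges, and only the iteration count can change.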
Last Modified: 11/10/2016
Modified by: Frank Mueller