
NSF Org: | CNS Division Of Computer and Network Systems |
Initial Amendment Date: | October 31, 2019 |
Latest Amendment Date: | March 3, 2022 |
Award Number: | 2001124 |
Award Instrument: | Continuing Grant |
Program Manager: | Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 12, 2019 |
End Date: | December 31, 2024 (Estimated) |
Total Intended Award Amount: | $406,744.00 |
Total Awarded Amount to Date: | $406,744.00 |
Funds Obligated to Date: | FY 2020 = $220,344.00; FY 2022 = $112,980.00 |
Recipient Sponsored Research Office: |
2550 NORTHWESTERN AVE # 1100 WEST LAFAYETTE IN US 47906-1332 (765)494-1055 |
Primary Place of Performance: |
155 S. Grant Street West Lafayette IN US 47907-2114 |
NSF Program(s): | CSR-Computer Systems Research, Special Projects - CNS |
Primary Program Source: | 01002021DB NSF RESEARCH & RELATED ACTIVIT 01002021DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT 01002223DB NSF RESEARCH & RELATED ACTIVIT |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Resilience is one of the key exascale research challenges in high-performance
computing (HPC). Because error rates are much higher at exascale, supercomputers
could make little progress in their computations or might generate incorrect
results due to failures, rendering exascale performance useless. The challenge
is how to achieve complete HPC resilience at exascale without increasing the
performance overhead, the power consumption, or the complexity of the underlying
hardware. To this end, this research project designs and develops low-cost
hardware/software cooperative techniques for HPC resilience in the exascale era.
This project involves four research goals: (1) low-cost soft error resilience
for CPUs, in which intelligent compiler-architecture interaction can validate the
absence of errors and perform fine-grained recovery, thus eliminating silent
data corruption (SDC); (2) compiler-directed soft error resilience for commodity
GPUs, which can remove the power-hungry error-correcting code (ECC) logic from
GPU register files without compromising their resilience; (3) lightweight
nonvolatile memory (NVM) persistence, which can mitigate the overhead of
traditional heavyweight HPC checkpointing and support whole-system persistence
for applications without irrevocable operations; and (4) low-cost timing error
resilience for aggressive voltage scaling, which maximizes energy efficiency
while guaranteeing program correctness.
The resulting artifacts and technologies are expected to contribute to the
nation's competitiveness by addressing the challenge of building reliable HPC
systems. The research outcomes apply to a broad range of disciplines that need
correct computation results and therefore require reliable computing systems,
from embedded systems to the HPC cloud. Consequently, the proposed techniques
will make the execution of current and emerging applications much more reliable
and therefore directly affect our way of life.
This research project will generate four types of data: (1) algorithms and
models; (2) software prototypes; (3) testing infrastructure, including
simulators, evaluation benchmarks, and their traces; and (4) educational
materials. All of our software tools will be open source and made available to
the public, laboratories, and industry.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH