Award Abstract # 2001124
CAREER: Rethinking HPC Resilience in the Exascale Era

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: PURDUE UNIVERSITY
Initial Amendment Date: October 31, 2019
Latest Amendment Date: March 3, 2022
Award Number: 2001124
Award Instrument: Continuing Grant
Program Manager: Marilyn McClure
mmcclure@nsf.gov
 (703)292-5197
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 12, 2019
End Date: December 31, 2024 (Estimated)
Total Intended Award Amount: $406,744.00
Total Awarded Amount to Date: $406,744.00
Funds Obligated to Date: FY 2019 = $73,420.00
FY 2020 = $220,344.00
FY 2022 = $112,980.00
History of Investigator:
  • Changhee Jung (Principal Investigator)
Recipient Sponsored Research Office: Purdue University
2550 NORTHWESTERN AVE # 1100
WEST LAFAYETTE
IN  US  47906-1332
(765)494-1055
Sponsor Congressional District: 04
Primary Place of Performance: Purdue University
155 S. Grant Street
West Lafayette
IN  US  47907-2114
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): YRXVL4JYCEF5
Parent UEI: YRXVL4JYCEF5
NSF Program(s): CSR-Computer Systems Research,
Special Projects - CNS
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
01002223DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 735400, 171400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Resilience is one of the key exascale research challenges in high-performance
computing (HPC). Due to much higher error rates, exascale supercomputers could
make little progress in computations, or might generate incorrect results due to
failures, rendering the exascale performance useless. The
challenge is how to achieve complete HPC resilience at exascale in a way that
does not increase the performance overhead, the power consumption, or the
complexity of the underlying hardware. To this end, this research project designs
and develops low-cost hardware/software cooperative techniques for HPC
resilience in the exascale era.

This project involves four research goals: (1) low-cost soft error resilience
for CPUs; intelligent compiler-architecture interaction can validate the absence
of errors and perform fine-grained recovery, thus eliminating silent data
corruption (SDC); (2) compiler-directed soft error resilience for commodity
GPUs; it can remove the power-hungry error-correcting code (ECC) logic from the
GPU register files without compromising their resilience; (3) lightweight
nonvolatile memory (NVM) persistence; it can mitigate the overhead of
traditional heavyweight HPC checkpointing and support whole-system persistence
for applications without irrevocable operations; and (4) low-cost timing error
resilience for aggressive voltage scaling, to maximize energy efficiency while
guaranteeing program correctness.

The resulting artifacts and technologies are expected to contribute to the
nation's competitiveness by addressing the challenge of building reliable HPC
systems. The research outcome impacts a broad range of disciplines that
need correct computation results and thus require reliable computing systems,
ranging from embedded systems to the HPC cloud. Consequently, use of the
proposed techniques will make the execution of current and emerging
applications much more reliable, and therefore directly affect our way of life.

There will be four types of data generated from this research project: (1)
algorithms and models, (2) software prototypes, (3) testing infrastructure
including simulators and evaluation benchmarks and their traces, and (4)
educational materials. All of our software tools will be open source and made
available to the public, laboratories, and industry.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 16)
Zhang, Tong and Lee, Dongyoon and Jung, Changhee "BOGO: Buy Spatial Memory Safety, Get Temporal Memory Safety (Almost) Free" Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019 https://doi.org/10.1145/3297858.3304017
Zeng, Jianping and Kim, Hongjune and Lee, Jaejin and Jung, Changhee "Turnpike: Lightweight Soft Error Resilience for In-Order Cores" 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021 https://doi.org/10.1145/3466752.3480042
Zeng, Jianping and Jeong, Jungi and Jung, Changhee "Persistent Processor Architecture" IEEE/ACM International Symposium on Microarchitecture, 2023 https://doi.org/10.1145/3613424.3623772
Zeng, Jianping and Choi, Jongouk and Fu, Xinwei and Shreepathi, Ajay Paddayuru and Lee, Dongyoon and Min, Changwoo and Jung, Changhee "ReplayCache: Enabling Volatile Caches for Energy Harvesting Systems" 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021 https://doi.org/10.1145/3466752.3480102
Kim, Hongjune and Zeng, Jianping and Liu, Qingrui and Abdel-Majeed, Mohammad and Lee, Jaejin and Jung, Changhee "Compiler-directed soft error resilience for lightweight GPU register file protection" 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 20), 2020 https://doi.org/10.1145/3385412.3386033
Jeong, Jungi and Zeng, Jianping and Jung, Changhee "Capri: Compiler and Architecture Support for Whole-System Persistence" Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022 https://doi.org/10.1145/3502181.3531474
Jeong, Jungi and Jung, Changhee "PMEM-spec: persistent memory speculation (strict persistency can trump relaxed persistency)" 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021 https://doi.org/10.1145/3445814.3446698
Jeong, Jungi and Hong, Jaewan and Maeng, Seungryoul and Jung, Changhee and Kwon, Youngjin "Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System" 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020 https://doi.org/10.1109/MICRO50266.2020.00051
Choi, Jongouk and Zeng, Jianping and Lee, Dongyoon and Min, Changwoo and Jung, Changhee "Write-Light Cache for Energy Harvesting Systems" Annual International Symposium on Computer Architecture, 2023 https://doi.org/10.1145/3579371.3589098
Choi, Jongouk and Liu, Qingrui and Jung, Changhee "CoSpec: Compiler Directed Speculative Intermittent Computation" 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019 https://doi.org/10.1145/3352460.3358279
Choi, Jongouk and Kittinger, Larry and Liu, Qingrui and Jung, Changhee "Compiler-Directed High-Performance Intermittent Computation with Power Failure Immunity" IEEE 28th Real-Time and Embedded Technology and Applications Symposium, 2022 https://doi.org/10.1109/RTAS54340.2022.00012