Award Abstract # 1527463
SHF: Small: Compiler and Architectural Techniques for Soft Error Resilience

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: VIRGINIA POLYTECHNIC INSTITUTE & STATE UNIVERSITY
Initial Amendment Date: July 1, 2015
Latest Amendment Date: May 11, 2016
Award Number: 1527463
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703)292-7498
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2015
End Date: June 30, 2018 (Estimated)
Total Intended Award Amount: $450,000.00
Total Awarded Amount to Date: $466,000.00
Funds Obligated to Date: FY 2015 = $450,000.00
FY 2016 = $16,000.00
History of Investigator:
  • Changhee Jung (Principal Investigator)
    chjung@purdue.edu
  • Dongyoon Lee (Co-Principal Investigator)
Recipient Sponsored Research Office: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
(540)231-5281
Sponsor Congressional District: 09
Primary Place of Performance: Virginia Polytechnic Institute and State University
VA  US  24061-0001
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): QDE5UHE5XD16
Parent UEI: X6KEFGLHSJX7
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7942, 9251
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Due to technology scaling, electronic circuits are becoming more susceptible to radiation-induced soft errors, also known as transient faults. Soft errors may lead to application crashes or, even worse, silent data corruptions (SDC) that are not caught by the error detection logic but may cause the application to produce incorrect output. Another serious problem is the rise of detected unrecoverable errors (DUE) that often directly impact the reliability of computer applications. The challenge is to achieve soft error resilience without significantly increasing the performance overhead, power consumption, and complexity of the underlying hardware. To this end, this project designs and develops low-cost hardware/software cooperative techniques for soft error resilience. The resulting artifacts and technologies are expected to contribute to the nation's competitiveness by addressing the challenge of building reliable computing systems in the presence of soft errors.
This research involves three intermediate research goals: design a novel microarchitecture that dynamically verifies the correctness of processor core execution based on sensor-based soft error detection, to achieve soft error resilience at low cost; design a compiler that forms verifiable and recoverable regions in the presence of soft errors and provides relevant program analysis techniques; and design and develop compiler optimization and microarchitectural techniques that significantly reduce the verification overhead. This project will create tools and technologies for the realization of soft error resilient computing systems, contributing fundamentally to the fault tolerance research community. Adoption of the resulting compiler and microarchitectural techniques will impact a broad range of disciplines that need correct computation results and thus require reliable computing systems, ranging from mobile devices to high-performance large-scale computing systems. Consequently, use of the resulting technologies will make the execution of current and emerging applications much more reliable, and therefore directly affect our way of life.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Qingrui Liu and Changhee Jung "Lightweight Hardware Support for Transparent Consistency-Aware Checkpointing in Intermittent Energy-Harvesting Systems" IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2016
Qingrui Liu, Changhee Jung, Dongyoon Lee, and Devesh Tiwari "Compiler-Directed Lightweight Checkpointing for Fine-Grained Guaranteed Soft Error Recovery" ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2016
Qingrui Liu, Changhee Jung, Dongyoon Lee, and Devesh Tiwari "Compiler Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR" ACM Transactions on Embedded Computing Systems (TECS), v.16, 2016
Qingrui Liu, Changhee Jung, Dongyoon Lee, and Devesh Tiwari "Low-Cost Soft Error Resilience with Unified Data Verification and Fine-Grained Recovery" IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016
Qingrui Liu, Xiaolong Wu, Larry Kittinger, Markus Levy, and Changhee Jung "BenchPrime: Effective Building of a Hybrid Benchmark Suite" ACM SIGBED Conference on Embedded Software (EMSOFT), 2017
Tong Zhang, Changhee Jung, and Dongyoon Lee "ProRace: Practical Data Race Detection for Production Use" International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017
Xinwei Fu, Dongyoon Lee, and Changhee Jung "nAdroid: Statically Detecting Ordering Violations in Android Applications" IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2018

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Due to technology scaling, circuits are becoming more susceptible to particle-strike induced soft errors that may cause system crash or silent data corruption (SDC). Our take on this is that the resilience must be achieved in an energy-efficient way unlike traditional hardware approaches. To this end, we have endeavored to design and implement compiler-directed soft error resilience techniques.
When physicists developed sensor based soft error detection, computer architects took it as a key technology for SDC-freedom due to the ability to sense particle strikes and proposed a coarse-grained hardware checkpointing for recovery. However, since not all the strikes lead to soft errors, we believed that a fine-grained recovery such as idempotent processing must be used to tolerate the false positives. In fact, it was an open problem how to combine sensor-based detection and idempotence-based recovery with no detected uncorrectable error (DUE). The crux of the problem is that the detection latency (sensing time) makes it possible for soft errors to escape the idempotent region where they occur. To prevent such DUEs by ensuring region-level error containment, we created tail-DMR, a selective instruction duplication scheme. The compiler identifies the DUE-vulnerable instructions at the tail of each idempotent region and duplicates them for instant error detection so that their errors are contained in the region. This research lays the groundwork for other compiler-directed error resilience schemes.
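
The tail selection described above can be illustrated with a small sketch. It is not the project's compiler pass: it assumes a region is a straight-line list of `(opcode, cycles)` pairs and that an instruction counts as DUE-vulnerable when fewer than `detection_latency` cycles remain between its completion and the region end. The names `tail_indices` and `apply_tail_dmr`, the `.dup` and `check.` opcodes, and the cycle counts are all hypothetical.

```python
def tail_indices(region, detection_latency):
    """Indices of instructions whose errors could outlive the region.

    region: list of (opcode, cycles) in program order (hypothetical format).
    Assumption: an instruction is vulnerable when fewer than
    detection_latency cycles remain between its completion and the region end.
    """
    tail = []
    remaining = 0  # cycles from the end of instruction i to the region end
    for i in range(len(region) - 1, -1, -1):
        if remaining >= detection_latency:
            break  # the sensor has time to report before the boundary
        tail.append(i)
        remaining += region[i][1]
    return sorted(tail)


def apply_tail_dmr(region, detection_latency):
    """Duplicate only the tail instructions and check each pair immediately."""
    tail = set(tail_indices(region, detection_latency))
    out = []
    for i, (op, cycles) in enumerate(region):
        out.append((op, cycles))
        if i in tail:
            out.append((op + ".dup", cycles))  # redundant copy (hypothetical opcode)
            out.append(("check." + op, 1))     # compare original and copy
    return out


region = [("load", 3), ("add", 1), ("mul", 3), ("store", 1)]
print(tail_indices(region, detection_latency=4))
print(apply_tail_dmr(region, detection_latency=4))
```

With a 4-cycle latency in this toy region, only `mul` and `store` are duplicated; `load` and `add` finish early enough for the sensor alone to cover them.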
Later, we found out that the region-level error containment can be obviated by redefining the idempotent processing from the processor’s point of view to contain the errors within the core. The key idea is to regard the committed stores of each region as unverified, holding them in a gated store queue (GSQ) until they become error-free, i.e., no sensor raises the alarm during the detection latency period after the boundary (end) of the region; a new logic called region boundary buffer (RBB) precisely gates/releases the GSQ by tracking the end time of regions. That way the processor core never merges unverified stores to cache, and the recovery can be made by flushing the GSQ on error and restarting from the most recently verified region boundary. That is, the region-level error containment is relaxed to core-level containment! The upshot is that it requires neither extra error detection such as the tail-DMR nor complex hardware modification as in prior work that replicates microarchitectural components and modifies cache coherence protocols. This research has inspired the design and use of the GSQ-driven recovery for different problem domains such as nonvolatile processors used in energy harvesting systems.
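
A behavioral sketch may help make the gating rule concrete. The toy model below is software only and says nothing about the actual hardware design: the class name `GatedStoreQueue`, its methods, and the cycle-based `now` argument are assumptions made for illustration. It holds each region's stores until `detection_latency` cycles have passed after the region boundary without an alarm, then merges them into a dictionary standing in for the cache.

```python
from collections import deque


class GatedStoreQueue:
    """Toy model: stores stay buffered until their region is verified."""

    def __init__(self, detection_latency):
        self.detection_latency = detection_latency  # assumed sensing time, in cycles
        self.pending = deque()     # (region_id, end_time, stores) of closed, unverified regions
        self.open_stores = []      # stores of the region currently executing
        self.cache = {}            # stands in for the memory hierarchy
        self.last_verified = None  # most recently verified region

    def store(self, addr, value):
        self.open_stores.append((addr, value))

    def end_region(self, region_id, now):
        # Role of the region boundary buffer: remember when the region ended.
        self.pending.append((region_id, now, self.open_stores))
        self.open_stores = []

    def tick(self, now):
        # Release every region whose detection window elapsed without an alarm.
        while self.pending and now - self.pending[0][1] >= self.detection_latency:
            region_id, _, stores = self.pending.popleft()
            for addr, value in stores:
                self.cache[addr] = value
            self.last_verified = region_id

    def alarm(self):
        # Sensor fired: drop all unverified stores; execution would resume
        # right after the region returned here.
        self.pending.clear()
        self.open_stores = []
        return self.last_verified


gsq = GatedStoreQueue(detection_latency=10)
gsq.store(0x10, 1)
gsq.end_region("R0", now=5)
gsq.tick(now=15)             # window for R0 has elapsed, its store reaches the cache
gsq.store(0x10, 2)
gsq.end_region("R1", now=20)
restart_from = gsq.alarm()   # alarm inside R1's window: its store is discarded
print(restart_from, gsq.cache)
```

Because unverified stores never reach `cache`, recovery in this model needs nothing beyond clearing the queue, which mirrors the core-level containment argument in the text.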
The irony of the original idempotent processing is that although it is to be used for lightweight soft error recovery, it requires expensive hardware support (e.g., 100% soft-error tolerant register file) and incurs significant execution time overhead. Considering that soft errors rarely occur (e.g., ≈ one per day), no one would be willing to adopt the idempotent processing for such rare error correction at the cost of paying the significant performance overhead all day. To this end, we developed two effective compiler optimization techniques, i.e., eager checkpointing and checkpoint pruning. To remove the need of the expensive hardware support, the compiler protects the register inputs of each idempotent region during their entire liveness period by eagerly checkpointing the input values right after they are defined. To minimize the checkpoint overhead, we created a novel program analysis that identifies and eliminates those checkpoints whose value can be safely reconstructed by other values of existing checkpoints. The beauty of this approach is that it shifts the runtime overhead of the soft-error free execution to that of the error recovery execution without compromising the recovery guarantee, thereby promoting the wide use of idempotent processing!
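
The pruning idea can be sketched in a few lines, under strong simplifying assumptions that are not taken from the report: values have unique (SSA-like) names, each definition is either opaque or a pure two-operand operation, and checkpointed values are never overwritten. The names `prune_checkpoints`, `recover`, and the `OPS` table are hypothetical.

```python
import operator

# Pure operations that recovery code is allowed to re-execute (assumed set).
OPS = {"add": operator.add, "mul": operator.mul}


def prune_checkpoints(defs):
    """Split definitions into kept checkpoints and recomputation recipes.

    defs: list of (name, recipe) in definition order; recipe is None for an
    opaque value or (op, src_a, src_b) for a recomputable one.
    """
    kept, recipes, known = [], {}, set()
    for name, recipe in defs:
        if recipe is not None and all(src in known for src in recipe[1:]):
            recipes[name] = recipe  # reconstructible, so no checkpoint store
        else:
            kept.append(name)       # must be checkpointed right after definition
        known.add(name)
    return kept, recipes


def recover(checkpoint, recipes, name):
    """Rebuild a value at recovery time from the surviving checkpoints."""
    if name in checkpoint:
        return checkpoint[name]
    op, a, b = recipes[name]
    return OPS[op](recover(checkpoint, recipes, a), recover(checkpoint, recipes, b))


defs = [
    ("r1", None),
    ("r2", None),
    ("r3", ("add", "r1", "r2")),
    ("r4", ("mul", "r3", "r1")),
]
kept, recipes = prune_checkpoints(defs)
checkpoint = {"r1": 2, "r2": 5}  # only the kept values are stored at run time
print(kept, recover(checkpoint, recipes, "r4"))
```

In this sketch the error-free path pays for two checkpoint stores instead of four, and the extra work of recomputing `r3` and `r4` is deferred to the rare recovery path.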

Last Modified: 07/04/2018
Modified by: Changhee Jung

Please report errors in award information by writing to: awardsearch@nsf.gov.
