
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | July 1, 2015 |
Latest Amendment Date: | May 11, 2016 |
Award Number: | 1527463 |
Award Instrument: | Standard Grant |
Program Manager: |
Almadena Chtchelkanova
achtchel@nsf.gov (703)292-7498 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2015 |
End Date: | June 30, 2018 (Estimated) |
Total Intended Award Amount: | $450,000.00 |
Total Awarded Amount to Date: | $466,000.00 |
Funds Obligated to Date: |
FY 2016 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
300 TURNER ST NW BLACKSBURG VA US 24060-3359 (540)231-5281 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
VA US 24061-0001 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation |
Primary Program Source: |
01001617DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Due to technology scaling, electronic circuits are becoming more susceptible to radiation-induced soft errors, also known as transient faults. Soft errors may lead to application crashes or, even worse, silent data corruptions (SDC) that are not caught by the error detection logic but may cause the application to produce incorrect output. Another serious problem is the rise of detected unrecoverable errors (DUE), which often directly impact the reliability of computer applications. The challenge is to achieve soft error resilience without significantly increasing the performance overhead, power consumption, or complexity of the underlying hardware. To this end, this project designs and develops low-cost hardware/software cooperative techniques for soft error resilience. The resulting artifacts and technologies are expected to contribute to the nation's competitiveness by addressing the challenge of building reliable computing systems in the presence of soft errors.
This research involves three intermediate research goals: design a novel microarchitecture that dynamically verifies the correctness of the processor core's execution, based on sensor-based soft error detection, to achieve soft error resilience at low cost; design a compiler that forms verifiable and recoverable regions in the presence of soft errors and provides the relevant program analysis techniques; and design and develop compiler optimization and microarchitectural techniques that significantly reduce the verification overhead. This project will create tools and technologies for the realization of soft-error-resilient computing systems, contributing fundamentally to the fault tolerance research community. Adoption of the resulting compiler and microarchitectural techniques will impact a broad range of disciplines that need correct computation results and thus require reliable computing systems, from mobile devices to high-performance large-scale computing systems. Consequently, use of the resulting technologies will make the execution of current and emerging applications much more reliable, and will therefore directly affect our way of life.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
When physicists developed sensor based soft error detection, computer architects took it as a key technology for SDC-freedom due to the ability to sense particle strikes and proposed a coarse-grained hardware checkpointing for recovery. However, since not all the strikes lead to soft errors, we believed that a fine-grained recovery such as idempotent processing must be used to tolerate the false positives. In fact, it was an open problem how to combine sensor-based detection and idempotence-based recovery with no detected uncorrectable error (DUE). The crux of the problem is that the detection latency (sensing time) makes it possible for soft errors to escape the idempotent region where they occur. To prevent such DUEs by ensuring region-level error containment, we created tail-DMR, a selective instruction duplication scheme. The compiler identifies the DUE-vulnerable instructions at the tail of each idempotent region and duplicates them for instant error detection so that their errors are contained in the region. This research lays the groundwork for other compiler-directed error resilience schemes.
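The tail selection described above can be illustrated with a small sketch. It is not the project's compiler pass: it assumes a region is a flat list of instructions with fixed cycle counts, and it assumes an instruction is DUE-vulnerable when it finishes fewer than `detect_latency` cycles before the region ends, so a sensor alarm could arrive after the boundary. The names `Instr`, `tail_dmr`, and `detect_latency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    cycles: int = 1  # assumed fixed latency per instruction


def tail_dmr(region, detect_latency):
    """Duplicate the instructions in the tail of a region (illustrative model).

    An instruction counts as tail (assumed DUE-vulnerable) when fewer than
    detect_latency cycles remain between its completion and the region end.
    """
    remaining = 0              # cycles between instruction i's end and region end
    tail_start = len(region)   # index of the first tail instruction
    for i in range(len(region) - 1, -1, -1):
        if remaining >= detect_latency:
            break              # the sensor would fire before the boundary
        tail_start = i
        remaining += region[i].cycles

    out = list(region[:tail_start])
    for ins in region[tail_start:]:
        out.append(ins)
        out.append(Instr(ins.name + "_dup", ins.cycles))  # redundant copy
        out.append(Instr("cmp_" + ins.name))              # compare the two results
    return out


region = [Instr(n) for n in "abcde"]
print([i.name for i in tail_dmr(region, detect_latency=2)])
# ['a', 'b', 'c', 'd', 'd_dup', 'cmp_d', 'e', 'e_dup', 'cmp_e']
```

The sketch ignores that the duplicated instructions themselves lengthen the region; a real pass would also have to decide which tail instructions can actually propagate an error past the boundary.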
Later, we found out that the region-level error containment can be obviated by redefining the idempotent processing from the processor’s point of view to contain the errors within the core. The key idea is to regard the committed stores of each region as unverified, holding them in a gated store queue (GSQ) until they become error-free, i.e., no sensor raises the alarm during the detection latency period after the boundary (end) of the region; a new logic called region boundary buffer (RBB) precisely gates/releases the GSQ by tracking the end time of regions. That way the processor core never merges unverified stores to cache, and the recovery can be made by flushing the GSQ on error and restarting from the most recently verified region boundary. That is, the region-level error containment is relaxed to core-level containment! The upshot is that it requires neither extra error detection such as the tail-DMR nor complex hardware modification as in prior work that replicates microarchitectural components and modifies cache coherence protocols. This research has inspired the design and use of the GSQ-driven recovery for different problem domains such as nonvolatile processors used in energy harvesting systems.
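The gating behavior described above can be sketched as a toy software model. This is an illustration only, not the hardware design: the class `GatedCore`, its method names, the integer time base, and the choice to release a region once exactly `detect_latency` time units have elapsed are all assumptions made for the example.

```python
from collections import deque


class GatedCore:
    """Toy model of a gated store queue (GSQ) and region boundary buffer (RBB)."""

    def __init__(self, detect_latency):
        self.detect_latency = detect_latency
        self.cache = {}      # only verified stores are merged here
        self.gsq = deque()   # (region_id, addr, value): committed but unverified
        self.rbb = deque()   # (region_id, end_time): ended regions awaiting verification
        self.region = 0      # id of the currently executing region
        self.verified = -1   # id of the most recently verified region

    def store(self, addr, value):
        self.gsq.append((self.region, addr, value))

    def end_region(self, now):
        self.rbb.append((self.region, now))
        self.region += 1

    def tick(self, now):
        """Release regions whose detection window passed without an alarm."""
        while self.rbb and now - self.rbb[0][1] >= self.detect_latency:
            rid, _ = self.rbb.popleft()
            while self.gsq and self.gsq[0][0] == rid:
                _, addr, value = self.gsq.popleft()
                self.cache[addr] = value
            self.verified = rid

    def alarm(self):
        """Sensor alarm: flush unverified stores, return the region to restart."""
        self.gsq.clear()
        self.rbb.clear()
        self.region = self.verified + 1
        return self.region


core = GatedCore(detect_latency=3)
core.store("x", 1)
core.end_region(now=10)   # region 0 ends at time 10
core.store("y", 2)        # region 1 is still running
core.tick(now=13)         # window for region 0 has elapsed: "x" reaches the cache
restart = core.alarm()    # alarm: "y" is discarded, execution restarts at region 1
```

The model assumes `tick` is called every time step before any alarm is processed, so that regions whose window has already elapsed are released first. In this model the cache never sees a store from an unverified region, which is the core-level containment property the paragraph describes.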
The irony of the original idempotent processing is that although it is to be used for lightweight soft error recovery, it requires expensive hardware support (e.g., 100% soft-error tolerant register file) and incurs significant execution time overhead. Considering that soft errors rarely occur (e.g., ≈ one per day), no one would be willing to adopt the idempotent processing for such rare error correction at the cost of paying the significant performance overhead all day. To this end, we developed two effective compiler optimization techniques, i.e., eager checkpointing and checkpoint pruning. To remove the need of the expensive hardware support, the compiler protects the register inputs of each idempotent region during their entire liveness period by eagerly checkpointing the input values right after they are defined. To minimize the checkpoint overhead, we created a novel program analysis that identifies and eliminates those checkpoints whose value can be safely reconstructed by other values of existing checkpoints. The beauty of this approach is that it shifts the runtime overhead of the soft-error free execution to that of the error recovery execution without compromising the recovery guarantee, thereby promoting the wide use of idempotent processing!
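A minimal sketch of the two optimizations follows. It is not the project's program analysis: it assumes single-assignment code, assumes every listed operation is side-effect free and can be re-executed, and uses `None` to mark values that cannot be recomputed (for example, values loaded from memory). The function names `prune_checkpoints` and `schedule` and the textual `def`/`ckpt` output are hypothetical.

```python
def prune_checkpoints(defs, region_inputs):
    """Decide which region inputs need a checkpoint.

    defs maps each variable, in definition order, to the tuple of variables
    it is computed from, or to None if it cannot be recomputed.
    Returns (kept, recipes): variables that are checkpointed, and for each
    pruned variable the operands it can be rebuilt from during recovery.
    """
    kept, recipes, recoverable = [], {}, set()
    for var, operands in defs.items():
        if var not in region_inputs:
            continue  # not a region input: never checkpointed, not recoverable
        if operands is not None and all(o in recoverable for o in operands):
            recipes[var] = operands   # rebuild at recovery time instead
        else:
            kept.append(var)          # must be checkpointed
        recoverable.add(var)
    return kept, recipes


def schedule(defs, kept):
    """Eager checkpointing: place each checkpoint right after the definition."""
    out = []
    for var in defs:
        out.append(f"def {var}")
        if var in kept:
            out.append(f"ckpt {var}")
    return out


defs = {"a": None, "b": None, "c": ("a", "b"), "d": ("c",), "t": None, "e": ("t",)}
kept, recipes = prune_checkpoints(defs, {"a", "b", "c", "d", "e"})
print(kept)     # ['a', 'b', 'e']
print(recipes)  # {'c': ('a', 'b'), 'd': ('c',)}
```

Here `c` and `d` lose their checkpoints because they can be recomputed from the checkpointed `a` and `b`, while `e` keeps its checkpoint because its operand `t` is not protected. The recomputation cost is paid only on the recovery path, which matches the overhead shift described in the paragraph.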
Last Modified: 07/04/2018
Modified by: Changhee Jung