
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: Purdue University
Initial Amendment Date: June 27, 2013
Latest Amendment Date: June 27, 2013
Award Number: 1320263
Award Instrument: Standard Grant
Program Manager: Tao Li, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2013
End Date: June 30, 2017 (Estimated)
Total Intended Award Amount: $492,844.00
Total Awarded Amount to Date: $492,844.00
Recipient Sponsored Research Office: 2550 NORTHWESTERN AVE # 1100, WEST LAFAYETTE, IN 47906-1332, US; (765) 494-1055
Primary Place of Performance: IN 47907-2017, US
NSF Program(s): COMPUTER ARCHITECTURE
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
In future technology generations, smaller and more numerous transistors
operating at low supply voltages and high clock speeds will be
increasingly susceptible to many different resiliency problems, such
as soft errors, wear-out issues, hard errors, and off- and on-chip bus
bit errors. These errors may cause silent data corruption, application
aborts, or system crashes in high-performance microprocessors and
computer systems. Previous techniques for addressing these errors
incur significant performance and power overheads despite
optimizations, and often require invasive changes that incur high
implementation complexity.
In this research project, the investigators propose a novel,
light-weight, yet highly-effective architectural approach to processor
reliability that incurs much lower overheads than existing approaches
by leveraging key architectural observations about the problems.
This project's innovative approach for the detection of soft errors,
wear-out, and hard errors is based on detecting execution anomalies
that are triggered by errors, without using redundant execution. By
exploiting the notion of value locality, this project generalizes
anomalies to include unexpected values as well as conditions (e.g.,
memory access exceptions) and provides significant coverage which
includes the most problematic cases of silent data corruption. For
recovery from soft errors, the project's investigators propose a
retry-based scheme that achieves recovery without adding hardware by
reusing the processor's existing spare speculative resources. For
off-chip bus bit errors, the investigators propose a
novel bit interleaving scheme that reduces the chances of multiple
bits in a single error correcting code (ECC)-protected data unit being
corrupted undetectably or uncorrectably. Like the other schemes, this
interleaving imposes minimal power, performance, and complexity overhead.
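As a concrete illustration of the interleaving idea, the following minimal C sketch (written under assumed parameters; it is not the investigators' actual design) scatters the bits of several ECC codewords so that physically adjacent bus bits always belong to different codewords. A multi-bit burst error on the bus then corrupts at most one bit per codeword, which single-error-correcting ECC can repair.

#include <stdint.h>
#include <stdio.h>

#define NWORDS 4   /* codewords interleaved together (assumed parameter) */
#define WBITS  16  /* bits per codeword (assumed parameter) */

/* Scatter bit j of codeword i to physical bus position j*NWORDS + i, so
 * adjacent physical bits always belong to different codewords. */
static uint64_t interleave(const uint16_t w[NWORDS]) {
    uint64_t phys = 0;
    for (int i = 0; i < NWORDS; i++)
        for (int j = 0; j < WBITS; j++)
            phys |= (uint64_t)((w[i] >> j) & 1u) << (j * NWORDS + i);
    return phys;
}

static void deinterleave(uint64_t phys, uint16_t w[NWORDS]) {
    for (int i = 0; i < NWORDS; i++) {
        w[i] = 0;
        for (int j = 0; j < WBITS; j++)
            w[i] |= (uint16_t)(((phys >> (j * NWORDS + i)) & 1u) << j);
    }
}

int main(void) {
    uint16_t in[NWORDS] = {0xDEAD, 0xBEEF, 0x1234, 0xCAFE}, out[NWORDS];
    uint64_t phys = interleave(in);
    phys ^= 0xFULL << 20;  /* 4 adjacent bus bits flipped: a burst error */
    deinterleave(phys, out);
    for (int i = 0; i < NWORDS; i++)  /* each codeword differs in at most 1 bit */
        printf("word %d: %04X -> %04X\n", i, (unsigned)in[i], (unsigned)out[i]);
    return 0;
}

In this example, a 4-bit burst on the bus lands one flipped bit in each of the four codewords instead of four flipped bits in one, keeping every codeword within the correction capability of its ECC.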
This project targets achieving reliability while keeping power,
performance, and hardware overheads low, an important goal for the
U.S. microprocessor and computer hardware industry. The project's
investigators are committed to releasing the research artifacts as
open-source software to be used by the research community. The
graduate students working on this project will be trained in
architecture and reliability issues and will be well-positioned to
join the U.S. computer hardware industry. This project will also
support educational activities such as homework and term projects in
undergraduate and graduate courses as well as outreach activities of
various centers at Purdue with which the investigators are involved.
With a woman as one of the investigators, the project will act as a
basis for encouraging women to join graduate programs in electrical
and computer engineering.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Soft error susceptibility is a growing concern with continued CMOS scaling in modern, high-performance, general-purpose microprocessors. The smaller transistors and lower voltages achieved by continued scaling exacerbate the soft error problem. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads, whereas partial-redundancy schemes achieve low coverage.
Value locality is the well-known phenomenon whereby the values generated in a computation fall within small neighborhoods in the value space, rather than being spread arbitrarily over the entire value space. Previous value-locality efforts have attempted to improve performance by exploiting value locality to predict values faster than they can be computed in the presence of inherently long latencies such as cache misses. This project exploits value locality to detect soft errors, which usually perturb a value so that it falls outside its value-locality neighborhood. Such detection does not require redundancy and therefore avoids the corresponding overheads. To this end, the authors employ hardware filters that capture value-locality neighborhoods by tracking which bit positions are unchanging 0s or 1s and which positions vary; a new value that matches in the unchanging positions does not flag an error, while a mismatching one does. Upon detection, the authors leverage the modern processor pipeline's built-in ability to roll back computation to try to correct the error. Because the roll-back (i.e., redundant execution) occurs only when an error is flagged or on a false positive (about 2% of all instructions), the proposed method achieves lower power and performance overheads than previous full-redundancy approaches.
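The following minimal C sketch illustrates this kind of filter; the encoding, training policy, and the vl_filter name are assumptions chosen for illustration, not the project's exact hardware. It tracks which bit positions have stayed constant across the values seen so far and flags any new value that differs in one of those positions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t value;   /* reference bits for the positions marked constant */
    uint64_t stable;  /* bitmask: 1 = this position has never changed */
    bool     valid;
} vl_filter;

/* Flag a potential fault: v differs from the reference in a position
 * that has been constant across all values seen so far. */
static bool vl_check(const vl_filter *f, uint64_t v) {
    return f->valid && ((v ^ f->value) & f->stable) != 0;
}

/* Train the filter with a value believed to be correct. */
static void vl_learn(vl_filter *f, uint64_t v) {
    if (!f->valid) {
        f->value  = v;
        f->stable = ~0ULL;
        f->valid  = true;
    } else {
        f->stable &= ~(v ^ f->value);  /* demote positions that changed */
    }
}

int main(void) {
    vl_filter f = {0};
    const uint64_t trace[] = {0x1000, 0x1008, 0x1010};  /* e.g., strided addresses */
    for (int i = 0; i < 3; i++) vl_learn(&f, trace[i]);
    /* In-neighborhood value: differs only in positions already seen to vary. */
    printf("0x1018 flagged? %d\n", vl_check(&f, 0x1018));                     /* 0 */
    /* A soft error flipping a high bit lands outside the neighborhood. */
    printf("bit-31 flip flagged? %d\n", vl_check(&f, 0x1018 ^ (1ULL << 31))); /* 1 */
    return 0;
}

After training on the strided values 0x1000, 0x1008, and 0x1010, only bits 3 and 4 are treated as changing, so an in-neighborhood value passes silently while a single-bit upset in a high position is flagged for roll-back.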
In contrast to value prediction for performance, where the prediction has to match the actual value exactly, exploiting value locality to detect soft errors has to be accurate enough only to detect value perturbation, without predicting the value with 100% accuracy. An initial study, called Perturbation Based Fault Screening (PBFS), explored exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates and high performance and energy overheads.
The authors propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five novel mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce the false-positive rate further; (3) a light-weight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives, to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue, which the replay does not cover. Using simulations, the authors show that while PBFS achieves either low coverage (30%) or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%). Further, full-redundancy schemes scaled to achieve the same 75% coverage result in 13% performance loss and 57% energy overhead. These results hold for a broad range of benchmarks, including commercial workloads (TPC-C-like online transaction processing, SPECjbb, and the Apache web server), SPECint, SPECfp, and SPLASH.
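Of these mechanisms, (2) lends itself to a brief sketch. The following C fragment shows one plausible realization; the fh_filter layout, the counter width, and the DELINQUENT_LIMIT threshold are assumptions for illustration, not details from the project. Each bit position that participates in a confirmed false alarm accumulates blame in a saturating counter, and a repeat offender is demoted to "changing" so it stops triggering checks.

#include <stdint.h>
#include <stdio.h>

#define DELINQUENT_LIMIT 4  /* assumed saturating-counter threshold */

typedef struct {
    uint64_t value;       /* reference bits, as in the filter sketch above */
    uint64_t stable;      /* 1 = position currently treated as constant */
    uint8_t  alarms[64];  /* per-bit false-alarm counters (width assumed) */
} fh_filter;

/* Invoked after the light-weight replay proves a flagged value v was in
 * fact correct: the bits that caused the alarm accumulate blame, and
 * repeat offenders are demoted so they no longer raise alarms. */
static void fh_note_false_positive(fh_filter *f, uint64_t v) {
    uint64_t violators = (v ^ f->value) & f->stable;
    for (int b = 0; b < 64; b++) {
        if (!((violators >> b) & 1)) continue;
        if (f->alarms[b] < DELINQUENT_LIMIT) f->alarms[b]++;
        if (f->alarms[b] >= DELINQUENT_LIMIT)
            f->stable &= ~(1ULL << b);  /* ignore this delinquent bit */
    }
}

int main(void) {
    fh_filter f = {.value = 0, .stable = ~0ULL, .alarms = {0}};
    for (int k = 0; k < DELINQUENT_LIMIT; k++)
        fh_note_false_positive(&f, 1ULL << 7);  /* bit 7 keeps crying wolf */
    printf("bit 7 still checked? %d\n", (int)((f.stable >> 7) & 1));  /* 0 */
    return 0;
}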
Considering all three metrics (coverage, performance overhead, and energy overhead), FaultHound performs better than previous redundancy-based and value-locality-based schemes, which do well on only one or two of the metrics. Because of this attractive combination of features, FaultHound is likely to be important on the path of continued CMOS scaling in modern microprocessors.
Last Modified: 09/28/2017
Modified by: T. N. Vijaykumar