Award Abstract # 1320263
SHF: Small: Light-weight Architectural Schemes for Resilient High-performance Microprocessors

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: PURDUE UNIVERSITY
Initial Amendment Date: June 27, 2013
Latest Amendment Date: June 27, 2013
Award Number: 1320263
Award Instrument: Standard Grant
Program Manager: Tao Li
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2013
End Date: June 30, 2017 (Estimated)
Total Intended Award Amount: $492,844.00
Total Awarded Amount to Date: $492,844.00
Funds Obligated to Date: FY 2013 = $492,844.00
History of Investigator:
  • Terani Vijaykumar (Principal Investigator)
  • Irith Pomeranz (Co-Principal Investigator)
Recipient Sponsored Research Office: Purdue University
2550 NORTHWESTERN AVE # 1100
WEST LAFAYETTE
IN  US  47906-1332
(765)494-1055
Sponsor Congressional District: 04
Primary Place of Performance: Purdue University
IN  US  47907-2017
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): YRXVL4JYCEF5
Parent UEI: YRXVL4JYCEF5
NSF Program(s): COMPUTER ARCHITECTURE
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7941
Program Element Code(s): 794100
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In future technology generations, smaller and more numerous transistors
operating at low supply voltages and high clock speeds will be
increasingly susceptible to many different resiliency problems, such
as soft errors, wear-out issues, hard errors, and off- and on-chip bus
bit errors. These errors may cause silent data corruption, application
aborts, or system crashes in high-performance microprocessors and
computer systems. Previous techniques for addressing these errors
incur significant performance and power overheads despite
optimizations, and often require invasive changes that incur high
implementation complexity.

In this research project, the investigators propose a novel,
light-weight, yet highly-effective architectural approach to processor
reliability that incurs much lower overheads than existing approaches
by leveraging key architectural observations about the problems.

This project's innovative approach for the detection of soft errors,
wear-out, and hard errors is based on detecting execution anomalies
that are triggered by errors, without using redundant execution. By
exploiting the notion of value locality, this project generalizes
anomalies to include unexpected values as well as conditions (e.g.,
memory access exceptions) and provides significant coverage which
includes the most problematic cases of silent data corruption. For
recovery from soft errors, the project's investigators propose a
retry-based scheme that avoids adding any hardware overhead to achieve
recovery by using existing spare speculative resources in the
processor. For off-chip bus bit errors, the investigators propose a
novel bit interleaving scheme that reduces the chances of multiple
bits in a single error correcting code (ECC)-protected data unit being
corrupted undetectably or uncorrectably. Like the other schemes, this
interleaving imposes minimal power, performance, and complexity overhead.
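As a rough illustration of the interleaving idea (the layout, word count, and bit width below are hypothetical parameters for exposition, not the proposed scheme's actual design), bits from several ECC-protected words can be physically interleaved so that a burst of adjacent bit flips lands in different words, each of which then sees only a single error that a SEC-DED code could correct:

```python
WORDS = 4  # number of ECC-protected data units sharing the bus (illustrative)
BITS = 8   # bits per data unit (illustrative)

def interleave(words):
    """Place bit b of word w at physical position b * WORDS + w."""
    phys = 0
    for w, word in enumerate(words):
        for b in range(BITS):
            if (word >> b) & 1:
                phys |= 1 << (b * WORDS + w)
    return phys

def deinterleave(phys):
    """Recover the logical words from the interleaved physical layout."""
    words = [0] * WORDS
    for w in range(WORDS):
        for b in range(BITS):
            if (phys >> (b * WORDS + w)) & 1:
                words[w] |= 1 << b
    return words

# A burst flipping 4 adjacent physical bits (positions 8..11) hits
# bit 2 of each of the four words: one correctable error per word,
# rather than 4 uncorrectable errors concentrated in one word.
words = [0xA5, 0x5A, 0xFF, 0x00]
burst = 0b1111 << 8
corrupted = deinterleave(interleave(words) ^ burst)
for orig, corr in zip(words, corrupted):
    assert bin(orig ^ corr).count("1") == 1
```

Without interleaving, the same 4-bit burst would fall entirely within one word, exceeding the correction (and possibly detection) capability of its ECC.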

This project targets achieving reliability while keeping power,
performance, and hardware overheads low, an important goal for the
U.S. microprocessor and computer hardware industry. The project's
investigators are committed to releasing the research artifacts as
open-source software to be used by the research community. The
graduate students working on this project will be trained in
architecture and reliability issues and will be well-positioned to
join the U.S. computer hardware industry. This project will also
support educational activities such as homework and term projects in
undergraduate and graduate courses as well as outreach activities of
various centers at Purdue with which the investigators are involved.
With a woman as one of the investigators, the project will act as a
basis for encouraging women to join graduate programs in electrical
and computer engineering.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Irith Pomeranz, "FOLD: Extreme Static Test Compaction by Folding of Functional Test Sequences," ACM Transactions on Design Automation, v.20, 2015, p.57:1
Irith Pomeranz, "Clock Sequences for Increasing the Fault Coverage of Functional Test Sequences," IEEE Transactions on Computer-Aided Design, 2017, p.1231
Irith Pomeranz, "A Generalized Definition of Unnecessary Test Vectors in Functional Test Sequences," ACM Transactions on Design Automation, v.20, 2015, p.29:1
Irith Pomeranz, "Modeling a Set of Functional Test Sequences as a Single Sequence for Test Compaction," IEEE Transactions on VLSI Systems, 2015, p.2629
Irith Pomeranz, "Sequential Test Generation Based on Preferred Primary Input Cubes," IEEE Transactions on Computer-Aided Design, 2017, p.351
Irith Pomeranz, "Static Test Compaction for Functional Test Sequences with Restoration of Functional Switching Activity," IEEE Transactions on Computer-Aided Design, 2016, p.1755
Irith Pomeranz, "Test Compaction by Sharing of Functional Test Sequences among Logic Blocks," IEEE Transactions on VLSI Systems, 2015, p.3006
Irith Pomeranz, "Test Vector Omission for Fault Coverage Improvement of Functional Test Sequences," IEEE Transactions on Computers, 2015, p.3317
Irith Pomeranz, "Test Vector Omission with Minimal Sets of Simulated Faults," VLSI Test Symposium, 2015
Irith Pomeranz, "Two-Dimensional Static Test Compaction for Functional Test Sequences," IEEE Transactions on Computers, 2015, p.3009

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Soft error susceptibility is a growing concern with continued CMOS scaling in modern, high-performance, general-purpose microprocessors. The smaller transistors and lower voltages achieved by continued scaling exacerbate the soft error problem. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads, whereas partial-redundancy schemes achieve low coverage.

Value locality is the well-known phenomenon whereby the values generated in a computation fall within small neighborhoods of the value space rather than being spread arbitrarily across it. Previous value-locality efforts have attempted to improve performance by exploiting value locality to predict values faster than they can be computed, hiding inherent long latencies such as cache misses. This project instead exploits value locality to detect soft errors, which would usually perturb a value so that it falls outside the value locality neighborhoods. Such detection does not require redundancy and therefore avoids the corresponding overheads. To this end, the authors employ hardware filters that capture value locality neighborhoods by recording which bit positions hold unchanging 0's or 1's and which positions change, so that a new value matching in the unchanging positions does not flag an error, while a mismatching value does. Upon detection, the authors leverage the modern processor pipeline's built-in ability to roll back computation to try to correct the error. Because the roll-back (i.e., redundant execution) occurs only on a flagged error or a false positive (about 2% of all instructions), the proposed method achieves lower power and performance overheads than previous full-redundancy approaches.
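The filter mechanism described above can be sketched in software as follows; the class name, bit width, and the choice to demote a flagged bit immediately are illustrative assumptions, not details of the actual hardware design:

```python
class ValueFilter:
    """Sketch of a value-locality filter: track which bit positions have
    stayed constant across observed values; a new value that flips a
    'stable' bit raises a soft-error hint."""

    def __init__(self, width=32):
        self.first = None                  # first value observed (training)
        self.stable = (1 << width) - 1     # mask of bits unchanged so far

    def observe(self, value):
        """Return True if `value` perturbs a so-far-constant bit position."""
        if self.first is None:
            self.first = value
            return False                   # nothing to compare against yet
        changed = (value ^ self.first) & self.stable
        if changed:
            # Flag the anomaly, then learn: those positions are no longer
            # considered stable, so legitimate variation stops alarming.
            self.stable &= ~changed
            return True
        return False

f = ValueFilter(width=8)
assert f.observe(0b00010100) is False   # first value: trains the filter
assert f.observe(0b00010110) is True    # bit 1 flips a stable position: flagged
assert f.observe(0b00010100) is False   # bit 1 now unstable: no alarm
```

In hardware, the recovery path on a flag would be the pipeline roll-back described above rather than any software action; a flag that turns out to be benign is exactly the false-positive case whose cost the project works to minimize.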

In contrast to value prediction for performance, where the prediction has to exactly match the actual value, exploiting value locality to detect soft errors only has to be accurate enough to detect value perturbation, without predicting the value with 100% accuracy. An initial study, called Perturbation Based Fault Screening (PBFS), explored exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates and high performance and energy overheads.

The authors propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five novel mechanisms to address PBFS's limitations: (1) a scheme that clusters the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rate; (2) a learning scheme that ignores the delinquent bit positions that raise repeated false alarms, further reducing the false-positive rate; (3) a light-weight predecessor replay scheme, instead of a full rollback, to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme that distinguishes rename faults, which require rollback instead of replay for recovery, from false positives, to avoid unnecessary rollback penalty; and (5) a detection scheme that avoids rollback for the load-store queue, which is not covered by the replay. Using simulations, the authors show that while PBFS achieves either low coverage (30%) or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%). Further, full-redundancy schemes scaled to achieve the same 75% coverage incur a 13% performance loss and 57% energy overhead. These results hold across a broad range of benchmarks, including commercial workloads (TPC-C-like online transaction processing, SPECjbb, and the Apache web server), SPECint, SPECfp, and SPLASH.
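Mechanism (2), learning to ignore delinquent bit positions, can be sketched as a small per-bit saturating counter that permanently demotes a position after repeated false alarms; the class name, counter threshold, and interface here are hypothetical illustrations, not the paper's hardware structure:

```python
THRESHOLD = 3  # false alarms tolerated before a bit is demoted (illustrative)

class DelinquencyTracker:
    """Sketch: count false alarms per bit position; once a position
    proves delinquent, exclude it from future anomaly checks."""

    def __init__(self, width=8):
        self.counts = [0] * width
        self.ignored = 0               # bitmask of demoted positions

    def false_alarm(self, bit):
        """Record that a flag raised on `bit` was a false positive."""
        if (self.ignored >> bit) & 1:
            return
        self.counts[bit] += 1
        if self.counts[bit] >= THRESHOLD:
            self.ignored |= 1 << bit   # stop checking this delinquent bit

    def should_check(self, bit):
        return not (self.ignored >> bit) & 1

t = DelinquencyTracker()
for _ in range(3):
    t.false_alarm(5)                   # bit 5 keeps raising false alarms
assert t.should_check(5) is False      # demoted after THRESHOLD alarms
assert t.should_check(0) is True       # well-behaved bits stay checked
```

The trade-off this mechanism navigates is the one quantified in the results above: every demoted bit slightly lowers coverage but removes a recurring source of false-positive replay cost.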

Considering all three metrics of coverage, performance overhead, and energy overhead, FaultHound performs better than previous redundancy-based and value-locality-based schemes, which perform well on only one or two of these metrics. Because of this attractive combination of features, FaultHound is likely to remain important as CMOS scaling continues in modern microprocessors.


Last Modified: 09/28/2017
Modified by: T. N Vijaykumar
