Award Abstract # 1717532
SHF: Small: Enabling and Analyzing Accuracy-aware Reliable GPU Computing

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: COLLEGE OF WILLIAM AND MARY
Initial Amendment Date: July 7, 2017
Latest Amendment Date: July 7, 2017
Award Number: 1717532
Award Instrument: Standard Grant
Program Manager: Yuanyuan Yang
  CCF Division of Computing and Communication Foundations
  CSE Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2017
End Date: July 31, 2021 (Estimated)
Total Intended Award Amount: $449,999.00
Total Awarded Amount to Date: $449,999.00
Funds Obligated to Date: FY 2017 = $449,999.00
History of Investigator:
  • Adwait Jog (Principal Investigator)
    ajog@virginia.edu
  • Evgenia Smirni (Co-Principal Investigator)
Recipient Sponsored Research Office: College of William and Mary
1314 S MOUNT VERNON AVE
WILLIAMSBURG
VA  US  23185
(757)221-3965
Sponsor Congressional District: 08
Primary Place of Performance: College of William and Mary
VA  US  23187-8795
Primary Place of Performance Congressional District: 01
Unique Entity Identifier (UEI): EVWJPCY6AD97
Parent UEI: EVWJPCY6AD97
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7941
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Graphics Processing Units (GPUs) are becoming the default choice for general-purpose hardware acceleration because they enable orders-of-magnitude faster and more energy-efficient execution of large-scale high-performance computing (HPC) applications. Since the majority of applications executing on large-scale HPC systems are long-running, it is very important that they cope with a variety of hardware- and software-based faults. Many prior works have shown that real HPC systems are vulnerable to soft errors. An absence of essential protection and checkpointing mechanisms can lead to lower scientific productivity, reduced operational efficiency, and even monetary loss. However, these protection mechanisms (e.g., error correction codes) are themselves not free: they incur very high performance, energy, and area costs.

This project takes a holistic approach to reducing these protection overheads by taking advantage of the fact that not all errors lead to an unacceptable loss in the accuracy of application output. Prior results show that GPGPU applications are amenable to such accuracy-aware optimizations. To enable these optimizations, this project will address three major research questions: a) What hardware/software support and tools are necessary to determine which instructions are not vulnerable to soft errors? b) Based on this analysis, which hardware component(s) need not be protected, and for how long, without sacrificing application quality beyond the user's quality requirements? c) What optimizations in resource management and scheduling are necessary to make low-overhead but reliable computation more effective and efficient? These questions will be explored via a variety of GPGPU applications emerging from the areas of high-performance computing (HPC), big-data analytics, machine learning, and graphics. If successful, this project will generate several novel research insights that will play an important role in enabling low-cost reliable GPU computing. The results of this project will be integrated into existing and new undergraduate and graduate courses on computer architecture and reliability, which will help train students, including women and students from diverse backgrounds and minority groups.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Alter, Jacob and Xue, Ji and Dimnaku, Alma and Smirni, Evgenia "SSD failures in the field: symptoms, causes, and prediction models" Supercomputing 2019, 2019 https://doi.org/10.1145/3295500.3356172
Kadam, Gurunath and Smirni, Evgenia and Jog, Adwait "Data-centric Reliability Management in GPUs" 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2021 https://doi.org/10.1109/DSN48987.2021.00040
Kadam, Gurunath and Zhang, Danfeng and Jog, Adwait "BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs" IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020 https://doi.org/10.1109/HPCA47549.2020.00053
Kadam, Gurunath and Zhang, Danfeng and Jog, Adwait "RCoal: Mitigating GPU Timing Attack via Subwarp-Based Randomized Coalescing Techniques" 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018 https://doi.org/10.1109/HPCA.2018.00023
Nie, Bin and Jog, Adwait and Smirni, Evgenia "Characterizing Accuracy-Aware Resilience of GPGPU Applications" 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020 https://doi.org/10.1109/CCGrid49817.2020.00-82
Nie, Bin and Xue, Ji and Gupta, Saurabh and Patel, Tirthak and Engelmann, Christian and Smirni, Evgenia and Tiwari, Devesh "Machine Learning Models for GPU Error Prediction in a Large Scale HPC System" 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018 https://doi.org/10.1109/DSN.2018.00022
Nie, Bin and Yang, Lishan and Jog, Adwait and Smirni, Evgenia "Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications" 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018 https://doi.org/10.1109/MICRO.2018.00066
Yang, Lishan and Nie, Bin and Jog, Adwait and Smirni, Evgenia "Enabling Software Resilience in GPGPU Applications via Partial Thread Protection" 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021 https://doi.org/10.1109/ICSE43902.2021.00114
Yang, Lishan and Nie, Bin and Jog, Adwait and Smirni, Evgenia "Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-Bit Faults" IEEE Transactions on Computers, v.70, 2021 https://doi.org/10.1109/TC.2020.2980541
Yang, Lishan and Nie, Bin and Jog, Adwait and Smirni, Evgenia "SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing" Proceedings of the ACM on Measurement and Analysis of Computing Systems, v.5, 2021 https://doi.org/10.1145/3447375

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

 

Intellectual Merit
 
Graphics Processing Units (GPUs) have become an important part of almost all computing systems. In data centers especially, GPUs execute large-scale, long-running jobs, so it is important that they cope with a variety of hardware- and software-based faults. To address this challenging problem, this project took a holistic approach to characterizing, estimating, and improving the reliability of GPUs. Our initial work, which appeared at CCGrid 2020, focused on characterizing the accuracy-aware resilience of GPGPU applications. In our SIGMETRICS 2021, TC 2021, and MICRO 2018 papers, we developed fast and accurate techniques to estimate resilience, making the reliability analysis of GPGPU applications more practical. Based on these insights, we also developed low-cost software and hardware mechanisms to improve the reliability of GPUs. The software mechanisms appeared at ICSE 2021 and the hardware mechanisms appeared at DSN 2021. Interestingly, we also found that some of these insights (e.g., data replication management for improving reliability) also improve GPU security, another dimension of GPU robustness. These GPU security works appeared at HPCA 2020 and HPCA 2018.
 
Broader Impacts
 
Our work developed several novel techniques, which we believe can be useful for developing reliable and robust systems. Almost all of our work is published in top-tier architecture, systems, and reliability venues. This project involved several Ph.D. students (including three women), two of whom graduated and joined industry. On topics related to this grant, PI Jog and co-PI Smirni gave talks at several places (e.g., UC Merced, Monash U.). In terms of outreach, Jog co-chaired the GPGPU workshop twice. Co-PI Smirni served as program co-chair for DSN 2017, ICPE 2017, HPDC 2019, and SRDS 2019, and as Cloud Computing and Data Center chair for ICDCS 2021. Smirni also gave two keynote talks, at ICPE 2019 and QEST 2020.

Last Modified: 10/01/2021
Modified by: Adwait Jog

Please report errors in award information by writing to: awardsearch@nsf.gov.
