NSF Award Search: Award # 1617488 - SHF: Small: Collaborative Research: ALETHEIA: A Framework for Automatic Detection/Correction of Corruptions in Extreme Scale Scientific Executions

Award Abstract # 1617488

SHF: Small: Collaborative Research: ALETHEIA: A Framework for Automatic Detection/Correction of Corruptions in Extreme Scale Scientific Executions

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	UNIVERSITY OF ILLINOIS
Initial Amendment Date:	June 3, 2016
Latest Amendment Date:	June 3, 2016
Award Number:	1617488
Award Instrument:	Standard Grant
Program Manager:	Almadena Chtchelkanova achtchel@nsf.gov (703)292-7498 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering
Start Date:	June 15, 2016
End Date:	May 31, 2021 (Estimated)
Total Intended Award Amount:	$250,000.00
Total Awarded Amount to Date:	$250,000.00
Funds Obligated to Date:	FY 2016 = $250,000.00
History of Investigator:	Marc Snir (Principal Investigator) snir@illinois.edu Franck Cappello (Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Illinois at Urbana-Champaign 506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187
Sponsor Congressional District:	13
Primary Place of Performance:	University of Illinois at Urbana-Champaign IL US 61801-2302
Primary Place of Performance Congressional District:	13
Unique Entity Identifier (UEI):	Y8CWNJRCNN91
Parent UEI:	V2PHZ2CSCH63
NSF Program(s):	Software & Hardware Foundation
Primary Program Source:	01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7923, 7942
Program Element Code(s):	779800
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Trusting scientific applications requires guaranteeing the validity of computed results. Unfortunately, many examples of scientific computations have led to incorrect results, sometimes with catastrophic consequences. Currently known validation techniques cover only a fraction of the possible corruptions that numerical simulation and data analytics applications may suffer during execution. As science processes grow in size and complexity, the reliability and validity of their constituent steps are increasingly difficult to ascertain. Assessing validity in the presence of potential data corruptions is a serious and insufficiently recognized problem. Corruption may occur at all levels of computing, from the hardware to the application. An important aspect of these corruptions is that until they are discovered, all executions are at risk of being corrupted silently. In some documented cases, months have elapsed between the discovery of a corruption and notification to users. In the meantime, a potentially large number of executions may be corrupted, and incorrect conclusions may result. It may be difficult, after the fact, to check whether executions have actually been corrupted or not, so that even if corruptions do not lead to mistakes, they may lead to significant productivity losses. Virtually all simulations producing very large results need to reduce their data volume in some way before saving it; one such technique is lossy compression.
This project strives to validate the end result of the simulation coupled with lossy compression. This approach is useful for scientific simulations in such diverse areas as climate, cosmology, fluid dynamics, weather, and astrophysics, which are the drivers of this project.
This collaborative project applies the principle of an external algorithmic observer (EAO), where the product of a scientific application is compared with that of a surrogate function of much lower complexity. Corruptions are corrected using a variation of triple modular redundancy: if a corruption is detected, a second surrogate function is executed, and the correct value is chosen from the two results that are most in agreement. This new online detection/correction approach involves approximate comparison of the lossy compressed results of the scientific application and the surrogate function. The project explores the detection performance of surrogate functions, lossy compressors, and approximate comparison techniques. The project also explores how to select the surrogate, lossy compression, and approximate functions to optimize objectives and constraints set by the users. The evaluation considers a set of five applications spanning different computational methods, producing large datasets with I/O bottlenecks, and covering a variety of science problem domains relevant to the NSF.
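The detect-then-vote scheme described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the use of the maximum pointwise difference as the approximate comparison, and the single `tolerance` parameter are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest pointwise difference between two equally sized results."""
    return max(abs(x - y) for x, y in zip(a, b))


def detect_and_correct(
    primary: Sequence[float],
    run_surrogate_1: Callable[[], Sequence[float]],
    run_surrogate_2: Callable[[], Sequence[float]],
    tolerance: float,
) -> Tuple[List[float], bool]:
    """Return (accepted_result, corruption_detected).

    The primary result is accepted when it approximately matches the first
    surrogate. Otherwise the second surrogate is run, and a result is taken
    from the pair of results that agree most closely.
    """
    s1 = run_surrogate_1()
    if max_abs_diff(primary, s1) <= tolerance:
        return list(primary), False

    # Disagreement: run the second surrogate only now, then vote.
    s2 = run_surrogate_2()
    candidates = {"primary": primary, "s1": s1, "s2": s2}
    pairs = [("primary", "s1"), ("primary", "s2"), ("s1", "s2")]
    closest = min(
        pairs, key=lambda p: max_abs_diff(candidates[p[0]], candidates[p[1]])
    )
    # Assumed policy: prefer the primary result when it is in the closest pair.
    chosen = "primary" if "primary" in closest else closest[0]
    return list(candidates[chosen]), True
```

Deferring the second surrogate until a mismatch is seen keeps the fault-free cost to one surrogate run, which is the main difference from classic triple modular redundancy, where all three copies always execute.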
In addition to serving the needs of scientists working in the fields listed above, this project will enhance the research experience of undergraduate students. A summer school focused on resilience is planned for summer 2016, and corruption detection/correction will be a major topic. The project is also organizing tutorials in major science conferences that include online detection/correction of numerical simulations.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

J. Tian, C. Rivera, S. Di, J. Chen, X. Liang, D. Tao, and F. Cappello, "Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures," IPDPS, 2021

Chen Wang, Nikoli Dryden, Franck Cappello, and Marc Snir, "Neural Network Based Silent Error Detector," 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018, 10.1109/CLUSTER.2018.00035

J. Tian, S. Di, X. Yu, C. Rivera, K. Zhao, S. Jin, Y. Feng, X. Liang, D. Tao, and F. Cappello, "Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs," IEEE Cluster, 2021

Omer Subasi, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Franck Cappello, and Sriram Krishnamoorthy, "Exploring the Capabilities of Support Vector Machines in Detecting Silent Data Corruptions," Sustainable Computing journal, Elsevier, 2018, https://doi.org/10.1016/j.suscom.2018.01.004

R. Underwood, V. Malvoso, J. C. Calhoun, S. Di, and F. Cappello, "Productive and Performant Generic Lossy Data Compression with LibPressio," DRBSD workshop at IEEE/ACM SC2021, 2021

S. Li, S. Di, K. Zhao, X. Liang, Z. Chen, and F. Cappello, "Resilient Error-Bounded Lossy Compressor for Data Transfer," Supercomputing, 2021

S. Li, S. Di, K. Zhao, X. Liang, Z. Chen, and F. Cappello, "Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression," IEEE Cluster, 2020

Y. Liu, S. Di, K. Zhao, S. Jin, C. Wang, K. Chard, D. Tao, I. Foster, and F. Cappello, "Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets," DRBSD workshop at IEEE/ACM SC2021, 2021

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Results of HPC executions can be affected by two types of errors: uncontrolled ones (silent data corruption) and controlled ones (lossy compression). This project investigates the detection and impact of these different errors and the challenge raised by their potential concurrent occurrence.

Silent data corruption: Computer circuitry continues to become denser as gate sizes shrink and the number of layers grows. With denser circuits, it is increasingly hard to prevent transient hardware errors. Furthermore, the power consumption of circuits and their manufacturing costs can be reduced if more frequent errors can be tolerated.

The standard practice for error recovery in HPC is to store periodic checkpoints and then restart from the last checkpoint if an error is detected. It is critical that errors be detected before a corrupted checkpoint is stored. This motivates our research: Can silent errors (i.e., hardware errors not detected by hardware) be detected by software?

The approach we took is to focus on single bit flips (reasonable, if errors are rare) and on iterative numerical algorithms. A bit flip may have a negligible impact on the computation. If it is significant, then we expect to see in the simulation a point perturbation that propagates over successive iterations. We consider a correct simulation as “noise” and try to detect the superimposed “signal” of the perturbation caused by the bit flip. We do so using machine learning: A convolutional neural network (CNN) is trained to distinguish normal simulations from perturbed ones.
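To make the claim about bit-flip impact concrete, the sketch below injects a single bit flip into an IEEE 754 double. It is an illustration only: the helper name `flip_bit` and its bit numbering are assumptions for this example, and it does not reproduce the fault injection used in the project.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 64-bit IEEE 754 double.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, flip, convert back.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    for bit in (0, 30, 52, 62, 63):
        print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

A flip in a low mantissa bit changes the value by about one part in 10^16, which most iterative solvers absorb, whereas a flip in an exponent or sign bit changes the magnitude or sign outright and is the kind of perturbation a detector would need to catch.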

We tested our approach using various configurations of the FLASH code that is widely used for magneto-hydrodynamic simulations. We achieve almost perfect recall, with a false positive rate between 0 and 2.4%, depending on configuration, if the classifier is run at the end of the iteration when the error occurred. Furthermore, a high recall rate is maintained, in many cases, even if the diagnostic is run up to 10 iterations after the error occurred.

Lossy compression: Data reduction is becoming a necessity for many numerical simulations that generate more data than can be stored, communicated, and analyzed. Error-bounded lossy compression provides a reliable way to reduce scientific datasets while respecting user requirements regarding pointwise accuracy. To be usable, lossy compression needs to be fast, and users must be able to understand its impact not only on individual data points, but also globally through statistics and on their own analyses.
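The idea of a pointwise error bound can be illustrated with uniform scalar quantization. This is a simplified sketch under stated assumptions: the function names are hypothetical, only an absolute error bound is handled, and the prediction and entropy coding stages that a production compressor would add are omitted.

```python
from typing import List, Sequence


def quantize(data: Sequence[float], error_bound: float) -> List[int]:
    """Map each value to the index of a bin of width 2 * error_bound."""
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    step = 2.0 * error_bound
    return [round(x / step) for x in data]


def dequantize(codes: Sequence[int], error_bound: float) -> List[float]:
    """Reconstruct each value as the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


if __name__ == "__main__":
    data = [0.0, 0.13, 0.27, 1.52, -2.34]
    bound = 0.1
    restored = dequantize(quantize(data, bound), bound)
    worst = max(abs(a - b) for a, b in zip(data, restored))
    print("max pointwise error:", worst)
```

Because every value lies within half a bin width of its bin center, the reconstruction error stays within `error_bound`, up to floating-point rounding for values that fall exactly on a bin boundary. The integer codes are far more repetitive than the original floats, which is what makes a subsequent entropy coding stage effective.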

To increase the speed of lossy compression, we optimized cuSZ, a CUDA version of SZ, and we optimized Huffman coding specifically for GPUs. The new version (cuSZ+) improves compression throughput and ratio by up to 18.4× and 5.3×, respectively, over cuSZ for seven real-world HPC application datasets on NVIDIA V100 and A100 GPUs. For Huffman coding, we parallelized the entire algorithm, including codebook construction, and proposed a novel reduction-based encoding scheme that can efficiently merge the codewords on GPUs. Experiments show that our solution can improve the encoding throughput by up to 6.8× on NVIDIA V100 over the state-of-the-art GPU Huffman encoder.
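For readers unfamiliar with the two Huffman stages named above, the following is a textbook sequential version of codebook construction and encoding. It is a CPU reference sketch only and does not reflect the parallel codebook construction or the reduction-based encoding developed in the project; the function names are assumptions for this example.

```python
import heapq
from collections import Counter
from typing import Dict, Hashable, Sequence


def build_codebook(symbols: Sequence[Hashable]) -> Dict[Hashable, str]:
    """Build a Huffman codebook (symbol -> bit string) from symbol counts."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial_codebook); the unique
    # tie breaker ensures the dictionaries themselves are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols: Sequence[Hashable], codebook: Dict[Hashable, str]) -> str:
    """Concatenate the variable-length codewords of all symbols."""
    return "".join(codebook[s] for s in symbols)
```

The final concatenation in `encode` is the step that is hard to parallelize, because the output position of each codeword depends on the lengths of all codewords before it; that dependency is presumably what a reduction-based merge on the GPU is meant to address.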

Ultimately, users’ trust in lossy compression relies on the preservation of science: the same conclusions should be drawn from computations or analyses performed on lossy compressed data. To better understand the impact of lossy compression error on scientific data and on post-hoc analysis, we analyzed 8 different use cases and studied 3 different levels of error analysis: visualization (L1), quantitative error analysis (L2), and quantitative analysis of user analysis deviation (L3). Experience from scientific simulations and instruments shows that these three levels of analysis are necessary to quickly detect unacceptable compression-generated artifacts, assess the nature of the compression error, and analyze quantitatively the impact of compression error on user analysis. These 3 levels were used to tune and improve lossy compression pipelines in these 8 use cases to fulfill users’ accuracy requirements on data and post-hoc analysis.
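As an illustration of what a quantitative error analysis (L2) might compute, the sketch below reports the maximum absolute error, the mean squared error, and the peak signal-to-noise ratio (PSNR) between original and decompressed data. The report does not say which metrics were used, so this choice of metrics and the function name are assumptions.

```python
import math
from typing import Dict, Sequence


def error_metrics(
    original: Sequence[float], reconstructed: Sequence[float]
) -> Dict[str, float]:
    """Compare decompressed data against the original, point by point."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [o - r for o, r in zip(original, reconstructed)]
    max_abs_error = max(abs(d) for d in diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    value_range = max(original) - min(original)
    if mse == 0.0:
        psnr = math.inf  # identical data
    elif value_range == 0.0:
        psnr = -math.inf  # constant data reconstructed with error
    else:
        # PSNR in decibels, using the data range as the peak value.
        psnr = 20.0 * math.log10(value_range) - 10.0 * math.log10(mse)
    return {"max_abs_error": max_abs_error, "mse": mse, "psnr_db": psnr}
```

The maximum absolute error checks the pointwise bound directly, while PSNR summarizes the error globally relative to the data range, so the two can disagree about whether a given compression setting is acceptable.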

SDC mitigation/Compression interplay: To detect and correct SDC happening during lossy compression, we designed and implemented the first algorithm-based fault tolerance (ABFT) scheme for lossy compression. Our solution incurs negligible execution overhead in the fault-free case. Should soft errors occur, it ensures that decompressed data stays strictly within the user's error requirement, with very limited degradation of the compression ratio and low overhead.

Last Modified: 11/15/2021
Modified by: Marc Snir
