On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading

Euro-Par 2020: Parallel Processing Workshops (2021)

Abstract
This paper studies the use of Redundant Multi-Threading (RMT) to detect Silent Data Corruptions in HPC applications. To understand if it can be a viable solution in an HPC context, we study two software optimizations to reduce RMT performance overhead by reducing the amount of data exchanged between the replicated threads. We conduct experiments with representative HPC workloads to measure the performance gains obtained through these optimizations, and the error detection coverage they achieve. In the best case, when running on a processor that features Simultaneous Multi-Threading, our results show that the overhead can be as low as 1.4x without significantly reducing the ability to detect data corruptions.
Key words
HPC, Silent data corruptions, Redundant multi-threading
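To make the idea of replicated threads exchanging data concrete, here is a minimal sketch of redundant multi-threading in Python. It is an illustration under stated assumptions, not the paper's implementation: the names `run_rmt`, `square`, and `fault_at` are hypothetical, the corruption is injected artificially, and the two optimizations the abstract mentions are not modelled because the abstract does not describe them. A leading thread computes each result and sends it through a queue; a trailing thread recomputes the same result and reports any disagreement.

```python
import queue
import threading


def run_rmt(kernel, inputs, fault_at=None):
    """Run `kernel` over `inputs` in a leading and a trailing thread.

    Returns the list of input indices at which the two threads disagree.
    `fault_at` (hypothetical, for illustration) corrupts the leading
    thread's result at that index to simulate a silent data corruption.
    """
    channel = queue.Queue()  # data exchanged between the replicated threads
    mismatches = []

    def leading():
        for i, x in enumerate(inputs):
            y = kernel(x)
            if i == fault_at:
                y += 1  # simulated corruption, not a real hardware fault
            channel.put((i, y))
        channel.put(None)  # sentinel: no more results

    def trailing():
        while True:
            item = channel.get()
            if item is None:
                break
            i, y_lead = item
            # Redundant recomputation followed by comparison.
            if kernel(inputs[i]) != y_lead:
                mismatches.append(i)

    t_lead = threading.Thread(target=leading)
    t_trail = threading.Thread(target=trailing)
    t_lead.start()
    t_trail.start()
    t_lead.join()
    t_trail.join()
    return mismatches


def square(x):
    return x * x
```

Calling `run_rmt(square, [1, 2, 3, 4])` returns an empty list, while `run_rmt(square, [1, 2, 3, 4], fault_at=2)` returns `[2]`. Every result passes through the queue in this sketch, which is the kind of inter-thread traffic that the abstract's optimizations aim to reduce.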