A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication

Valentin Le Fèvre, Thomas Hérault, Julien Langou, Yves Robert

Euro-Par 2020: Parallel Processing Workshops (2021)

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
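To make the two checksum-based ideas named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the paper relies on optimized BLAS routines, whereas this sketch uses simple loops on lists of rows. All function names are hypothetical. The correction step handles one corrupted coefficient by subtracting the checksum discrepancy, which is a simplification and not necessarily either of the two correction schemes the abstract mentions.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matvec(a, x):
    """Matrix-vector product."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]


def residual_check(a, b, c, x, tol=1e-9):
    """Residual checking (sketch): compare C x with A (B x).

    Returns the indices of the rows of C whose residual exceeds a
    relative tolerance. The tolerance value is an assumption.
    """
    lhs = matvec(c, x)
    rhs = matvec(a, matvec(b, x))
    return [i for i in range(len(c))
            if abs(lhs[i] - rhs[i]) > tol * max(1.0, abs(rhs[i]))]


def abft_locate_and_fix(a, b, c, tol=1e-9):
    """ABFT-style checksums (sketch): predict row and column sums of C.

    If exactly one row sum and one column sum disagree, the error sits
    at their intersection and is corrected in place. Returns (i, j) of
    the corrected coefficient, or None if no single error was isolated.
    The absolute tolerance is an assumption suited to small test data.
    """
    n, m, k = len(c), len(c[0]), len(b)
    # Expected row sums of C: A (B e), with e the all-ones vector.
    exp_row = matvec(a, matvec(b, [1.0] * m))
    # Expected column sums of C: (e^T A) B.
    col_a = [sum(a[i][p] for i in range(n)) for p in range(k)]
    exp_col = [sum(col_a[p] * b[p][j] for p in range(k)) for j in range(m)]
    bad_rows = [i for i in range(n) if abs(sum(c[i]) - exp_row[i]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c[i][j] for i in range(n)) - exp_col[j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= sum(c[i]) - exp_row[i]  # remove the row discrepancy
        return (i, j)
    return None
```

Both checks cost a few matrix-vector products, which is far cheaper than the matrix-matrix product they protect. The residual check only flags suspect rows, while the two-sided checksum also pinpoints the column.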

Keywords

Resilience, Matrix-matrix multiplication, Algorithm-based fault tolerance (ABFT), Residual checking (RC), Silent errors
