Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

Proceedings of the 37th International Conference on Supercomputing, ACM ICS 2023 (2023)

Abstract

General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential to their performance. While researchers often pursue faster performance by using larger computing platforms, the increased scale of these systems raises concerns about hardware and software reliability. In this paper, we present a design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance scheme that detects and corrects silent data corruptions at computing units on the fly. We explore fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy that overlaps the fault tolerance work with the original GEMM computation to mitigate the memory latency it introduces. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation for both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM performs comparably to or better than the closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only a minimal overhead (8.89% on average) compared to cuBLAS even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by the code generator show speedups of 160% to 183.5% and 148.55% to 165.12% for fault-tolerant and non-fault-tolerant GEMMs, respectively, outperforming cuBLAS by up to 41.40%.
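
The abstract does not spell out the encoding used, so the following is only a minimal CPU sketch of the classic checksum form of algorithm-based fault tolerance for matrix multiplication, written with NumPy rather than as a fused GPU kernel. The function name `abft_gemm`, the `inject` parameter, and the tolerance formula are assumptions made for illustration and are not taken from the paper. The idea is that the column sums and row sums of C = AB can be predicted from A and B alone, so a single corrupted element shows up as exactly one mismatching row sum and one mismatching column sum, which together locate the element and give the amount to subtract.

```python
import numpy as np

def abft_gemm(A, B, inject=None):
    """Compute C = A @ B, then detect and correct a single corrupted element.

    Hypothetical helper for illustration only; not the paper's implementation.
    `inject` is an optional (row, col, delta) tuple that simulates a silent
    data corruption in the result.
    """
    C = A @ B

    # Reference checksums derived from the inputs, independent of C.
    col_ref = (np.ones(A.shape[0]) @ A) @ B   # expected column sums of C
    row_ref = A @ (B @ np.ones(B.shape[1]))   # expected row sums of C

    if inject is not None:
        i, j, delta = inject
        C[i, j] += delta                      # simulated fault

    # Compare the checksums of the computed result against the references.
    col_diff = C.sum(axis=0) - col_ref
    row_diff = C.sum(axis=1) - row_ref

    # Assumed tolerance that absorbs floating-point rounding noise.
    tol = 1e-8 * max(1.0, np.abs(C).max()) * max(C.shape)
    bad_rows = np.flatnonzero(np.abs(row_diff) > tol)
    bad_cols = np.flatnonzero(np.abs(col_diff) > tol)

    corrected = False
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One bad row and one bad column pinpoint the faulty element;
        # the row checksum mismatch is the size of the error.
        C[bad_rows[0], bad_cols[0]] -= row_diff[bad_rows[0]]
        corrected = True
    return C, corrected


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
C_fixed, was_corrected = abft_gemm(A, B, inject=(3, 7, 5.0))
C_clean, clean_flag = abft_gemm(A, B)
```

This sketch handles only one error per result matrix and pays for two extra matrix-vector products plus a full pass over C. Applying the same check at finer granularity (per thread, warp, or threadblock tile, as the abstract describes) would bound the region each checksum covers.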

Keywords

GEMM, GPU, Performance Optimization, Reliability, Resilience
