Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts.

Zizhao Chen, Thomas Verrecchia, Hongyang Sun, Joshua Dennis Booth, Padma Raghavan

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (2023)

Abstract
Soft errors occur frequently on large computing platforms due to the increasing scale and complexity of HPC systems. Various resilience techniques (e.g., checkpointing, ABFT, and replication) have been proposed to protect scientific applications from soft errors at different levels. Among them, system-level replication often involves duplicating or even triplicating the entire computation, thus resulting in high resilience overhead. This paper proposes dynamic selective protection for sparse iterative solvers, in particular for the Preconditioned Conjugate Gradient (PCG) solver, at the system level to reduce the resilience overhead. For this method, we leverage machine learning (ML) to predict the impact of soft errors that strike different elements of a key computation (i.e., sparse matrix-vector multiplication) at different iterations of the solver. Based on the result of the prediction, we design a dynamic strategy to selectively protect those elements that would result in a large performance degradation if struck by soft errors. An experimental evaluation demonstrates that our dynamic protection strategy is able to reduce the resilience overhead compared to existing algorithms.
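
The idea of protecting only the parts of the sparse matrix-vector multiplication that are predicted to matter can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the ML model, the selection rule, or the protection mechanism. The sketch assumes that protection is applied per matrix row, that a predictor has already produced one impact score per row for the current iteration, and that protection means duplicating the row computation and recomputing on a mismatch. The names `select_rows`, `protected_spmv`, `predicted_impact`, `threshold`, and `fault` are hypothetical.

```python
def spmv_row(indptr, indices, data, x, i):
    """Dot product of CSR row i with the dense vector x."""
    return sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))


def select_rows(predicted_impact, threshold):
    """Hypothetical selection rule: protect rows whose predicted impact
    exceeds the threshold. A trained model would supply predicted_impact;
    here it is just a list of numbers (an assumption for illustration)."""
    return {i for i, p in enumerate(predicted_impact) if p > threshold}


def protected_spmv(indptr, indices, data, x, protect, fault=None):
    """CSR SpMV that duplicates the computation only for rows in `protect`.

    `fault` is an optional (row, delta) pair that corrupts the first
    computation of that row, standing in for a soft error.
    """
    n = len(indptr) - 1
    y = []
    for i in range(n):
        value = spmv_row(indptr, indices, data, x, i)
        if fault is not None and fault[0] == i:
            value += fault[1]  # injected error, for illustration only
        if i in protect:
            # Duplicate the computation; a mismatch signals an error,
            # in which case the row is recomputed.
            check = spmv_row(indptr, indices, data, x, i)
            if check != value:
                value = spmv_row(indptr, indices, data, x, i)
        y.append(value)
    return y
```

In this sketch, an injected error in a protected row is detected and corrected at the cost of one extra row computation, while an error in an unprotected row passes through to the output. Re-running the selection at every solver iteration with fresh predictions would make the protection dynamic in the sense the abstract describes.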