A Practical Approach for Handling Soft Errors in Iterative Applications

Cluster Computing (2015)

Abstract
As feature sizes shrink, there is a growing need to handle soft errors at the software level. This paper focuses on iterative scientific applications, particularly solvers of PDEs. After empirically studying the impact of bit flips on the convergence and correctness of these applications, and analyzing the underlying numerical algorithms, we propose a method for improving their accuracy in the presence of silent data corruptions. We show that changes in the value of the residue can serve as a signature that detects the soft errors with the most negative impact on the applications. Our analysis also shows that, for iterative solvers, bit flips in the later part of the computation are much more likely to affect the final results. For such cases, we propose partial replication to improve accuracy without very large overheads. After applying our approach to five scientific applications, we find that our signature-based method removes all infinite loops caused by bit flips, reduces the error in the final results by up to 99%, and has less than 6% overhead (with an additional 24% overhead for checkpointing and restart). For two of the applications, the reduction in error can be as high as 99.9% when partial replication is used together with our signature analysis.
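The residue-as-signature idea can be illustrated with a small sketch. This is not the paper's implementation: the Jacobi solver, the function names (`flip_bit`, `residual_norm`, `jacobi_step`, `solve`), the `jump_factor` threshold, and the checkpoint-every-iteration policy are all assumptions chosen for illustration. The sketch rejects any step whose residual grows sharply compared with the last accepted step, then restarts from the saved state.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def residual_norm(A, x, b):
    """Infinity norm of b - A x for a dense list-of-lists matrix A."""
    return max(
        abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
        for row, bi in zip(A, b)
    )


def jacobi_step(A, x, b):
    """One Jacobi sweep: solve each equation for its diagonal unknown."""
    n = len(x)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def solve(A, b, tol=1e-10, max_iter=500, jump_factor=10.0, inject=None):
    """Jacobi iteration guarded by a residual check.

    inject is a hypothetical test hook: an (iteration, index, bit) tuple
    that flips one bit of one solution entry exactly once.
    Returns (solution, loop iterations used, number of rollbacks).
    """
    x = [0.0] * len(b)
    checkpoint = list(x)              # last accepted state
    prev_res = residual_norm(A, x, b)
    rollbacks = 0
    pending = inject
    for it in range(max_iter):
        x_new = jacobi_step(A, x, b)
        if pending is not None and it == pending[0]:
            x_new[pending[1]] = flip_bit(x_new[pending[1]], pending[2])
            pending = None
        res = residual_norm(A, x_new, b)
        if not (res <= jump_factor * prev_res):
            # Residual jumped: treat as a soft error and restart from the checkpoint.
            x = list(checkpoint)
            rollbacks += 1
            continue
        x, checkpoint, prev_res = x_new, list(x_new), res
        if res < tol:
            return x, it + 1, rollbacks
    return x, max_iter, rollbacks
```

In this toy setting, a flip in a high exponent bit (for example `inject=(5, 1, 62)`) makes the residual blow up and triggers a rollback, while a flip in the lowest mantissa bit passes the check unnoticed but is absorbed by later iterations. A real application would checkpoint far less often and would need a threshold tuned to how its own residual behaves.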
Keywords
soft error handling,iterative application,software fault tolerance,silent data corruption,soft error detection,iterative solver,bit flip,signature analysis,high performance computing