Fault-Tolerant LU Factorization Is Low Cost

EURO-PAR 2021: PARALLEL PROCESSING(2021)

引用 1|浏览14
暂无评分
摘要
At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has arisen as a major challenge in parallel computing as the size of the systems grow, failures become more common and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable, while being resilient to hardware failures that will happen during the execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of the scalable communication-avoiding LU algorithms in order to make them fault-tolerant and proceed with the computation in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both talland-skinny matrices (TSLU) and square matrices (CALU), and the cost of a failure during the execution.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要