Design for a Soft Error Resilient Dynamic Task-Based Runtime

International Parallel & Distributed Processing Symposium(2015)

引用 41|浏览80
暂无评分
摘要
As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exactable a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
更多
查看译文
关键词
soft error resilience,runtime,fault tolerance
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要