Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales.

IEEE Transactions on Parallel and Distributed Systems (2017)

Abstract
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online ...
Keywords
Computational modeling, Delays, Protocols, Resilience, Fault tolerance, Fault tolerant systems, Hardware