Optimal Placement of Retry-Based Fault Recovery Annotations in HPC Applications

semanticscholar(2013)

Abstract
As larger HPC systems are built, fault recovery becomes a fundamental capability. Traditional fault recovery approaches, such as checkpointing, may not be sufficient for future exascale systems. Retry-based recovery techniques have been proposed as an alternative. These techniques require code annotations that mark a code region, and they simply re-execute that region when a fault occurs. However, no previous work has investigated the optimal placement of these annotations in a program. Using fault injection, we evaluate how to optimally place retry annotations in a hydrodynamics mini-application. We found that, contrary to our expectations, a simple scheme of protecting the main function works well for low fault rates: the slowdown is at most 1.25 at a rate of 3 faults/hour. We also found that the optimal recovery method is to roll back a few iterations in the application’s main loop.
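To make the idea of rolling back a few iterations in the main loop concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that a fault is reported as an exception and that the application state can be copied. The names `DetectedFault`, `run_with_rollback`, `step`, `rollback`, and `max_retries` are hypothetical and do not come from the paper.

```python
import copy
from collections import deque


class DetectedFault(Exception):
    """Hypothetical signal that a fault was detected during an iteration."""


def run_with_rollback(state, step, n_iters, rollback=2, max_retries=10):
    """Run step(state, i) for i in range(n_iters), retrying on faults.

    On a fault, execution resumes from the snapshot taken up to
    `rollback` iterations before the faulty one (assumed policy).
    Returns the final state and the number of retries performed.
    """
    # Snapshots taken at the start of the most recent iterations.
    snapshots = deque(maxlen=rollback + 1)
    retries = 0
    i = 0
    while i < n_iters:
        snapshots.append((i, copy.deepcopy(state)))
        try:
            state = step(state, i)
            i += 1
        except DetectedFault:
            retries += 1
            if retries > max_retries:
                raise
            # Restore the oldest retained snapshot and re-execute from there.
            i, saved = snapshots[0]
            state = copy.deepcopy(saved)
            snapshots.clear()
    return state, retries
```

With `rollback=0` the sketch re-executes only the faulty iteration, while a value of `n_iters` or larger restarts the whole loop on any fault, so the single parameter spans the range of placements between the two extremes.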