Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

The International Conference on Dependable Systems and Networks(2015)

引用 102|浏览106
暂无评分
摘要
As we approach exascale, the scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While temporal properties of system failures on HPC systems have been well-investigated, there is limited understanding about the spatial characteristics of system failures and its impact on the resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have \"spatial locality\" at different granularity in the system, study impact of different failure-types, and investigate the correlation among different failure-types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves the system performance in a dynamic and production-level HPC system.
更多
查看译文
关键词
Spatial Locality,System Failures,Resilience,Fault tolerance,High Performance Computing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要