Failures in large scale systems: long-term measurement, analysis, and implications

SC(2017)

Abstract
Resilience is one of the key challenges in maintaining the high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and to plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings that remain valid, report new findings, and discuss their implications.
Keywords
large-scale system failures, multiple large-scale HPC production systems, future HPC systems, reliability characteristics, field-data studies, system practitioners, future extreme-scale supercomputers, long-term measurement