Blue Waters system and component reliability

Brett Bode, David King, Celso L. Mendes, William T. Kramer, Saurabh Jha, Roger Ford, Justin Davis, Steven Dramstad

Concurrency and Computation: Practice and Experience (2024)

Abstract
The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) nodes and single-socket CPU, single-GPU (XK) nodes. The primary storage is provided by Cray's Sonexion/ClusterStor Lustre storage system, delivering 35 PB of raw storage at 1 TB/s. The statistical failure rates over time for each component type, including CPUs, DIMMs, GPUs, disk drives, power supplies, and blowers, and their impact on higher-level failure rates for individual nodes and the system as a whole, are presented in detail, with particular emphasis on identifying any increase in rate that might indicate that the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failures, such as the preemptive removal of suspect disk drives, are also presented.
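
As a rough illustration of the kind of check the abstract describes, the following minimal sketch computes a per-year failure rate for one component type and flags a sustained rise that could suggest the wear-out (right-hand) side of a bathtub curve. It is not the authors' method: the function names, the `window` and `factor` thresholds, and all counts are hypothetical and chosen only for illustration.

```python
from statistics import mean


def annualized_failure_rates(failures_per_year, population):
    """Return failures per component-year for each year of operation.

    Assumes a roughly constant installed population (hypothetical simplification).
    """
    return [failures / population for failures in failures_per_year]


def wear_out_suspected(rates, window=2, factor=1.5):
    """Flag a possible wear-out phase.

    Returns True when the mean rate over the last `window` periods exceeds
    `factor` times the lowest rate seen in the earlier periods.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if len(rates) <= window:
        return False  # not enough history to form a baseline
    baseline = min(rates[:-window])
    recent = mean(rates[-window:])
    return recent > factor * baseline


# Hypothetical yearly failure counts for 10,000 components of one type:
# early-life failures, a flat middle period, then a rising tail.
example_failures = [120, 60, 50, 55, 90, 110]
example_rates = annualized_failure_rates(example_failures, population=10_000)
rising = wear_out_suspected(example_rates)
```

A real analysis would also need to account for replaced parts, changes in population over time, and statistical uncertainty in the observed counts, none of which this sketch attempts.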
Keywords
failure analysis, system management