Making Problem Diagnosis Work for Large-Scale, Production Storage Systems.

LISA'13: Proceedings of the 27th USENIX conference on Large Installation System Administration(2013)

引用 0|浏览9
暂无评分
摘要
Intrepid has a very-large, production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences, of taking an automated problem diagnosis approach from proof-of-concept on a 12-server test-bench parallel-file-system cluster, and making it work on Intrepid's storage system. We also present a 15-month case study, of problems observed from the analysis of 624GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems, in a matter of hours, as compared to the days or longer with manual approaches.
更多
查看译文
关键词
["case study","infrastructure","problem diagnosis","storage systems"]
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要