What Can We Learn from Four Years of Data Center Hardware Failures?

DSN(2017)

引用 107|浏览80
暂无评分
摘要
Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operatorsu0027 response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.
更多
查看译文
关键词
data center hardware failures,large-scale data center dependability,failure characteristics,repeating failures,batch failures,recovery patterns,software design
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要