A Memory RAS System Design and Engineering Practice in High Temperature Ambient Data Center

Aili Yao, JinFeng Li, Fengqian Wang, Jie Zhao, Hongmei Liu, Jiajun Zhang, Jun Zhang, Alex Zhou, Youquan Song, Jialiang Xu, Paul Sun, Kunye Zhu, Nishi Ahuja, Dayi Zhu, Sean Kuo

Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (2020)

Abstract
Increasing data center infrastructure uptime, reducing unplanned downtime, and maintaining data integrity are increasingly critical in today's real-time, service-level agreement (SLA)-driven cloud service business environment. Servers, as the backbone of cloud computing, have been evolving to meet the diverse challenges of preserving data integrity, increasing availability, and minimizing planned downtime, especially in high temperature ambient (HTA) data center environments. Robust server system design for reliability, availability, and serviceability (RAS) is therefore crucial for cloud service providers. Memory errors in servers are among the most common hardware causes of machine crashes in production sites with large-scale systems, and the higher the ambient temperature, the more errors occur. The typical response to a memory failure is to replace any affected memory modules, which makes memory modules among the most commonly replaced server components. Memory failures and their correction are therefore very costly. Based on data collection and failure analysis from the Baidu infrastructure maintenance group, the server memory (uncorrectable error) failure rate ranks first among failure types in the data center. To reduce the memory failure rate and the related server downtime, Baidu developed an advanced server that handles correctable and uncorrectable memory errors throughout a "6 pillars" complete application stack, from the underlying hardware to the scheduling system. Such solutions involve three components: (1) reliability, how the solution preserves data integrity; (2) availability, how it guarantees uninterrupted operation with minimal degradation; and (3) serviceability, how it simplifies dealing, both proactively and reactively, with failed or potentially failed components.
Availability is not an independent vector. This paper addresses the Baidu rack server memory RAS architecture and design. Its scope starts with the Intel Xeon processor RAS features for hardware error avoidance, detection, and correction, which provide system reliability and improve fault tolerance; continues with failure identification and reconfiguration, such as leaky-bucket-based software-enhanced error recovery and error containment; and extends to kernel-level page retirement and a high-availability scheduler. Memory RAS system design specific to the HTA data center environment is also introduced in detail. The related lab test procedures, data, and observations are summarized at the end of each section, and the overall conclusions and future work plan are summarized at the end of the paper.
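The abstract names a leaky-bucket mechanism for error handling without describing it. As a rough illustration only, the general leaky-bucket idea applied to correctable-error counting can be sketched as below. This is not the paper's implementation: the class name `LeakyBucket`, the parameters `capacity` and `leak_per_hour`, and the numbers in the usage example are all hypothetical.

```python
class LeakyBucket:
    """Generic leaky-bucket counter for correctable errors (illustrative only).

    All names and parameters here are assumptions, not taken from the paper.
    """

    def __init__(self, capacity: float, leak_per_hour: float) -> None:
        self.capacity = capacity            # assumed threshold before action is taken
        self.leak_per_hour = leak_per_hour  # assumed drain rate, in errors per hour
        self.level = 0.0                    # current fill level of the bucket
        self.last_time_h = 0.0              # time of the last update, in hours

    def record_error(self, now_h: float, count: int = 1) -> bool:
        """Record `count` errors at time `now_h`; return True on overflow."""
        # Drain the bucket for the time elapsed since the last update.
        elapsed = max(0.0, now_h - self.last_time_h)
        self.level = max(0.0, self.level - elapsed * self.leak_per_hour)
        self.last_time_h = now_h
        # Add the new errors and compare against the threshold.
        self.level += count
        return self.level > self.capacity


# Hypothetical usage: a burst of errors overflows, sparse errors do not.
bucket = LeakyBucket(capacity=3, leak_per_hour=1)
burst = [bucket.record_error(0.0) for _ in range(4)]
print(burst)
```

The point of the drain step is that isolated correctable errors spread over a long period never reach the threshold, while a burst in a short window does, which is what would distinguish a degrading component from background noise under these assumptions.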
Keywords
Reliability, Availability, and Serviceability (RAS); Memory; Advanced Double Device Data Correction; Hwpoison; Recovery and Containment; High Temperature Ambient (HTA)