Proactive Fault Monitoring In Enterprise Servers

K Whisnant, KC Gross, N Lingurovska

CDES '05: Proceedings of the 2005 International Conference on Computer Design (2005)

Abstract
New proactive fault monitoring innovations are being developed, demonstrated on executing servers, and productized to enhance the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) has been developed that collects time series signals relating to the health of dynamically executing servers. These time series provide quantitative metrics associated with physical variables (distributed temperatures, voltages, and currents throughout the system), "soft" performance variables (loads, throughputs, queue lengths, bit error rates, etc.), and various quality-of-service (QoS) metrics. The CSTH signals are continuously archived to an offline circular file (i.e., the "Black Box Flight Recorder") that is helping to identify and eliminate costly sources of No-Trouble-Found (NTF) events in Sun systems, and the signals are concurrently processed in real time using advanced pattern recognition for proactive anomaly detection. Examples are presented of the uses of the CSTH coupled with pattern recognition for high-sensitivity predictive failure analysis that is helping to increase component and system availability while decreasing the incidence of NTF events, which have become a costly serviceability/warranty issue in the enterprise computing industry.
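The two data paths described in the abstract (a bounded circular archive and concurrent real-time anomaly screening) can be illustrated with a minimal sketch. The abstract does not specify the pattern recognition method, so a simple z-score check against the archived window stands in for it here; the names `TelemetryRecorder`, `is_anomalous`, and `monitor`, as well as the threshold and capacity values, are hypothetical and not taken from the paper.

```python
from collections import deque
from statistics import mean, stdev


class TelemetryRecorder:
    """Fixed-capacity ring buffer: once full, each new sample evicts the oldest."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def values(self):
        return [v for _, v in self.samples]


def is_anomalous(history, value, threshold=4.0, min_samples=10):
    """Stand-in detector (assumption): flag a value that lies more than
    `threshold` sample standard deviations from the mean of `history`."""
    if len(history) < min_samples:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def monitor(stream, capacity=100):
    """Screen each (timestamp, value) sample against the archived window,
    then archive it. Returns the recorder and the list of flagged samples."""
    recorder = TelemetryRecorder(capacity)
    alerts = []
    for timestamp, value in stream:
        if is_anomalous(recorder.values(), value):
            alerts.append((timestamp, value))
        recorder.record(timestamp, value)
    return recorder, alerts
```

Calling `monitor` on an iterable of `(timestamp, value)` pairs, for example readings from one temperature sensor, returns the bounded archive together with any flagged samples. A production system would persist the buffer to disk and use a multivariate model across many signals; this sketch only shows how archiving and screening can share a single pass over the stream.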
Keywords
autonomic computing, proactive fault monitoring, predictive failure analysis