Draco: Statistical Diagnosis Of Chronic Problems In Large Distributed Systems

DSN '12: Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (2012)

Abstract
Chronics are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages that are rare, often have a single cause, and as a result are relatively easy to detect and diagnose quickly, chronic problems are elusive because they are often triggered by complex conditions, persist in a system for days or weeks, and coexist with other problems active at the same time. In this paper, we present Draco, a scalable engine to diagnose chronics that addresses these issues by using a "top-down" approach that starts by heuristically identifying user interactions that are likely to have failed, e.g., dropped calls, and drills down to identify groups of properties that best explain the difference between failed and successful interactions by using a scalable Bayesian learner. We have deployed Draco in production for the VoIP operations of a major ISP. In addition to providing examples of chronics that Draco has helped identify, we show via a comprehensive evaluation on production data that Draco provided 97% coverage, had fewer than 4% false positives, and outperformed state-of-the-art diagnostic techniques by up to 56% for complex chronics.
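The drill-down step described in the abstract, finding properties that best explain the difference between failed and successful interactions, can be illustrated with a minimal sketch. This is not Draco's actual algorithm: it assumes each call is a set of attribute strings plus a failure label, uses a uniform Beta(1, 1) prior, and scores single attributes (not groups) by the log ratio of posterior-mean occurrence rates. The function name `rank_attributes` and the sample attribute names are hypothetical.

```python
from collections import Counter
from math import log


def rank_attributes(calls, prior=1.0):
    """Rank attributes by how much more often they appear in failed calls.

    calls: list of (attributes, failed) pairs, where attributes is a set of
    strings and failed is a bool. This is an illustrative scoring rule under
    the assumptions above, not the learner described in the paper.
    """
    n_fail = sum(1 for _, failed in calls if failed)
    n_ok = len(calls) - n_fail

    fail_counts = Counter()
    ok_counts = Counter()
    for attrs, failed in calls:
        (fail_counts if failed else ok_counts).update(attrs)

    scores = {}
    for attr in set(fail_counts) | set(ok_counts):
        # Posterior mean of a Bernoulli rate under a Beta(prior, prior) prior.
        p_fail = (fail_counts[attr] + prior) / (n_fail + 2 * prior)
        p_ok = (ok_counts[attr] + prior) / (n_ok + 2 * prior)
        scores[attr] = log(p_fail / p_ok)

    # Highest score first: attributes most over-represented among failures.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical call records: every failed call went through gateway gw7.
sample_calls = [
    ({"gateway=gw7", "codec=g711"}, True),
    ({"gateway=gw7", "codec=g729"}, True),
    ({"gateway=gw7", "codec=g711"}, True),
    ({"gateway=gw1", "codec=g711"}, False),
    ({"gateway=gw2", "codec=g711"}, False),
    ({"gateway=gw1", "codec=g729"}, False),
]

ranked = rank_attributes(sample_calls)
print(ranked[0][0])  # gateway=gw7
```

The smoothing prior keeps the ratio finite for attributes never seen in one of the two classes. A fuller treatment would extend the top-ranked attribute into groups of co-occurring properties, as the abstract describes.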
Keywords
Bayes methods, Internet telephony, alarm systems, computer network performance evaluation, learning (artificial intelligence), statistical analysis, Draco, ISP, VoIP operations, alarm thresholds, chronic problems, comprehensive production data evaluation, large distributed systems, recurrent problems, scalable Bayesian learner, service invocations, statistical diagnosis, top-down approach, user interaction identification