LogAider: A tool for mining potential correlations of HPC log events.

CCGrid (2017)

Abstract
Today's large-scale supercomputers produce huge amounts of log data. Exploring potential correlations among fatal events is crucial for understanding their causality and for improving the working efficiency of system administrators. To this end, we developed a toolkit, named LogAider, that can reveal three types of potential correlations: across-field, spatial, and temporal. Across-field correlation refers to the statistical correlation across fields within a log or across multiple logs, based on probabilistic analysis. For analyzing the spatial correlation of events, we developed a generic, easy-to-use visualizer that displays any events queried by users on a system machine graph. LogAider can also mine spatial correlations with an optimized K-means clustering algorithm over a torus network topology. In addition, it can disclose temporal correlations (or error propagations) over a certain period inside a log or across multiple logs, based on an effective similarity analysis strategy. We assessed LogAider using the one-year reliability-availability-serviceability (RAS) log of the Mira system (one of the world's most powerful supercomputers), as well as its job log. We find LogAider very helpful for revealing the potential correlations of fatal system events and job events: it mines across-field correlations with both precision and recall of 99.9-100%, and it detects temporal correlations with high similarity (up to 95%) to the ground truth.
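Clustering events over a torus topology requires a distance that accounts for wrap-around links, since nodes at opposite edges of a dimension are direct neighbors. The following is a minimal sketch of such a distance and of the nearest-center assignment step used by K-means-style clustering. It assumes hop-count (Manhattan) distance; the names `torus_distance` and `assign_to_centers` are hypothetical, and the optimized algorithm in LogAider is not specified in the abstract and may differ.

```python
from typing import List, Sequence


def torus_distance(a: Sequence[int], b: Sequence[int], dims: Sequence[int]) -> int:
    """Hop count between two nodes on a torus whose links wrap around.

    `dims` gives the size of each torus dimension (an assumption for
    illustration; any number of dimensions is accepted).
    """
    total = 0
    for x, y, size in zip(a, b, dims):
        d = abs(x - y) % size
        # Take the shorter way around the ring in this dimension.
        total += min(d, size - d)
    return total


def assign_to_centers(
    points: Sequence[Sequence[int]],
    centers: Sequence[Sequence[int]],
    dims: Sequence[int],
) -> List[int]:
    """For each point, return the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: torus_distance(p, centers[i], dims))
        for p in points
    ]


if __name__ == "__main__":
    dims = (8, 8)
    events = [(7, 0), (3, 3)]
    centers = [(0, 0), (4, 4)]
    print(assign_to_centers(events, centers, dims))
```

With a plain Euclidean or non-wrapping distance, the event at `(7, 0)` would look far from the center at `(0, 0)`; on an 8x8 torus the two are a single hop apart, so the event is assigned to that center.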
Keywords
LogAider, mining potential correlations, HPC log events, large-scale supercomputers, across-field correlation, statistical correlation, probabilistic analysis, easy-to-use visualizer, system machine graph, K-means clustering, reliability-availability-serviceability, RAS, Mira system