Topology-Aware Event Sequence Mining for Understanding HPC System Behavior and Detecting Anomalies.

HPCC/SmartCity/DSS(2019)

Abstract
System logs are an invaluable resource for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interactions among system components make traditional manual problem diagnosis, and even automated line-by-line log analysis, infeasible or ineffective. Sequence mining techniques aim to identify important patterns among a set of objects, which can help discover regularities among events, detect anomalies, and predict events in HPC environments. Existing sequence mining algorithms, however, are compute-intensive and inefficient at processing the overwhelming number of system events, which have complex interactions and dependencies. In this paper, we present a novel topology-aware sequence mining method (named TSM) and explore its use for event analysis and anomaly detection on production HPC systems. TSM is resource-efficient and capable of producing long and complex event patterns from log messages, which makes it suitable for online monitoring and diagnosis of large-scale systems. We evaluate the performance of TSM using system logs collected from a production supercomputer. Experimental results show that TSM is highly efficient in identifying event sequences on single and multiple nodes without any prior knowledge. We apply verification functions and requirements and prove the correctness of the event patterns produced by TSM.
Keywords
High performance computing systems, System monitoring and diagnosis, Anomaly detection, Sequence mining, Event patterns