Toward An End-To-End Framework For Modeling, Monitoring And Anomaly Detection For Scientific Workflows

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016

Abstract
Modern science is often conducted on large-scale, distributed, heterogeneous, high-performance computing infrastructures. The scale and complexity of both the applications and the underlying execution platforms continue to grow. Scientific workflows have emerged as a flexible representation for declaratively expressing complex applications with data and control dependences. However, it is extremely challenging for scientists to execute their scientific workflows reliably and scalably, because the expected and realistic behavior of complex scientific workflows on large-scale, distributed HPC systems is poorly understood. This is exacerbated by failures and anomalies in large-scale systems and applications, which make detecting, analyzing, and acting on anomaly events challenging. In this work, we present a prototype of an end-to-end system for modeling and diagnosing the run-time performance of complex scientific workflows. We integrated the Pegasus workflow management system, Aspen performance modeling, monitoring, and anomaly detection into a single framework that not only improves the understanding of complex scientific applications on large-scale, complex infrastructure, but also detects anomalies and supports adaptivity. We present a black-box modeling tool, a comprehensive online monitoring system, and anomaly detection algorithms that use the models and monitoring data to detect anomaly events. We evaluate the system with a Spallation Neutron Source workflow as the driving use case.
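The idea of combining model predictions with monitoring data to flag anomaly events can be illustrated with a minimal sketch. This is not the detection algorithm from the paper, which the abstract does not specify; it assumes a simple relative-deviation threshold, and the `TaskSample` schema, the task names, the runtimes, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One monitored measurement for a workflow task (hypothetical schema)."""
    task_id: str
    observed_seconds: float   # runtime reported by the monitoring system
    predicted_seconds: float  # runtime predicted by a performance model; assumed > 0


def relative_deviation(sample: TaskSample) -> float:
    """Fractional difference between observed and model-predicted runtime."""
    return abs(sample.observed_seconds - sample.predicted_seconds) / sample.predicted_seconds


def detect_anomalies(samples, threshold=0.5):
    """Return IDs of tasks whose runtime deviates from the prediction by more than `threshold`."""
    return [s.task_id for s in samples if relative_deviation(s) > threshold]


# Illustrative, made-up measurements.
samples = [
    TaskSample("stage_in", observed_seconds=12.0, predicted_seconds=10.0),
    TaskSample("simulate", observed_seconds=400.0, predicted_seconds=200.0),
    TaskSample("stage_out", observed_seconds=9.0, predicted_seconds=10.0),
]

print(detect_anomalies(samples))
```

With these made-up numbers, only the `simulate` task exceeds the threshold. A real system would likely need per-task thresholds or a statistical test rather than a single fixed cutoff.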
Keywords
scientific workflows,performance modeling,monitoring,anomaly detection