CanarIO: Sounding the Alarm on IO-Related Performance Degradation

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Abstract
Users interact with High Performance Computing (HPC) machines through batch systems, which take user job submissions and allocate them to computing resources. While some resource managers have a generalized resource model, in nearly all modern systems, nodes are the only resource managed. Other resources, such as parallel file systems, are also necessary for jobs to make progress, but schedulers are blind to these resources. Facility staff can detect critical problems and hold jobs that need particular file systems, but doing so requires manual monitoring. Without human intervention, modern schedulers will happily run jobs whose required resources are not available. As a result, resources are wasted when IO-intensive jobs are scheduled on file systems with degraded performance.
We introduce CanarIO, a tool for predicting the IO-sensitivity of HPC jobs and detecting IO-related performance degradation on HPC systems. CanarIO uses a set of "canary" IO probes run at regular intervals on the system. Using performance measurements from these jobs, CanarIO builds classifiers that can determine which jobs are IO-sensitive and when file system performance is degraded. We demonstrate the accuracy of our tool with a simulation of system execution using real HPC data. Specifically, we detect 37.5% of IO degradation events and correctly identify >90% of IO-sensitive jobs. We show that with CanarIO predictions we recover >1,500 node-hours in 10 days, with a potential maximum of nearly 10,000 node-hours. CanarIO is the first step necessary for augmenting schedulers to be resource-aware.
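To make the "canary" probe idea concrete, the following is a minimal sketch and not CanarIO's actual implementation. The abstract does not describe the probes or the classifiers in detail, so everything here is an assumption: the function names `run_probe` and `is_degraded` are hypothetical, the probe is a single timed write, and a simple median-threshold heuristic stands in for the learned classifiers.

```python
import os
import statistics
import tempfile
import time


def run_probe(directory, size_bytes=1 << 20):
    """Hypothetical canary probe: write and fsync a scratch file in
    `directory`, then return the observed throughput in bytes/second."""
    payload = b"\0" * size_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the file system
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return size_bytes / max(elapsed, 1e-9)


def is_degraded(history, latest, fraction=0.5):
    """Assumed stand-in for a classifier: flag the latest probe when its
    throughput falls below `fraction` of the median of earlier probes."""
    if not history:
        return False  # no baseline yet, so nothing can be flagged
    return latest < fraction * statistics.median(history)
```

In this sketch a scheduler-side loop would call `run_probe` on each file system at a fixed interval, append the result to that file system's history, and hold IO-sensitive jobs while `is_degraded` returns true. The 0.5 threshold is an arbitrary illustrative value, not one taken from the paper.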
Key words
performance degradation, alarm, IO-related