ElastiSim: A Batch-System Simulator for Malleable Workloads

Proceedings of the 51st International Conference on Parallel Processing (2022)

Abstract
As high-performance computing infrastructures move towards exascale, the role of resource and job management systems is more critical now than ever. Simulating batch systems to improve scheduling algorithms and resource management efficiency is an indispensable option, as running large-scale experiments is expensive and time-consuming. Batch-system simulators are responsible for simulating the computing infrastructure and the types of jobs that constitute the workload. In contrast to rigid jobs, malleable jobs can dynamically reconfigure their resources during runtime. Although studies indicate that malleability can improve system performance, no simulator exists to investigate malleable scheduling policies. In this work, we present ElastiSim, a batch-system simulator supporting the combined scheduling of rigid and malleable jobs. To facilitate the simulation, we propose a malleable workload model and introduce a scheduling protocol that enables the evaluation of topology-, I/O-, and progress-aware scheduling algorithms. We validate the scaling behavior of our workload model by comparing training runtimes of various deep-learning models against the results achieved by ElastiSim. We use real-world cluster trace files to generate workloads and simulate various scheduling algorithms (FCFS, SJF, DRF, SRTF) to analyze their implications on the simulated platform. The results demonstrate that real-world executions show the same scaling behavior as our proposed workload model. We further show that ElastiSim can capture the complex interplay between emerging workloads and modern platforms to support algorithm designers by providing consistently meaningful results. ElastiSim is publicly available as an open-source project on https://github.com/elastisim.
Keywords
batch systems, simulations, malleable workloads, adaptive job scheduling, resource management