Classifying jobs towards power-aware HPC system operation through long-term log analysis

Array (2022)

Abstract
The efficient utilization of high-performance computing (HPC) system resources under a strict electric power budget or I/O workload constraints is among the most important goals set by system operators to meet the demanding requirements of application users. In most cases, the effective utilization of CPU and memory devices, which is tightly linked to electric power consumption, trades off against I/O activity in HPC jobs. To achieve higher utilization of HPC systems under strict constraints on electric power consumption and I/O activity, we must prevent hot-spots from developing in power consumption or I/O operations, since these could lead to unstable system operation by exceeding the capabilities of the electric power supply or the I/O subsystem. One feasible solution is to arrange compute node assignment so that such hot-spots do not arise. To address this issue, we analyzed vast amounts of log data collected from the K computer and found strong positive correlations between CPU and memory device utilization rates and electric power consumption levels. On the other hand, we also observed strong negative correlations between file I/O activity and electric power consumption, with reduced power consumption, in a specific compute node layout, which indicates unique characteristics of some I/O-intensive HPC jobs in that layout. Our investigation revealed that HPC jobs can be divided into two groups when classified by required electric power: jobs consuming high levels of electric power, and I/O-intensive jobs with reduced electric power levels. We then achieved high accuracy in classifying jobs by electric power level using RandomForestClassifier, one of several machine learning classification models provided by scikit-learn. This classification can help avoid hot-spots in electric power consumption when assigning compute nodes during job scheduling. Thus we demonstrated efficient job classification towards power-aware system operation on the supercomputer Fugaku, which is the successor to the K computer.
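To illustrate how a two-class power label might feed into compute node assignment, the following is a minimal sketch and not the authors' scheduler. It assumes each job already carries a predicted class (for example, from a classifier such as RandomForestClassifier), and it uses a simple greedy balancing heuristic: the largest estimated loads are placed first, each on the least-loaded rack. The `Job` type, the `assign_jobs` function, the notion of a "rack", and the per-node wattage values are all hypothetical placeholders chosen for illustration.

```python
import heapq
from dataclasses import dataclass

# Hypothetical per-node power estimates (watts) for each predicted class.
# Illustrative placeholders only, not measurements from the K computer.
EST_WATTS_PER_NODE = {"high_power": 120.0, "io_intensive": 80.0}


@dataclass
class Job:
    name: str
    nodes: int
    predicted_class: str  # assumed classifier output: "high_power" or "io_intensive"

    @property
    def est_power(self) -> float:
        # Estimated draw = node count times the class-level per-node estimate.
        return self.nodes * EST_WATTS_PER_NODE[self.predicted_class]


def assign_jobs(jobs, n_racks):
    """Greedy balancing: place the largest estimated loads first, each on the
    rack that currently has the lowest estimated power draw."""
    heap = [(0.0, rack) for rack in range(n_racks)]  # (current load, rack id)
    heapq.heapify(heap)
    placement = {rack: [] for rack in range(n_racks)}
    for job in sorted(jobs, key=lambda j: j.est_power, reverse=True):
        load, rack = heapq.heappop(heap)
        placement[rack].append(job.name)
        heapq.heappush(heap, (load + job.est_power, rack))
    loads = {rack: load for load, rack in heap}
    return placement, loads


jobs = [
    Job("a", 8, "high_power"),
    Job("b", 8, "high_power"),
    Job("c", 4, "io_intensive"),
    Job("d", 4, "io_intensive"),
]
placement, loads = assign_jobs(jobs, n_racks=2)
print(placement)
print(loads)
```

In this toy input the two high-power jobs end up on different racks, each paired with one I/O-intensive job, so the estimated load is even. The sketch ignores rack capacity, job duration, and I/O subsystem limits, all of which a real scheduler would need to consider.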
Keywords
Classification, Machine learning, Electric power, FLOPS, Memory bandwidth, File I/O