DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

Qinlong Wang, Bo Sang, Haitao Zhang, Mingjie Tang,Ke Zhang

CoRR(2023)

引用 0|浏览22
暂无评分
摘要
The cloud is still a popular platform for distributed deep learning (DL) training jobs since resource sharing in the cloud can improve resource utilization and reduce overall costs. However, such sharing also brings multiple challenges for DL training jobs, e.g., high-priority jobs could impact, even interrupt, low-priority jobs. Meanwhile, most existing distributed DL training systems require users to configure the resources (i.e., the number of nodes and resources like CPU and memory allocated to each node) of jobs manually before job submission and can not adjust the job's resources during the runtime. The resource configuration of a job deeply affect this job's performance (e.g., training throughput, resource utilization, and completion rate). However, this usually leads to poor performance of jobs since users fail to provide optimal resource configuration in most cases. \system~is a distributed DL framework can auto-configure a DL job's initial resources and dynamically tune the job's resources to win the better performance. With elastic capability, \system~can effectively adjusts the resources of a job when there are performance issues detected or a job fails because of faults or eviction. Evaluations results show \system~can outperform manual well-tuned resource configurations. Furthermore, in the production Kubernetes cluster of \company, \system~reduces the medium of job completion time by 31\%, and improves the job completion rate by 6\%, CPU utilization by 15\%, and memory utilization by 20\% compared with manual configuration.
更多
查看译文
关键词
elastic deep training extension,auto job resource
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要