Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage

INFOCOM (2023)

Abstract
The separation of compute and storage in modern cloud services eases the deployment of general applications. However, with the development of accelerators such as GPUs/TPUs, Deep Learning (DL) training suffers from potential IO bottlenecks when loading data from storage clusters. DL training jobs therefore need to either create a local cache in the compute cluster to reduce bandwidth demands, or scale up IO capacity at a higher bandwidth cost. Choosing the best strategy is challenging due to the heterogeneous cache/IO preferences of DL models, datasets shared among multiple jobs, and the dynamic GPU scaling of DL training. In this work, we exploit job characteristics based on training throughput, dataset size, and scalability. For jobs with fixed GPU allocations, we propose CBA to minimize training cost with a closed-form approach. For clusters that can automatically scale the GPU allocations of jobs, we extend CBA to AutoCBA to support diverse job utility functions and improve social welfare within a limited budget. Extensive experiments with production traces validate that CBA and AutoCBA can reduce IO cost and improve total social welfare by up to 20.5% and 2.27×, respectively, over state-of-the-art schedulers for DL training.
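To make the cache-or-bandwidth choice concrete, here is a minimal sketch of the per-job trade-off the abstract describes. All parameters (prices, sizes) are hypothetical illustrations; this is not the paper's actual CBA formulation, which uses a closed-form optimization over many jobs with shared datasets.

```python
def cheaper_strategy(dataset_gb, required_gbps, cache_price_per_gb, bw_price_per_gbps):
    """Pick the lower-cost IO strategy for a single DL training job.

    dataset_gb        -- size of the training dataset to cache locally
    required_gbps     -- remote-read bandwidth needed to keep GPUs busy
    cache_price_per_gb, bw_price_per_gbps -- hypothetical unit prices
    """
    cache_cost = dataset_gb * cache_price_per_gb   # cache the whole dataset in the compute cluster
    bw_cost = required_gbps * bw_price_per_gbps    # provision IO bandwidth to the storage cluster
    if cache_cost <= bw_cost:
        return ("cache", cache_cost)
    return ("bandwidth", bw_cost)

# A small dataset with high throughput favors caching:
print(cheaper_strategy(100, 5, 0.02, 1.0))   # -> ('cache', 2.0)
# A pricier cache tips the same job toward provisioning bandwidth:
print(cheaper_strategy(100, 5, 0.10, 1.0))   # -> ('bandwidth', 5.0)
```

In the real setting the choice is coupled across jobs (a shared dataset cached once serves many jobs) and across time (GPU scaling changes the required bandwidth), which is what makes the closed-form CBA approach nontrivial.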
Keywords
AutoCBA,bandwidth demands,CBA,closed-form approach,compute cluster,dataset size,Deep Learning clusters,Deep Learning training,diverse job utility functions,DL training jobs,dynamic GPU scaling,dynamic resource allocation,general applications,higher bandwidth cost,IO capacity,IO cost,job characteristics,loading data,local cache,modern cloud services,multiple jobs,potential IO bottlenecks,separated compute,shared dataset,storage clusters,training cost,training throughput