DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (2022)

Abstract
Synchronous strategies with data parallelism are widely used in distributed training of Deep Neural Networks (DNNs), largely owing to their ease of implementation and promising performance. In these strategies, workers with different computational capabilities must wait for one another because of the required gradient or weight synchronization. High-performance workers therefore inevitably waste time waiting for the slower workers, which reduces the efficiency of the cluster. In this paper, we propose a Dynamic Load Balance (DLB) strategy for distributed training of DNNs to tackle this issue. Specifically, the performance of each worker is first evaluated based on its behavior during previous training epochs, and the batch size and dataset partition are then adaptively adjusted according to the workers' current performance. As a result, the waiting time among workers is eliminated, thereby greatly improving cluster utilization. Furthermore, a theoretical analysis is provided to justify the convergence of the proposed algorithm. Extensive experiments have been conducted on the CIFAR10 and CIFAR100 benchmark datasets with four state-of-the-art DNN models. The experimental results indicate that the proposed algorithm significantly improves the utilization of the distributed cluster. In addition, the proposed algorithm keeps the load balance of distributed DNN training robust to disturbances and can be employed flexibly in conjunction with other synchronous distributed DNN training methods.
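
The core idea described in the abstract, measuring each worker's recent throughput and resizing its batch and data partition so that all workers finish a step in roughly the same time, can be sketched as follows. This is a minimal illustration only: the function name, the proportional-allocation rule, and the two-worker example are assumptions for exposition, not the paper's exact DLB procedure.

    # Illustrative sketch of throughput-proportional batch-size rebalancing.
    # Assumed helper, not the authors' implementation.

    def rebalance_batch_sizes(epoch_times, batch_sizes, total_batch):
        """Assign each worker a batch size proportional to its measured
        throughput, keeping the global batch size constant."""
        # Throughput of worker i = samples processed per second last epoch.
        throughputs = [b / t for b, t in zip(batch_sizes, epoch_times)]
        total = sum(throughputs)
        # Redistribute the fixed global batch in proportion to throughput.
        new_sizes = [max(1, round(total_batch * s / total)) for s in throughputs]
        # Correct rounding drift so the sizes still sum to total_batch.
        drift = total_batch - sum(new_sizes)
        new_sizes[new_sizes.index(max(new_sizes))] += drift
        return new_sizes

    # Example: worker 1 is twice as fast as worker 0.
    print(rebalance_batch_sizes(epoch_times=[2.0, 1.0],
                                batch_sizes=[64, 64],
                                total_batch=128))  # -> [43, 85]

In this toy setting the faster worker receives about twice the data of the slower one, so both finish their local computation at roughly the same time and neither idles at the synchronization barrier.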
Keywords
Local SGD, Load balance, Straggler problem, Distributed DNN training