An Adaptive Load Balancing Strategy for Distributed Machine Learning

2023 3rd International Conference on Frontiers of Electronics, Information and Computation Technologies (ICFEICT)

Abstract
In a distributed deep learning training system, performance differences among computing nodes, as well as external environmental factors, can cause training interruptions or slow convergence. This paper addresses the issue with a dynamic task allocation strategy among nodes, aimed at mitigating the impact of performance discrepancies on the training efficiency of distributed deep learning systems. The proposed approach, the "Auto weight-based load balancing strategy" (Auto-WLBS), dynamically adjusts task allocation across computing nodes according to their performance characteristics. To exploit each node's computing power while minimizing the straggler effect on overall training, Auto-WLBS partitions and redistributes tasks through an initial division followed by partial adjustment. Combining Auto-WLBS with the LSP model yields the proposed AW-LSP model. Finally, comparative experiments were conducted on the CIFAR-10 and CIFAR-100 datasets, and Auto-WLBS was evaluated in terms of training loss, model accuracy, and training time. The experimental results show that, compared with the BSP, SSP, and LSP models, the AW-LSP model incurs lower communication overhead, with a reduction of up to 23.70%. Compared with the SSP and LSP models, model accuracy improves by up to 14.5% and 8.6%, respectively.
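The abstract does not give Auto-WLBS's exact weighting formulas, but the described scheme (an initial division of tasks followed by partial adjustment based on node performance) can be illustrated with a minimal sketch. The sketch below assumes each worker's share of a global mini-batch is set in proportion to its measured throughput and then nudged between the slowest and fastest workers; the function and variable names (assign_shares, adjust_shares, throughput, iter_time) are illustrative, not from the paper.

```python
# Hypothetical sketch of a weight-based load balancing step; the real Auto-WLBS
# formulas are not given in the abstract, so this only mirrors its two phases:
# an initial throughput-proportional partition and a small partial adjustment.

def assign_shares(throughput: dict[str, float], global_batch: int) -> dict[str, int]:
    """Initial division: split the global mini-batch proportionally to throughput.

    throughput: samples/second measured per worker over the last interval.
    global_batch: total number of samples to distribute this iteration.
    """
    total = sum(throughput.values())
    shares = {w: int(global_batch * t / total) for w, t in throughput.items()}
    # Hand any remainder left by integer rounding to the fastest worker.
    shares[max(throughput, key=throughput.get)] += global_batch - sum(shares.values())
    return shares


def adjust_shares(shares: dict[str, int], iter_time: dict[str, float],
                  step: int = 4) -> dict[str, int]:
    """Partial adjustment: move a few samples from the slowest worker to the fastest."""
    slowest = max(iter_time, key=iter_time.get)   # longest iteration time
    fastest = min(iter_time, key=iter_time.get)   # shortest iteration time
    moved = min(step, shares[slowest])
    shares[slowest] -= moved
    shares[fastest] += moved
    return shares


if __name__ == "__main__":
    # Example: three heterogeneous workers sharing a global batch of 256 samples.
    shares = assign_shares({"gpu0": 900.0, "gpu1": 600.0, "gpu2": 300.0}, 256)
    print(shares)   # {'gpu0': 129, 'gpu1': 85, 'gpu2': 42}
    shares = adjust_shares(shares, {"gpu0": 0.28, "gpu1": 0.43, "gpu2": 0.85})
    print(shares)   # a few samples shift from the slowest (gpu2) to the fastest (gpu0)
```

In this reading, the proportional split limits how long the fastest nodes idle each iteration, and the repeated small adjustments track drift in node performance without reshuffling the whole workload at once.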
Keywords
Distributed machine learning, parameter server architecture, task allocation tuning, dynamic scheduling