Distributed Adaptive Optimization with Divisible Communication

MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PART III (2023)

Abstract
Synchronous distributed training scales the training of deep neural networks to large-scale data and has therefore been widely adopted in large-scale applications. However, it often suffers from a communication bottleneck, and many methods have been proposed to reduce the communication cost. These communication-reduction methods, however, often degrade the performance of adaptive optimizers, largely due to their non-linearity. To address this challenging issue, we propose a novel method that divides the communication into foreground and background communication. The foreground communication is more informative but can be kept low-cost to achieve communication efficiency, while the background communication runs in the background and requires no synchronization time. Using Adam as the base optimizer, we achieve a 1024x foreground compression ratio on CIFAR-10, 128x on non-iid CIFAR-10, 64x on the ImageNet image classification task, and 128x on the WMT'16 EN-DE machine translation task with comparable performance, which leads to 7x, 6.4x, 3.5x, and 7x training speedups, respectively. Moreover, we provide a rigorous theoretical analysis proving that our method obtains the same convergence rate as Adam and achieves linear speedup with respect to the number of workers.
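As a rough illustration of the foreground/background split described in the abstract, the sketch below decomposes a single optimizer update into a sparse part that would be synchronized immediately (foreground) and a dense remainder that would be reduced without blocking (background). The names `topk_compress` and `divisible_step`, the top-k compressor, and the identity `allreduce` stand-in are illustrative assumptions only, not the paper's actual algorithm or API.

```python
import numpy as np

def topk_compress(vec, k):
    """Keep the k largest-magnitude entries (illustrative compressor, not the paper's)."""
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    sparse = np.zeros_like(vec)
    sparse[idx] = vec[idx]
    return sparse

def divisible_step(update, residual, k, allreduce):
    """Split one optimizer update into foreground and background parts.

    `allreduce` stands in for the collective operation; in a real setup the
    background call would be issued asynchronously and overlapped with compute.
    """
    total = update + residual
    foreground = topk_compress(total, k)   # informative, low-cost part (synchronized now)
    background = total - foreground        # remainder, reduced with no synchronization time
    synced_fg = allreduce(foreground)      # blocking but cheap because it is sparse
    synced_bg = allreduce(background)      # would run in the background in practice
    return synced_fg, synced_bg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    update = rng.normal(size=1000)         # e.g., an Adam update for one parameter tensor
    residual = np.zeros_like(update)
    fg, bg = divisible_step(update, residual, k=8, allreduce=lambda x: x)
    print("foreground nonzeros:", np.count_nonzero(fg))
```

In an actual distributed run, the background all-reduce would be launched asynchronously (for example, overlapped with the next forward/backward pass), so only the small foreground message contributes to synchronization time.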
Keywords
Adaptive Optimization, Communication Efficiency, Distributed Training