GRID: Gradient Routing With In-Network Aggregation for Distributed Training

IEEE/ACM Transactions on Networking (2023)

Abstract
As the scale of distributed training grows, the communication overhead in clusters increases dramatically. Some works attempt to reduce this cost through gradient compression or communication scheduling; however, these methods either degrade training accuracy or do not reduce the total transmission volume. One promising approach, called in-network aggregation, mitigates the bandwidth bottleneck in clusters by aggregating gradients in programmable hardware (e.g., Intel Tofino switches). However, existing solutions mainly implement in-network aggregation over fixed (or default) routing paths, resulting in load imbalance and long communication time. To address this issue, we propose GRID, the first-of-its-kind work on Gradient Routing with In-network Aggregation for Distributed Training. In the control plane, we present an efficient gradient routing algorithm based on randomized rounding and formally analyze its approximation performance. In the data plane, we realize in-network aggregation by carefully designing the logic of workers and programmable switches. We implement GRID and evaluate its performance on a small-scale testbed consisting of 3 Intel Tofino switches and 9 commodity servers. With a combination of testbed experiments and large-scale simulations, we show that GRID reduces communication time by 38.4%–60.1% and speeds up distributed training by 17.4%–52.7% compared with state-of-the-art solutions.
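The abstract does not give the routing formulation in detail, but the randomized-rounding technique it names typically works by first solving a fractional (LP-relaxed) routing problem and then committing each gradient flow to a single path chosen with probability equal to that path's fractional weight, so the expected load on every link matches the fractional optimum. The minimal Python sketch below illustrates that generic rounding step only; the flow names, path tuples, and weights are hypothetical and not taken from the paper.

    import random

    def randomized_round(fractional_solution):
        """Round a fractional routing solution to one path per flow.

        fractional_solution maps each flow id to a list of
        (path, weight) pairs, where the weights come from an LP
        relaxation and sum to 1 for every flow.
        """
        routing = {}
        for flow, options in fractional_solution.items():
            paths, weights = zip(*options)
            # Pick one path with probability equal to its fractional
            # weight, so the expected load on every link equals the
            # load in the fractional (LP) solution.
            routing[flow] = random.choices(paths, weights=weights, k=1)[0]
        return routing

    # Hypothetical example: two worker flows, each with two candidate
    # paths through aggregation switches S1 and S2 toward a parameter
    # server "ps" (names are illustrative, not from the paper).
    fractional = {
        "worker0": [(("w0", "S1", "ps"), 0.7), (("w0", "S2", "ps"), 0.3)],
        "worker1": [(("w1", "S1", "ps"), 0.4), (("w1", "S2", "ps"), 0.6)],
    }
    print(randomized_round(fractional))

Approximation analyses of this kind of rounding usually bound how far the realized link loads can deviate from the fractional optimum with high probability, which is presumably the flavor of the formal analysis the paper refers to.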
Keywords
In-network aggregation, gradient routing, distributed training, datacenter network, programmable network