AggTree: A Routing Tree With In-Network Aggregation for Distributed Training

Jianglong Nie, Wenfei Wu

2023 IEEE International Performance, Computing, and Communications Conference (IPCCC) (2023)

Abstract
In distributed training (DT) based on the parameter server (PS) architecture, workers send gradients over the network to the PS for aggregation, and this parameter synchronization incurs heavy communication overhead. With the development of programmable switches, in-network aggregation (INA) has been proposed to accelerate distributed training by aggregating gradients at programmable switches inside the network, not only at the PS. However, existing routing methods cannot fully exploit the capability of INA, which results in load imbalance and long communication time. This paper analyzes and models the routing problem in INA under network resource constraints and proposes a routing algorithm, AggTree, that solves it by searching for high-rate routing paths. Simulation results show that AggTree reduces communication time by 4.1%-37.9% for a single DT job and by 12.7%-74.0% for multiple DT jobs compared with state-of-the-art solutions.
Keywords
In-network aggregation, gradient routing, distributed training, programmable switch
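
To make the contrast drawn in the abstract between PS-only aggregation and INA concrete, the following is a minimal Python sketch that counts gradient messages per link on a small tree. It is not the AggTree algorithm, which the abstract does not specify. The toy topology, the function name `link_loads`, and the model in which each worker emits one message and an aggregating switch forwards one merged message are all assumptions made for illustration.

```python
from collections import defaultdict


def link_loads(parent, workers, aggregating):
    """Count gradient messages per uplink on a tree rooted at the PS.

    parent:      maps each node to its next hop toward the PS (hypothetical topology).
    workers:     leaf nodes; each emits exactly one gradient message (assumption).
    aggregating: switches that merge all incoming gradients into one message.
    """
    children = defaultdict(list)
    for node, up in parent.items():
        children[up].append(node)

    def outgoing(node):
        # A worker sends its own gradient; a switch forwards what it receives,
        # collapsed to a single message if it performs in-network aggregation.
        if node in workers:
            return 1
        incoming = sum(outgoing(child) for child in children[node])
        return 1 if node in aggregating and incoming > 0 else incoming

    return {(node, up): outgoing(node) for node, up in parent.items()}


# Hypothetical two-level tree: four workers, two edge switches, one core switch.
parent = {
    "w1": "s1", "w2": "s1",
    "w3": "s2", "w4": "s2",
    "s1": "s0", "s2": "s0",
    "s0": "ps",
}
workers = {"w1", "w2", "w3", "w4"}

ps_only = link_loads(parent, workers, aggregating=set())
with_ina = link_loads(parent, workers, aggregating={"s0", "s1", "s2"})

print(ps_only[("s0", "ps")])   # 4 messages reach the PS without INA
print(with_ina[("s0", "ps")])  # 1 message reaches the PS with INA at every switch
```

In this toy model, enabling aggregation on only some switches leaves the remaining links carrying unmerged traffic, which gives a rough intuition for why the choice of routing tree affects load balance when INA resources are limited.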