AggTree: A Routing Tree With In-Network Aggregation for Distributed Training

Jianglong Nie, Wenfei Wu

2023 IEEE International Performance, Computing, and Communications Conference (IPCCC) (2023)

Abstract
In distributed training (DT) based on the parameter server (PS) architecture, synchronizing parameters incurs heavy communication overhead in the network. In the PS architecture, workers send gradients over the network to the PS for aggregation. With the development of programmable switches, in-network aggregation (INA) has been proposed to accelerate distributed training by having programmable switches in the network aggregate gradients, rather than aggregating only at the PS. However, existing routing methods cannot fully utilize the capability of INA, resulting in load imbalance and long communication time. This paper analyzes and models the routing problem in INA under network resource constraints, and proposes a routing algorithm named AggTree that solves this problem by searching for high-rate routing paths. Simulation results show that AggTree can reduce communication time by 4.1%-37.9% for a single DT job and by 12.7%-74.0% for multiple DT jobs compared with state-of-the-art solutions.
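
To make the contrast between PS-only aggregation and INA concrete, the following is a minimal sketch of one synchronization round on a fixed routing tree. It is not the AggTree algorithm, which the abstract does not specify in detail; the topology, the node names (`ps`, `s0`, `w1`, ...), and the function `run_round` are all hypothetical and chosen only for illustration. A switch listed in `ina_switches` sums the gradients it receives and forwards one message, while any other switch forwards every message unchanged.

```python
from typing import Dict, List, Set, Tuple

Vector = List[float]

# Hypothetical routing tree: parent -> children. "ps" is the parameter server,
# "s*" are switches, "w*" are workers. None of these names come from the paper.
CHILDREN: Dict[str, List[str]] = {
    "ps": ["s0"],
    "s0": ["s1", "s2"],
    "s1": ["w1", "w2"],
    "s2": ["w3", "w4"],
}

# Hypothetical per-worker gradients.
GRADIENTS: Dict[str, Vector] = {
    "w1": [1.0, 2.0],
    "w2": [3.0, 4.0],
    "w3": [5.0, 6.0],
    "w4": [7.0, 8.0],
}


def _forward(node: str, aggregators: Set[str],
             link_load: Dict[Tuple[str, str], int]) -> List[Vector]:
    """Return the gradient messages that `node` sends toward its parent."""
    if node in GRADIENTS:  # a worker sends its own gradient
        return [GRADIENTS[node]]
    incoming: List[Vector] = []
    for child in CHILDREN.get(node, []):
        msgs = _forward(child, aggregators, link_load)
        link_load[(child, node)] = len(msgs)  # messages carried on this link
        incoming.extend(msgs)
    if node in aggregators and incoming:
        # Aggregating node: element-wise sum, then a single outgoing message.
        total = [sum(vals) for vals in zip(*incoming)]
        return [total]
    return incoming  # plain switch: forward everything unchanged


def run_round(ina_switches: Set[str]) -> Tuple[Vector, Dict[Tuple[str, str], int]]:
    """Simulate one round; the PS always aggregates whatever reaches it."""
    link_load: Dict[Tuple[str, str], int] = {}
    result = _forward("ps", set(ina_switches) | {"ps"}, link_load)
    return result[0], link_load


# PS-only aggregation versus INA on every switch.
total_ps, load_ps = run_round(set())
total_ina, load_ina = run_round({"s0", "s1", "s2"})
print(total_ps, load_ps)
print(total_ina, load_ina)
```

In this toy setup both runs produce the same aggregated gradient, but the number of messages on the links nearest the PS drops when switches aggregate. Choosing which tree to route over, under limited switch resources, is the problem the paper addresses.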
Key words
In-network aggregation, gradient routing, distributed training, programmable switch