Full-Stack Allreduce on Multi-Rail Networks
CoRR (2024)
Abstract
High communication costs impede scalability in distributed systems.
Multimodal models like Sora exacerbate this issue by requiring more resources
than current networks can support, and existing network architectures fail
to address this gap. In this paper, we provide full-stack support for allreduce
on multi-rail networks, aiming to overcome the scalability limitations of
large-scale networks by facilitating collaborative data transfer across various
networks. To achieve this, we propose the Nezha system, which integrates TCP,
the in-network computing protocol SHARP, and the RDMA-based protocol GLEX. To maximize
data transfer rates, Nezha incorporates a load-balancing data allocation scheme
based on cost feedback and combines it with exception handling to achieve reliable data
transmission. Our experiments on a six-node cluster demonstrate that Nezha
improves allreduce performance by 58% to 87% in homogeneous
dual-rail configurations and offers considerable acceleration in heterogeneous
settings, contingent on the performance variance among networks.
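
The abstract does not give the details of the allocation scheme, so the following is only a minimal sketch of what cost-feedback load balancing across rails could look like. It assumes each rail keeps a running estimate of its cost in seconds per byte, that data is split in inverse proportion to that cost, and that a failed rail is marked with a cost of None and receives no data. The names split_by_cost and update_cost, the smoothing factor alpha, and the rail labels are hypothetical and are not taken from Nezha.

```python
def split_by_cost(total_bytes, cost_per_byte):
    """Split total_bytes across rails in inverse proportion to each rail's cost.

    cost_per_byte maps a rail name to a positive cost estimate (seconds per
    byte), or to None if the rail is considered failed (assumption).
    """
    # Cheaper rails get larger weights; failed rails get weight zero.
    weights = {
        rail: (1.0 / cost if cost is not None else 0.0)
        for rail, cost in cost_per_byte.items()
    }
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise RuntimeError("no healthy rail available")

    shares = {
        rail: int(total_bytes * w / total_weight) for rail, w in weights.items()
    }
    # Hand any rounding remainder to the cheapest rail so shares sum exactly.
    cheapest = max(weights, key=weights.get)
    shares[cheapest] += total_bytes - sum(shares.values())
    return shares


def update_cost(cost_per_byte, rail, sent_bytes, elapsed_s, alpha=0.5):
    """Feed a measured transfer back into the rail's cost estimate.

    Uses an exponential moving average; alpha is a hypothetical tuning knob.
    """
    if sent_bytes == 0:
        return  # nothing was sent, so there is nothing to learn from
    measured = elapsed_s / sent_bytes
    old = cost_per_byte[rail]
    cost_per_byte[rail] = (
        measured if old is None else (1 - alpha) * old + alpha * measured
    )
```

In this sketch, a collective would call split_by_cost before each allreduce, time each rail's transfer, and then call update_cost so the next split reflects the observed performance. Marking a rail as None shifts its share to the remaining rails, which is one simple way to pair load balancing with exception handling.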