ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, Shanyuan Gao, Xin Long, Jie Zhang, Yong Li, Zhisheng Xia, Liuyihan Song, Yingya Zhang, Pan Pan, Guohui Wang, Xiaowei Jiang

IEEE Micro (2021)

Cited by: 3 | Views: 182
Abstract
Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that fully exploit heterogeneous interconnects simultaneously, and our experimental results show significant performance improvements.
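The abstract does not expose any of ACCL's interfaces or algorithm details, but the ring all-reduce is a representative collective that communication libraries of this kind optimize. The sketch below is a minimal in-process simulation of that pattern, assuming equal-length vectors and a sum reduction; the function name `ring_allreduce` and the simulated "workers" are assumptions of this example, not ACCL's API.

```python
# A minimal, self-contained simulation of the ring all-reduce pattern --
# one representative collective that libraries like ACCL optimize.
# Illustrative sketch only; this is NOT ACCL's actual interface.

def ring_allreduce(worker_data):
    """Sum-reduce equal-length vectors across simulated workers on a ring."""
    n = len(worker_data)                      # number of workers in the ring
    chunks = [list(v) for v in worker_data]   # each worker's local buffer
    size = len(chunks[0])
    # Partition every buffer into n contiguous chunks.
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) mod n
    # to its right neighbor, which accumulates it. After n - 1 steps each
    # worker owns one fully reduced chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, chunks[r][slice(*bounds[(r - s) % n])])
                 for r in range(n)]           # snapshot payloads first
        for r, c, payload in sends:
            lo, _ = bounds[c]
            dst = (r + 1) % n
            for i, x in enumerate(payload):
                chunks[dst][lo + i] += x

    # Phase 2: all-gather. Each worker forwards the chunk it has just
    # completed around the ring until every worker holds every chunk.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n,
                  chunks[r][slice(*bounds[(r + 1 - s) % n])])
                 for r in range(n)]
        for r, c, payload in sends:
            lo, hi = bounds[c]
            chunks[(r + 1) % n][lo:hi] = payload

    return chunks  # every worker now holds the element-wise sum


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    result = ring_allreduce(data)
    assert all(buf == [111, 222, 333, 444] for buf in result)
    print(result[0])  # [111, 222, 333, 444]
```

The ring algorithm is bandwidth-optimal on a homogeneous link: each worker transmits roughly 2(n-1)/n times the payload size regardless of worker count, which is why variants of it underpin collectives in libraries targeting linear training scalability.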
Keywords
Servers, Training, Bandwidth, Routing, Fabrics, Payloads, Parallel algorithms