GLEX_Allreduce: Optimization for medium and small message of Allreduce on Tianhe system.

Peng Liu, Jintao Peng, Jie Liu, Min Xie, Liuhua Chi

International Conference on Parallel and Distributed Systems (2023)

Abstract
Global communication can limit the scalability of parallel applications. The Message Passing Interface (MPI) provides several commonly used collective communication Application Programming Interfaces (APIs), and Allreduce is among the most frequently used in parallel applications; small-message Allreduce in particular is important for dot products and for solving linear systems. This paper proposes a medium- and small-message Allreduce for the Tianhe series of systems. For intra-node reduction/broadcast, it proposes a cache-aware tree with a shared-memory implementation, together with broadcast-merge and cache-line-awareness methods that further improve performance. For inter-node communication, it proposes zero-event Remote Direct Memory Access (RDMA) to avoid event overhead on the Tianhe system; in addition, zero-event immediate-data RDMA (Imm-RDMA) is used to optimize small-message RDMA. In experiments with 16384 MPI processes, GLEX_Allreduce achieves a 2.4-5.1x speedup over MPI. Compared to other collective communication libraries on InfiniBand (IB) and Omni-Path, GLEX_Allreduce achieves similar or better performance.
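To ground the small-message use case the abstract cites, the sketch below computes a distributed dot product with a single-element MPI_Allreduce. This is a minimal illustration of the standard MPI API only, not the paper's GLEX_Allreduce implementation; the vector length and contents are assumptions made for the example.

```c
/* Minimal sketch: distributed dot product via a small-message
 * MPI_Allreduce (an 8-byte reduction, the case the paper targets).
 * Vector size and fill values are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank holds a local chunk of two vectors (toy data). */
    enum { N = 1024 };
    double x[N], y[N], local = 0.0, global = 0.0;
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Local partial dot product. */
    for (int i = 0; i < N; i++) local += x[i] * y[i];

    /* Small-message Allreduce: sum one double across all ranks. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("dot = %f\n", global);
    MPI_Finalize();
    return 0;
}
```

Because the reduced payload is a single double, the latency of the collective, not bandwidth, dominates; this is why the paper's intra-node shared-memory tree and zero-event RDMA optimizations target exactly this message range.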
Keywords
Communication algorithms, Shared memory, Cache-aware, Tianhe