Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication

Andres Sewell, Ke Fan, Ahmedur Rahman Shovon, Landon Dyken, Sidharth Kumar, Steve Petruzza

International Conference on High Performance Computing in Asia-Pacific Region (2024)

Abstract

In high-performance computing, collective communication is critical for data exchange involving all processes within an MPI communicator. Due to their inherently global nature, many collective operations present scalability challenges, particularly the all-to-all data shuffle with its quadratic communication pattern. The Bruck algorithm uses a logarithmic communication pattern and was designed to make all-to-all data shuffles of short messages efficient. It has been used extensively for global data shuffles in multi-CPU environments and is part of the MPICH and Open MPI implementations. This work presents the first investigation of using the Bruck algorithm for all-to-all communication in multi-GPU systems using the NVIDIA Collective Communications Library (NCCL). Our experimental study demonstrates that while the Bruck algorithm exhibits superior performance for small messages in a multi-CPU environment, the same advantages are not evident in multi-GPU environments. Furthermore, we describe an optimized Bruck algorithm implementation in NCCL and compare it to NCCL’s default all-to-all and MPI-based implementations. Finally, we discuss the challenges and opportunities of implementing new multi-GPU collectives using NCCL’s public-facing API.
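The logarithmic communication pattern mentioned in the abstract can be illustrated with a small single-process simulation of the textbook Bruck all-to-all. This is a sketch for intuition only, not the authors' NCCL implementation: it uses no MPI or NCCL calls, and the function name `bruck_all_to_all` and the list-of-lists layout of the buffers are assumptions made for the example. Each of the P simulated ranks rotates its blocks locally, then takes part in ceil(log2 P) exchange rounds instead of P - 1 pairwise exchanges, and finally undoes the rotation.

```python
def bruck_all_to_all(send):
    """Simulate a Bruck-style all-to-all among p ranks in one process.

    send[r][j] is the block that rank r wants delivered to rank j.
    Returns (recv, rounds), where recv[d][s] is the block rank d
    received from rank s and rounds is the number of exchange steps.
    """
    p = len(send)

    # Phase 1: local rotation. After this, index i on rank r holds the
    # block that still has to travel a distance of i ranks.
    buf = [[send[r][(r + i) % p] for i in range(p)] for r in range(p)]

    # Phase 2: in the round for distance k (1, 2, 4, ...), every rank
    # forwards the blocks whose index has bit k set to rank (r + k) % p.
    rounds = 0
    k = 1
    while k < p:
        # Copy first so that all ranks exchange "simultaneously".
        new_buf = [row[:] for row in buf]
        for r in range(p):
            dst = (r + k) % p
            for i in range(p):
                if i & k:
                    new_buf[dst][i] = buf[r][i]
        buf = new_buf
        rounds += 1
        k <<= 1

    # Phase 3: inverse rotation. The block at index i on rank r
    # originated at rank (r - i) % p.
    recv = [[None] * p for _ in range(p)]
    for r in range(p):
        for i in range(p):
            recv[r][(r - i) % p] = buf[r][i]
    return recv, rounds
```

Calling `bruck_all_to_all` with `send[r][j] = (r, j)` for five ranks delivers `(s, d)` to `recv[d][s]` in three rounds. The trade-off is that each round forwards roughly half of a rank's blocks, so the total data volume moved is larger than in a direct exchange; that is why the algorithm targets short messages.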
