Designing In-network Computing Aware Reduction Collectives in MPI

2023 IEEE Symposium on High-Performance Interconnects (HOTI), 2023

Abstract
The Message Passing Interface (MPI) provides convenient abstractions such as MPI_Allreduce for inter-process collective reduction operations. With the advent of deep learning and large-scale HPC systems, it is increasingly important to optimize the latency of MPI_Allreduce for large messages. Because MPI_Allreduce involves substantial computation and communication, it is beneficial to offload the collective computation/communication to the network, allowing the CPU to work on other important operations while providing maximal overlap and scalability. NVIDIA's HDR InfiniBand switches provide in-network computing for this purpose through the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), with two protocols targeted at different message ranges: 1) Low Latency Tree (LLT) for small messages, and 2) Streaming Aggregation Tree (SAT) for large messages. In this paper, we first analyze the overheads involved in using SHARP-based reductions with SAT in an MPI library using micro-benchmarks. Next, we propose designs for large-message MPI_Allreduce that fully utilize the capabilities of the SHARP runtime while overcoming various bottlenecks. The efficacy of our proposed designs is demonstrated using micro-benchmark results. We observe up to 89% improvement over MVAPICH2-X and HPC-X for large-message reductions.
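
For context, the following is a minimal sketch of a large-message MPI_Allreduce latency micro-benchmark in C, in the spirit of the micro-benchmarks referred to above. The message size, iteration counts, and timing approach are illustrative assumptions, not the authors' exact benchmark.

/*
 * Minimal sketch of a large-message MPI_Allreduce latency micro-benchmark.
 * Message size, iteration count, and timing method are illustrative
 * assumptions, not the benchmark used in the paper.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t count = 1 << 22;   /* 4M floats = 16 MB message (assumed size) */
    const int iters = 100;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *sendbuf = malloc(count * sizeof(float));
    float *recvbuf = malloc(count * sizeof(float));
    for (size_t i = 0; i < count; i++)
        sendbuf[i] = 1.0f;

    /* Warm-up iterations so any one-time setup (e.g., reduction-tree
       construction inside the MPI library) is excluded from timing. */
    for (int i = 0; i < 10; i++)
        MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
    double elapsed = (MPI_Wtime() - start) / iters;

    if (rank == 0)
        printf("Avg MPI_Allreduce latency for %zu floats: %f us\n",
               count, elapsed * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Whether the reduction is actually offloaded to the switches via SHARP SAT is determined by the MPI library's configuration (for example, SHARP-related settings in MVAPICH2-X or HPC-X) rather than by the application code, which is what makes the in-network offload transparent to MPI programs.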
Key words
HPC, InfiniBand, MPI, In-network computing, NVIDIA SHARP, Allreduce