Optimizing Distributed ML Communication with Fused Computation-Collective Operations
arXiv (2023)
Abstract
To satisfy their ever-increasing capacity and compute requirements,
machine learning models are distributed across multiple nodes using numerous
parallelism strategies. As a result, collective communications are often on the
critical path, and hiding their latency by overlapping kernel-granular
communication and computation is difficult due to the absence of independent
computation. In this work, we propose fusing computation with dependent
collective communication by leveraging GPUs' massive parallelism and
GPU-initiated communication. We have developed self-contained GPU kernels where
workgroups (WGs) immediately communicate their results to remote GPUs when they
complete their computation. Meanwhile, other WGs within the same kernel perform
overlapping computation, maintaining high ALU utilization.
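This workgroup-level compute-then-communicate pattern can be made concrete with a small sketch. The kernel below is illustrative only, assuming an NVSHMEM-style GPU-initiated communication API as a stand-in for the paper's (AMD GPU) implementation; the name `fused_gemv_scatter`, the round-robin row-to-GPU mapping, and the symmetric buffer layout are assumptions, not the authors' code.

```cuda
// Minimal sketch of a fused computation + collective kernel in the spirit of
// the paper: work finished early is pushed to the remote GPU that consumes it
// while the rest of the grid keeps computing, hiding communication latency.
//
// Assumptions (not from the paper): NVSHMEM as the GPU-initiated
// communication layer, a round-robin row->PE mapping, and `out` allocated
// with nvshmem_malloc(n_pes * rows * sizeof(float)) on every PE.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void fused_gemv_scatter(const float *W,   // local [rows x cols] shard
                                   const float *x,   // local [cols] input
                                   float *out,       // symmetric [n_pes * rows]
                                   int rows, int cols) {
    const int my_pe = nvshmem_my_pe();
    const int n_pes = nvshmem_n_pes();

    // Grid-stride loop: workgroups pick up rows independently, so results that
    // finish early are communicated immediately while other workgroups are
    // still computing. (The paper communicates at tile granularity; per-element
    // puts are used here only to keep the sketch short.)
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < rows;
         row += gridDim.x * blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c)
            acc += W[(size_t)row * cols + c] * x[c];

        // GPU-initiated put: send the finished element straight to the PE that
        // consumes this row, into the slot reserved for this source PE.
        const int dest_pe = row % n_pes;
        nvshmem_float_p(&out[(size_t)my_pe * rows + row], acc, dest_pe);
    }
    // Completion signaling (e.g., flag puts after nvshmem_fence()) is elided.
}
```

On the host this launches like any ordinary kernel after `nvshmem_init()`; the key property, mirroring the abstract, is that communication is issued from inside the compute kernel rather than by a separate collective launched afterward.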
We demonstrate our approach by creating three prototype fused operators
(embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address
the pervasive communication overheads observed in DLRM, Transformers, and MoE
model architectures. To demonstrate that our approach can be integrated into
ML frameworks for wide adoption in production environments, we expose our
fused operators as new PyTorch operators and extend the Triton framework to
enable them.
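As a sketch of how such a fused operator could be surfaced as a PyTorch custom op (the abstract names no APIs, so the `fused_cc` namespace, the op schema, and the wrapper below are hypothetical), one could use the standard `TORCH_LIBRARY` registration machinery:

```cuda
// Hypothetical PyTorch C++/CUDA extension exposing a fused GEMV + AllReduce
// as a custom operator; the op name and schema are illustrative assumptions.
#include <torch/extension.h>

torch::Tensor gemv_allreduce(torch::Tensor W, torch::Tensor x) {
    TORCH_CHECK(W.is_cuda() && x.is_cuda(), "expected CUDA tensors");
    auto out = torch::empty({W.size(0)}, W.options());
    // A real implementation would launch the fused compute+communicate kernel
    // (e.g., the sketch above) on the current stream; launch details elided.
    return out;
}

// Makes the op callable from Python as torch.ops.fused_cc.gemv_allreduce(W, x).
TORCH_LIBRARY(fused_cc, m) {
    m.def("gemv_allreduce(Tensor W, Tensor x) -> Tensor");
}
TORCH_LIBRARY_IMPL(fused_cc, CUDA, m) {
    m.impl("gemv_allreduce", &gemv_allreduce);
}
```

Once built (e.g., via torch.utils.cpp_extension.load), such an operator participates in dispatch like any built-in op, which is what enables drop-in use inside existing model code.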
Our evaluations show that our approach effectively overlaps communication with
computation, reducing their combined execution time below that of current
collective library-based
approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations
achieve up to 22% and 20% lower execution time, while our fused embedding +
All-to-All reduces execution time by 20% and 31% for intra-node and inter-node
configurations. Large scale-out simulations indicate that our approach reduces
DLRM execution time by 21% on a 128-node system.