Compressed Collective Sparse-Sketch for Distributed Data-Parallel Training of Deep Learning Models.

IEEE J. Sel. Areas Commun. (2023)

Abstract
Distributed data-parallel training (DDP) is prevalent in large-scale deep learning. To increase training throughput and scalability, high-performance collective communication methods such as AllReduce have recently proliferated for DDP use. However, these approaches require long communication periods as model sizes grow. Collective communication transmits many sparse gradient values that can be efficiently compressed to reduce the required training time. State-of-the-art compression approaches do not provide mergeable compression for AllReduce and lack convergence bounds. We present a sparse sketch reducer (S2Reducer), a sparsity-preserving, sketch-based collective communication method. S2Reducer preserves gradient sparsity and reduces communication costs via a bitmap-informed count sketch structure, and it is compatible with efficient AllReduce operators. We tune the count sketch organization to minimize hash conflicts within a fixed-size budget. We prove that our method has the same convergence rate as vanilla data-parallel training and a much smaller communication overhead than state-of-the-art methods. We implement a GPU-accelerated S2Reducer for a Ring AllReduce-based DDP system. We perform extensive evaluations against four state-of-the-art methods across seven deep learning models. Our results show that S2Reducer converges to the same accuracy as state-of-the-art approaches while reducing the sparse communication overhead by up to 86% and achieving a speedup of up to $3.5\times$ in distributed training.
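To make the mechanism described in the abstract concrete, the snippet below is a minimal illustrative Python/NumPy sketch (not the paper's implementation) of a count sketch paired with a bitmap of nonzero gradient coordinates. The class name `SparseCountSketch`, the universal-hash construction, and all sizes are assumptions for illustration only. The key property shown is mergeability: sketches from different workers combine by element-wise addition (and bitmaps by logical OR), which is what makes such a structure compatible with AllReduce-style reduction.

```python
# Illustrative sketch only: a count sketch plus a bitmap of nonzero
# gradient coordinates. Hash choices, sizes, and names are assumptions,
# not the paper's actual design.
import numpy as np

PRIME = 2_147_483_647  # large prime for universal hashing (assumed)

class SparseCountSketch:
    def __init__(self, dim, rows=3, width=256, seed=0):
        rng = np.random.default_rng(seed)        # same seed on every worker
        self.dim, self.rows, self.width = dim, rows, width
        # per-row bucket hash: h_r(i) = ((a_r * i + b_r) mod PRIME) mod width
        self.a = rng.integers(1, PRIME, size=rows)
        self.b = rng.integers(0, PRIME, size=rows)
        # per-row sign hash: s_r(i) in {-1, +1}
        self.sa = rng.integers(1, PRIME, size=rows)
        self.sb = rng.integers(0, PRIME, size=rows)
        self.table = np.zeros((rows, width))
        self.bitmap = np.zeros(dim, dtype=bool)   # marks nonzero coordinates

    def _bucket(self, r, idx):
        return ((self.a[r] * idx + self.b[r]) % PRIME) % self.width

    def _sign(self, r, idx):
        return 1.0 - 2.0 * (((self.sa[r] * idx + self.sb[r]) % PRIME) % 2)

    def insert(self, grad):
        """Insert the nonzero entries of a (sparse) gradient vector."""
        nz = np.nonzero(grad)[0]
        self.bitmap[nz] = True
        for r in range(self.rows):
            for i in nz:
                self.table[r, self._bucket(r, i)] += self._sign(r, i) * grad[i]

    def merge(self, other):
        """Element-wise merge -- the operation an AllReduce sum would perform."""
        self.table += other.table
        self.bitmap |= other.bitmap

    def decode(self):
        """Estimate the summed gradient, querying only bitmap-marked coordinates."""
        out = np.zeros(self.dim)
        for i in np.nonzero(self.bitmap)[0]:
            est = [self._sign(r, i) * self.table[r, self._bucket(r, i)]
                   for r in range(self.rows)]
            out[i] = np.median(est)
        return out

# Usage: two workers with sparse gradients; the merged sketch approximates their sum.
dim = 10_000
g1, g2 = np.zeros(dim), np.zeros(dim)
g1[[3, 42, 99]] = [0.5, -1.2, 2.0]
g2[[7, 42]] = [1.1, 0.3]

s1, s2 = SparseCountSketch(dim), SparseCountSketch(dim)
s1.insert(g1); s2.insert(g2)
s1.merge(s2)                      # stands in for the AllReduce step
approx = s1.decode()
print(approx[42], (g1 + g2)[42])  # estimate vs. exact -0.9
```

Because both the count sketch table and the bitmap combine with simple associative operations (sum and OR), the compressed representations can be reduced across workers without decompression, which is the mergeability property the abstract highlights.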
Keywords
Distributed training, deep learning, sparse, communication, sketch