Accelerate distributed deep learning with cluster-aware sketch quantization

Science China Information Sciences (2023)

Abstract
Gradient quantization has been widely used in the distributed training of deep neural network (DNN) models to reduce communication cost. However, existing quantization methods overlook the fact that gradients follow a nonuniform distribution that changes over time. Ignoring this distribution leads to significant compression error, which not only increases the number of training iterations but also requires more quantization bits (and consequently a longer delay per iteration) to keep the validation accuracy as high as that of the original stochastic gradient descent (SGD) approach. To address this problem, we propose cluster-aware sketch quantization (CASQ), a novel sketch-based gradient quantization method for SGD with convergence guarantees. CASQ models the nonuniform distribution of gradients via clustering, and adaptively allocates an appropriate number of hash buckets to each cluster, based on its statistics, to compress the gradients. Extensive evaluation shows that, compared to existing quantization methods, CASQ-based SGD (i) achieves the same validation accuracy while decreasing the quantization level from 3 bits to 2 bits, and (ii) reduces the training time to convergence by up to 43% for the same training loss.
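
To make the cluster-then-sketch idea in the abstract concrete, the following is a minimal, illustrative Python/NumPy sketch, not the authors' CASQ implementation. The function names, the 1-D k-means clustering of gradient magnitudes, the variance-proportional bucket-allocation heuristic, and the count-sketch encoder are all assumptions made for illustration; CASQ's actual clustering, allocation rule, and convergence machinery are described in the paper itself.

```python
# Illustrative sketch of "cluster gradients, then allocate hash buckets per cluster".
# All names and heuristics below are assumptions, not the paper's implementation.
import numpy as np


def cluster_gradients(grad, num_clusters=3, iters=10):
    """Cluster gradient values by magnitude with a tiny 1-D k-means (assumed step)."""
    x = np.abs(grad)
    centers = np.quantile(x, np.linspace(0.1, 0.9, num_clusters))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for k in range(num_clusters):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return labels, centers


def allocate_buckets(grad, labels, num_clusters, total_buckets):
    """Give clusters with larger variance more hash buckets (assumed heuristic)."""
    var = np.array([
        grad[labels == k].var() if np.any(labels == k) else 0.0
        for k in range(num_clusters)
    ]) + 1e-12
    share = var / var.sum()
    return np.maximum(1, (share * total_buckets).astype(int))


def count_sketch_encode(values, num_buckets, seed=0):
    """Compress one cluster's values into hash buckets with random signs."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, num_buckets, size=values.size)   # bucket hash
    sign = rng.choice([-1.0, 1.0], size=values.size)       # sign hash
    sketch = np.zeros(num_buckets)
    np.add.at(sketch, idx, sign * values)
    return sketch, idx, sign


def count_sketch_decode(sketch, idx, sign):
    """Unbiased estimate of the original values from the sketch."""
    return sign * sketch[idx]


if __name__ == "__main__":
    # Synthetic heavy-tailed "gradient" to mimic a nonuniform distribution.
    grad = np.random.standard_normal(10_000) * np.random.exponential(1.0, 10_000)
    labels, _ = cluster_gradients(grad, num_clusters=3)
    buckets = allocate_buckets(grad, labels, num_clusters=3, total_buckets=1024)
    recon = np.empty_like(grad)
    for k in range(3):
        mask = labels == k
        sk, idx, sg = count_sketch_encode(grad[mask], buckets[k], seed=k)
        recon[mask] = count_sketch_decode(sk, idx, sg)
    print("relative L2 error:", np.linalg.norm(recon - grad) / np.linalg.norm(grad))
```

In this toy setup, clusters holding the large-magnitude gradient values receive more buckets, so they suffer fewer hash collisions and lower reconstruction error than a single uniform sketch of the whole gradient would give, which is the intuition the abstract attributes to cluster-aware allocation.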
Keywords
distributed training, deep learning, communication, sketch, quantization