Self-Adaptive Gradient Quantization for Geo-Distributed Machine Learning Over Heterogeneous and Dynamic Networks

IEEE TRANSACTIONS ON CLOUD COMPUTING (2023)

Abstract
Geo-Distributed Machine Learning (Geo-DML) coordinates geographically dispersed data centers (DCs) to train large-scale machine learning (ML) models for various applications. While Geo-DML can achieve excellent performance, it injects massive data traffic into Wide Area Networks (WANs) to exchange gradients during the training process. This traffic not only causes network congestion and prolongs training, but also leads to the straggler problem when DCs operate in heterogeneous network environments. To alleviate these problems, we propose Self-Adaptive Gradient Quantization (SAGQ) for Geo-DML. In SAGQ, each worker DC adopts a specific quantization method based on its heterogeneous and dynamic link bandwidth, reducing communication overhead and balancing communication time across worker DCs. As a result, SAGQ speeds up Geo-DML training without sacrificing model performance. Extensive experiments show that, compared with state-of-the-art techniques, SAGQ reduces the wall-clock time to train an ML model by 1.13x-21.31x, and also improves model accuracy by 0.11%-2.27% over the baselines.
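The abstract does not include pseudocode, so the sketch below illustrates one plausible realization of bandwidth-adaptive gradient quantization, not the paper's actual SAGQ algorithm: each worker DC picks the finest bit-width whose estimated transfer time fits a shared communication budget, then applies unbiased stochastic uniform quantization (in the spirit of QSGD-style quantizers). All names and parameters here (choose_bits, stochastic_quantize, target_time_s, the candidate bit-widths) are illustrative assumptions.

```python
import numpy as np

def choose_bits(bandwidth_mbps, grad_bytes, target_time_s,
                candidate_bits=(2, 4, 8, 16, 32)):
    """Pick the largest bit-width whose estimated transfer time fits
    the per-round communication budget for this worker's WAN link.

    NOTE: hypothetical helper, not from the paper.
    bandwidth_mbps -- currently measured link bandwidth of the worker DC
    grad_bytes     -- size of the full-precision (32-bit) gradient in bytes
    target_time_s  -- communication-time budget shared by all workers,
                      used to balance communication time across DCs
    """
    for bits in sorted(candidate_bits, reverse=True):
        compressed_bytes = grad_bytes * bits / 32.0
        transfer_time = compressed_bytes * 8 / (bandwidth_mbps * 1e6)
        if transfer_time <= target_time_s:
            return bits
    return min(candidate_bits)  # slowest links fall back to the coarsest level

def stochastic_quantize(grad, bits):
    """Unbiased stochastic uniform quantization to 2**bits - 1 levels."""
    if bits >= 32:
        return grad  # keep full precision
    levels = 2 ** bits - 1
    scale = np.abs(grad).max()
    if scale == 0.0:
        return grad
    normalized = np.abs(grad) / scale * levels
    lower = np.floor(normalized)
    # round up with probability equal to the fractional part,
    # so the quantizer is unbiased: E[q] = grad
    quantized = lower + (np.random.rand(*grad.shape) < normalized - lower)
    return np.sign(grad) * quantized / levels * scale

# Example: a worker with a 100 Mbps link and a 0.5 s budget
grad = np.random.randn(1_000_000).astype(np.float32)
bits = choose_bits(bandwidth_mbps=100, grad_bytes=grad.nbytes, target_time_s=0.5)
compressed = stochastic_quantize(grad, bits)
```

Because each worker re-evaluates choose_bits as its measured bandwidth changes, slow links send coarser (smaller) gradients while fast links keep higher precision, which is how per-link adaptation can balance communication time without stalling on stragglers.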
Keywords
Geo-distributed machine learning, gradient quantization, resource scheduling, wide area network