Ultima: Robust and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz

CoRR (2023)

Abstract
We present Ultima, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of computation variability (stragglers) and communication variability (congestion and gradient drops). Ultima exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training to work with approximate gradients, and provides an efficient balance between (tail) performance and the resulting accuracy of the trained models. Exploiting this domain-specific characteristic of DDL, Ultima introduces (1) mechanisms (e.g., Transpose AllReduce, unreliable connection-oriented transport, and adaptive timeout) to improve the tail execution time of DDL jobs, and (2) strategies (e.g., Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that Ultima achieves 60% faster time-to-accuracy (TTA), on average, when operating in shared environments (e.g., the public cloud), and is on par with existing algorithms (e.g., Ring-AllReduce) in dedicated environments (e.g., HPC).
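The abstract's use of a Hadamard Transform to tolerate gradient drops can be illustrated with a small sketch. The following is a minimal illustration, not Ultima's actual implementation: it assumes a randomized (sign-flipped) Walsh-Hadamard rotation applied to a gradient vector before transmission, so that zeroing a few transformed coordinates spreads the error thinly across all original coordinates instead of erasing specific entries. All names, dimensions, and the drop rate below are illustrative.

```python
# Minimal sketch (assumption, not Ultima's code): a randomized Hadamard
# rotation makes a gradient robust to dropped coordinates by spreading
# each entry's information across all dimensions.
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; length of x must be a power of 2."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 1024                                    # gradient dimension (power of 2)
grad = rng.standard_normal(d)

signs = rng.choice([-1.0, 1.0], d)          # random diagonal sign flips
rotated = fwht(signs * grad) / np.sqrt(d)   # orthonormal randomized Hadamard

drop = rng.random(d) < 0.05                 # simulate 5% dropped coordinates
rotated[drop] = 0.0

# The Hadamard matrix is self-inverse up to scaling, so applying the
# transform and the sign flips again recovers the gradient estimate.
recovered = signs * fwht(rotated) / np.sqrt(d)

print("relative L2 error:",
      np.linalg.norm(recovered - grad) / np.linalg.norm(grad))
```

Because the Hadamard matrix is orthogonal and self-inverse up to scaling, the receiver undoes the rotation cheaply, and each dropped coordinate contributes only a small, evenly spread perturbation to the recovered gradient rather than zeroing out specific entries.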
Keywords
deep learning, tail-optimal