Asynchronous Local-SGD Training for Language Modeling
CoRR(2024)
Abstract
Local stochastic gradient descent (Local-SGD), also referred to as federated
averaging, is an approach to distributed optimization where each device
performs more than one SGD update per communication. This work presents an
empirical study of asynchronous Local-SGD for training language models;
that is, each worker updates the global parameters as soon as it has finished
its SGD steps. We conduct a comprehensive investigation by examining how worker
hardware heterogeneity, model size, number of workers, and optimizer could
impact the learning performance. We find that with naive implementations,
asynchronous Local-SGD takes more iterations to converge than its synchronous
counterpart despite updating the (global) model parameters more frequently. We
identify momentum acceleration on the global parameters when worker gradients
are stale as a key challenge. We propose a novel method that utilizes a delayed
Nesterov momentum update and adjusts the workers' local training steps based on
their computation speed. This approach, evaluated with models up to 150M
parameters on the C4 dataset, matches the performance of synchronous Local-SGD
in terms of perplexity per update step, and significantly surpasses it in terms
of wall clock time.
MoreTranslated text
AI Read Science
Must-Reading Tree
Example
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined