Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
arXiv (2024)
Abstract
Sparsely-activated Mixture-of-Experts (MoE) layers have found practical
applications in enlarging the model size of large-scale foundation models, with
only a sub-linear increase in computational demands. Despite the wide adoption of
hybrid parallel paradigms like model parallelism, expert parallelism, and
expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on
GPU clusters, the training efficiency is hindered by communication costs
introduced by these parallel paradigms. To address this limitation, we propose
Parm, a system that accelerates MP+EP+ESP training by designing two dedicated
schedules for placing communication tasks. The proposed schedules eliminate
redundant computations and communications and enable overlaps between
intra-node and inter-node communications, ultimately reducing the overall
training time. As the two schedules are not mutually exclusive, we provide
comprehensive theoretical analyses and derive an automatic and accurate
solution to determine which schedule should be applied in different scenarios.
Experimental results on an 8-GPU server and a 32-GPU cluster demonstrate that
Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE,
achieving 1.13× to 5.77× speedup on 1296 manually configured MoE
layers and approximately 3× improvement on two real-world MoE models
based on BERT and GPT-2.
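The core idea, overlapping intra-node (ESP) and inter-node (EP) communication rather than serializing them, can be illustrated with a minimal sketch. The code below is not Parm's implementation; it only demonstrates the general overlap pattern of issuing two NCCL collectives as asynchronous operations on separate process groups so they can proceed concurrently. The function name `overlapped_dispatch`, the group setup, and the tensor shapes are assumptions made for illustration.

```python
"""Illustrative sketch (NOT Parm's actual schedule): overlap an inter-node
all-to-all (expert parallelism) with an intra-node all-gather
(expert-sharding parallelism) by issuing both as async NCCL collectives.
Launch with: torchrun --nproc_per_node=<gpus> overlap_sketch.py
"""
import os
import torch
import torch.distributed as dist


def overlapped_dispatch(tokens: torch.Tensor,
                        ep_group: dist.ProcessGroup,
                        esp_group: dist.ProcessGroup):
    """Issue the EP all-to-all and the ESP all-gather back-to-back with
    async_op=True; separate process groups use separate NCCL communicators,
    so the two transfers can overlap instead of running one after the other."""
    a2a_out = torch.empty_like(tokens)
    # Inter-node: exchange token shards among expert-parallel ranks.
    a2a_work = dist.all_to_all_single(a2a_out, tokens,
                                      group=ep_group, async_op=True)

    ag_out = torch.empty(esp_group.size() * tokens.numel(),
                         dtype=tokens.dtype, device=tokens.device)
    # Intra-node: gather activations across expert-sharding ranks.
    ag_work = dist.all_gather_into_tensor(ag_out, tokens.flatten(),
                                          group=esp_group, async_op=True)

    # Both collectives are now in flight; block only when results are needed.
    a2a_work.wait()
    ag_work.wait()
    return a2a_out, ag_out


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()
    # Toy grouping for a single-node run: all ranks form both the EP and the
    # ESP group (a real MP+EP+ESP setup would partition ranks differently).
    ep_group = dist.new_group(ranks=list(range(world)))
    esp_group = dist.new_group(ranks=list(range(world)))
    x = torch.randn(world * 4, 8, device="cuda")  # dummy token batch
    dispatched, gathered = overlapped_dispatch(x, ep_group, esp_group)
    if dist.get_rank() == 0:
        print("dispatched:", tuple(dispatched.shape),
              "gathered:", tuple(gathered.shape))
    dist.destroy_process_group()
```

Parm's actual schedules go further than this primitive: per the abstract, they also eliminate redundant computation and communication, and an analytical model decides which of the two schedules to apply in a given configuration.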