
BaPipe: Balanced Pipeline Parallelism for DNN Training

Parallel Processing Letters (2022)

Abstract
The size of deep neural networks (DNNs) grows rapidly as the complexity of machine learning algorithms increases. Distributed deep learning based on model parallelism has been widely used to satisfy the computation and memory requirements of DNN training. In this paper, we propose a training framework for pipeline parallelism called BaPipe (Balanced Pipeline) that automatically explores pipeline scheduling methods and balanced partition strategies for DNN training on heterogeneous accelerator clusters. In BaPipe, each accelerator computes forward and backward propagation for its assigned partition of the network, implementing an intra-batch pipeline parallelism strategy. By considering the parameters of DNN models as well as the computation, memory, and communication resources of each accelerator, BaPipe automatically selects the most suitable method of pipeline scheduling from among multiple proposed scheduling modes. It also uses a novel strategy to automatically explore load balancing across inter-layer partition, intra-layer partition, and coarse-grained partition. We trained DNNs such as VGG-16, ResNet-50, and Google's Neural Machine Translation (GNMT) model on GPU clusters, and simulated training performance on FPGA clusters. Compared with state-of-the-art frameworks for data parallelism (DP) and pipeline parallelism, BaPipe provides a 3.2x speedup and a 4x memory reduction on various homogeneous and heterogeneous platforms.
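As a rough illustration of the load-balancing idea described in the abstract, the Python sketch below splits a chain of layers into contiguous pipeline stages sized in proportion to each accelerator's relative speed. The per-layer costs, the `accel_speeds` parameter, and the greedy heuristic are assumptions made for this example only; they do not reproduce BaPipe's actual partitioning algorithm.

```python
# Illustrative sketch only: a greedy balancer that splits a chain of DNN layers
# into contiguous pipeline stages so that per-stage compute cost is roughly
# proportional to each accelerator's speed. Not BaPipe's actual algorithm.

from typing import List, Tuple


def balance_stages(layer_costs: List[float],
                   accel_speeds: List[float]) -> List[Tuple[int, int]]:
    """Assign a contiguous layer range [start, end) to each accelerator.

    Faster accelerators receive a larger share of the total layer cost,
    so all pipeline stages finish in roughly the same time.
    """
    total_cost = sum(layer_costs)
    total_speed = sum(accel_speeds)

    stages = []
    start = 0
    for k, speed in enumerate(accel_speeds):
        # Target share of total cost for this accelerator.
        target = total_cost * speed / total_speed
        end = start
        stage_cost = 0.0
        # Greedily take layers until the stage reaches its target cost,
        # leaving at least one layer for every remaining accelerator.
        remaining = len(accel_speeds) - k - 1
        while end < len(layer_costs) - remaining and stage_cost < target:
            stage_cost += layer_costs[end]
            end += 1
        stages.append((start, end))
        start = end
    # Any leftover layers go to the last stage.
    if start < len(layer_costs):
        s, _ = stages[-1]
        stages[-1] = (s, len(layer_costs))
    return stages


if __name__ == "__main__":
    # Hypothetical per-layer forward+backward costs (e.g., milliseconds).
    costs = [4.0, 6.0, 6.0, 8.0, 3.0, 5.0, 9.0, 2.0]
    # Two fast GPUs and one slower FPGA, expressed as relative speeds.
    speeds = [1.0, 1.0, 0.5]
    for k, (s, e) in enumerate(balance_stages(costs, speeds)):
        print(f"accelerator {k}: layers {s}..{e - 1}, "
              f"cost {sum(costs[s:e]):.1f}")
```

In practice, a framework like BaPipe would also weigh memory capacity and inter-stage communication, not just compute cost as in this simplified greedy split.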
Key words
DNN training, pipeline parallelism, load balancing, parallel and distributed systems