
CrossBow: Scaling Deep Learning on Multi-GPU Servers

(2018)

Abstract
With the widespread availability of servers with 4 or more GPUs, scalability in terms of the number of GPUs in a server when training deep learning models becomes a paramount concern. Systems such as TensorFlow and MXNet train using synchronous stochastic gradient descent—an input batch is partitioned across the GPUs, each computing a partial gradient. The gradients are then combined to update the model parameters before proceeding to the next batch. For many deep learning models, this introduces a scalability challenge: to keep multiple GPUs fully utilised, the batch size must be sufficiently large, but a large batch size slows down model convergence due to the less frequent model updates, thus prolonging the time to reach a desired level of accuracy. This paper introduces CrossBow, a new single-server multi-GPU deep learning system that avoids the above trade-off. CrossBow trains multiple model replicas concurrently on each GPU, thereby avoiding under-utilisation of GPUs even when the preferred batch size is small. For this, CrossBow must (i) decide on an appropriate number of model replicas per GPU and (ii) employ an efficient and scalable synchronisation scheme within and across GPUs. CrossBow automatically tunes the number of replicas per GPU at runtime to maximise training throughput for a given batch size. We designed a novel synchronisation scheme that eliminates dependencies among model replicas, enabling high throughput and scalability. Our experiments show that CrossBow outperforms TensorFlow on a 4-GPU server by 2.5× with ResNet-32.
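The synchronous SGD scheme the abstract attributes to TensorFlow and MXNet can be illustrated with a toy sketch: the input batch is partitioned across workers, each worker computes a partial gradient, and the partial gradients are averaged before a single shared parameter update. This is a minimal illustration of that baseline (using a simple least-squares model), not CrossBow's replica-based scheme; the function name and model are hypothetical.

```python
import numpy as np

def synchronous_sgd_step(params, batch_x, batch_y, num_workers, lr=0.1):
    """One synchronous SGD step for a linear least-squares model.

    The batch is partitioned across `num_workers` (standing in for
    GPUs); each computes a partial gradient, and the gradients are
    averaged before one shared parameter update.
    """
    x_shards = np.array_split(batch_x, num_workers)
    y_shards = np.array_split(batch_y, num_workers)
    partial_grads = []
    for xs, ys in zip(x_shards, y_shards):
        # Partial gradient of the mean squared error on this shard
        err = xs @ params - ys
        partial_grads.append(xs.T @ err / len(xs))
    # Combine partial gradients (an all-reduce average), then update once
    avg_grad = np.mean(partial_grads, axis=0)
    return params - lr * avg_grad
```

Note the trade-off the paper targets: with small shards each worker does little work per step (under-utilisation), while a larger batch reduces the number of model updates per epoch and slows convergence.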
Key words
Server, Scalability, CrossBow, Throughput, Deep learning, Parallel computing, Synchronisation scheme, Computer science, Scaling, Artificial intelligence, Multi-GPU