BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing

IEEE Transactions on Cloud Computing (2024)

Abstract
Deep learning (DL) has been deployed on billions of mobile devices thanks to its remarkable performance in image, text, and audio processing. However, because mobile devices have limited computing capability, a large number of DL inference tasks must be offloaded to edge or cloud servers, and even powerful GPU servers struggle to guarantee quality of service (QoS). To better exploit the highly parallel architecture of GPUs and improve QoS, we propose BatOpt, a framework that uses dynamic batch processing to balance service latency against GPU memory usage in DL inference services. Specifically, BatOpt models the DL inference service as an M/G(a,b)/1/N queue that accounts for stochastic task arrivals, which enables it to predict service latency accurately in different system states. Furthermore, by analyzing the queueing model, we propose an optimization algorithm that trades off service latency and GPU memory usage in different system states. We have implemented BatOpt on PyTorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt improves service latency by up to 31x and 4.3x compared to single-input and fixed-batch-size strategies, respectively. Moreover, at the same service latency, BatOpt's maximum GPU memory usage is only 0.3x that of the greedy-dynamic-batch-size strategy.
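To make the M/G(a,b)/1/N notation concrete, the following is a minimal event-driven simulation of a single-server batch-service queue: Poisson arrivals, a batch starts only once at least a tasks are waiting, at most b tasks are served together, and arrivals are dropped when the buffer is full. It is an illustrative sketch, not BatOpt's analytical model or its optimization algorithm. The function name `simulate_batch_queue`, the reading of N as the waiting-buffer size, and the affine batch service time used in the usage note are all assumptions made for the example.

```python
import random
from collections import deque


def simulate_batch_queue(lam, a, b, capacity, service_time, num_arrivals, seed=0):
    """Simulate a single-server batch-service queue with a finite buffer.

    lam          -- Poisson arrival rate (tasks per second)
    a, b         -- minimum and maximum batch size
    capacity     -- max number of WAITING tasks (assumed meaning of N)
    service_time -- function k -> time to serve a batch of k tasks (assumed)
    """
    rng = random.Random(seed)

    # Pre-generate Poisson arrival times (exponential inter-arrival gaps).
    t, arrivals = 0.0, []
    for _ in range(num_arrivals):
        t += rng.expovariate(lam)
        arrivals.append(t)

    queue = deque()            # arrival times of waiting tasks
    busy = False
    server_free_at = 0.0
    latencies, batch_sizes = [], []
    dropped = 0
    i = 0

    while i < num_arrivals or busy:
        next_arrival = arrivals[i] if i < num_arrivals else float("inf")
        next_departure = server_free_at if busy else float("inf")

        if next_departure <= next_arrival:
            now = next_departure       # current batch finishes
            busy = False
        else:
            now = next_arrival         # a new task arrives
            i += 1
            if len(queue) >= capacity:
                dropped += 1           # buffer full: task is rejected
            else:
                queue.append(now)

        # Start a new batch if the server is idle and enough tasks wait.
        if not busy and len(queue) >= a:
            k = min(b, len(queue))
            batch = [queue.popleft() for _ in range(k)]
            finish = now + service_time(k)
            latencies.extend(finish - arrived for arrived in batch)
            batch_sizes.append(k)
            server_free_at = finish
            busy = True

    served = len(latencies)
    return {
        "served": served,
        "dropped": dropped,
        # Tasks still waiting at the end (possible only when a > 1).
        "leftover": len(queue),
        "mean_latency": sum(latencies) / served if served else 0.0,
        "mean_batch": sum(batch_sizes) / len(batch_sizes) if batch_sizes else 0.0,
    }
```

One way to use the sketch is to sweep the maximum batch size b under a hypothetical service-time curve such as `lambda k: 0.01 + 0.002 * k` and compare the resulting `mean_latency` values. The real latency and memory curves depend on the DL model and GPU and would have to be measured.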
Key words
Graphics processing units, Task analysis, Servers, Throughput, Computational modeling, Mobile handsets, Stochastic processes, Deep learning, GPU servers, batch processing, stochastic process