BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing

IEEE TRANSACTIONS ON CLOUD COMPUTING (2024)

Abstract
Deep learning (DL) has been deployed on billions of mobile devices thanks to its remarkable performance in image, text, and audio processing. However, because mobile devices have limited computing capability, a large number of DL inference tasks must be offloaded to edge or cloud servers, leaving even powerful GPU servers struggling to guarantee quality of service (QoS). To better exploit the highly parallel computing architecture of GPUs and improve QoS, we propose BatOpt, a framework that uses dynamic batch processing to strike a good balance between service latency and GPU memory usage in DL inference services. Specifically, BatOpt models the DL inference service as an M/G(a,b)/1/N queue that accounts for stochastic task arrivals, which enables it to predict service latency accurately across different system states. Furthermore, by analyzing the queueing model, we propose an optimization algorithm that trades off service latency against GPU memory usage in each system state. We have implemented BatOpt in PyTorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt delivers up to a 31x and 4.3x improvement in service latency over the single-input and fixed-batch-size strategies, respectively, and its maximum GPU memory usage is only 0.3x that of the greedy-dynamic-batch-size strategy at the same service latency.
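In the M/G(a,b)/1/N notation, arrivals are Poisson (M), service times are general (G), a single server processes tasks in bulk batches of between a and b tasks, and the waiting room holds at most N tasks. The abstract only sketches the mechanism, so the following is a minimal PyTorch illustration of the dynamic-batching idea, not BatOpt's actual policy: a single worker thread drains a request queue and runs whatever has accumulated, up to a cap, through the model as one batch. All names here (InferenceServer, MAX_BATCH, submit) are hypothetical, and the fixed cap stands in for the batch-size bounds (a, b) that BatOpt would choose adaptively from its queueing analysis.

import threading
import queue
import torch

MAX_BATCH = 32  # hypothetical fixed cap; BatOpt adapts this per system state

class InferenceServer:
    def __init__(self, model, device="cuda", capacity=256):
        self.model = model.eval().to(device)
        self.device = device
        # Bounded queue plays the role of the finite waiting room N.
        self.requests = queue.Queue(maxsize=capacity)

    def submit(self, x):
        """Enqueue one input tensor; caller waits on the returned slot's event."""
        slot = {"input": x, "event": threading.Event(), "output": None}
        self.requests.put(slot)
        return slot

    @torch.no_grad()
    def serve_forever(self):
        while True:
            # Block for the first request, then drain up to MAX_BATCH - 1 more.
            batch = [self.requests.get()]
            while len(batch) < MAX_BATCH:
                try:
                    batch.append(self.requests.get_nowait())
                except queue.Empty:
                    break
            # Assumes all inputs share the same shape so they can be stacked.
            inputs = torch.stack([s["input"] for s in batch]).to(self.device)
            outputs = self.model(inputs).cpu()
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["event"].set()

A caller would run serve_forever in a background thread, then for each request do slot = server.submit(x), slot["event"].wait(), and read slot["output"]. The greedy drain shown here maximizes batch size whenever work is queued; the paper's contribution is choosing the batch size to balance latency against GPU memory instead.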
Keywords
Graphics processing units, Task analysis, Servers, Throughput, Computational modeling, Mobile handsets, Stochastic processes, Deep learning, GPU servers, batch processing, stochastic process