ESEN: Efficient GPU sharing of Ensemble Neural Networks

Neurocomputing (2024)

Abstract
Ensemble neural networks are widely applied in cloud-based inference services due to their remarkable performance, and the growing demand for low-latency services has led researchers to pay more attention to the execution efficiency of these models, especially device utilization. It is highly desirable to fully utilize GPUs by multiplexing different inference tasks on the same GPU with an advanced sharing technique such as Multi-Process Service (MPS). However, we find that applying MPS to ensemble neural networks, which consist of multiple related sub-models, is far from straightforward. The critical challenge is the efficient allocation of resources within an ensemble so as to minimize job completion time. To tackle this challenge, we first examine the interplay among the individual neural networks within an ensemble and derive a guideline for achieving the shortest job completion time. We then establish a mathematical model that formalizes the resource requirements of each neural network, and introduce a search-based allocation algorithm designed to swiftly identify optimal solutions. Finally, we present ESEN, which combines the search-based resource allocation algorithm with efficient model execution mechanisms implemented in PyTorch; ESEN is further augmented with customized execution mechanisms for user-friendly deployment. Experimental results demonstrate that ESEN attains an efficiency improvement of up to 17.84% and a GPU utilization increase of 28.09% compared to the default strategy. By optimizing GPU resource allocation, ESEN significantly improves the efficiency of ensemble models, providing a low-latency, high-accuracy solution for online interactive services.
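To make the search-based allocation idea concrete, the following is a minimal illustrative sketch only, not the authors' algorithm: it exhaustively enumerates MPS-style percentage splits of GPU resources between two hypothetical sub-models, under an assumed cost model in which a sub-model's latency scales as its workload divided by its resource share, and the ensemble's completion time is the latency of its slowest member.

```python
# Illustrative sketch: exhaustive search over MPS-style resource splits for
# two sub-models of an ensemble. The latency model (latency ~ work / share)
# is a stand-in assumption, NOT the cost model from the paper.

def ensemble_completion_time(shares, work):
    # The ensemble finishes only when its slowest sub-model finishes.
    return max(w / s for w, s in zip(work, shares))

def search_best_split(work, step=5):
    # Enumerate splits (s, 100 - s) of the GPU thread percentage
    # in increments of `step`, keeping the split with the lowest
    # ensemble completion time.
    best_split, best_time = None, float("inf")
    for s in range(step, 100, step):
        shares = (s, 100 - s)
        t = ensemble_completion_time(shares, work)
        if t < best_time:
            best_split, best_time = shares, t
    return best_split, best_time

# Two sub-models with unequal workloads (arbitrary units).
split, t = search_best_split(work=(30, 70))
print(split, t)  # → (30, 70) 1.0
```

Under this toy cost model the optimal split is simply proportional to each sub-model's workload; the point of a guided search, as the abstract suggests, is that real sub-models do not scale linearly with their resource share, so the best split must be found rather than computed in closed form.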
Keywords
GPU sharing, Ensemble Neural Network, MPS, Inference services, Resource allocation