Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Weiguang Pang, Xiantong Luo, Kailun Chen, Dong Ji, Lei Qiao, Wang Yi

Journal of Systems Architecture (2023)

Abstract
Deep Neural Networks (DNNs) are widely used in Cyber–Physical Systems (CPS), which often involve multiple DNN tasks with varying real-time requirements. These tasks need to be deployed on a single embedded hardware platform with limited resources, such as an embedded GPU. Efficiently sharing the same embedded GPU among multiple real-time DNN tasks is a complex challenge. While existing DNN frameworks (e.g., PyTorch and TensorFlow) focus on maximizing average performance and throughput on the GPU, they lack scheduling mechanisms that account for multiple DNNs with different timing requirements. In this paper, we address this challenge by thoroughly examining and summarizing the scheduling rules for multiple kernels with different priorities in CUDA streams. Based on these rules, we design a framework that supports multi-DNN real-time inference and propose a method for allocating CUDA streams to DNN kernels to meet schedulability requirements while maximizing GPU resource utilization. Our proposed approach is implemented on an NVIDIA Jetson AGX Xavier embedded GPU system and validated using several popular DNNs. The results show that our approach achieves shorter response times than several state-of-the-art methods.
Keywords
DNN, Real-time scheduling, GPU, CUDA stream priority
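
For readers unfamiliar with CUDA stream priorities, the following is a minimal sketch of the underlying runtime API, not the framework or allocation method proposed in the paper. It queries the device's priority range, creates one high-priority and one low-priority stream, and launches a kernel on each. The kernel `scaleKernel`, the helper `clampPriority`, and the buffer sizes are hypothetical names and values chosen for illustration. In CUDA, numerically lower values mean higher priority, and priorities are only a scheduling hint: they do not preempt thread blocks that are already running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the failing call and exit main with a non-zero status.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical stand-in for one DNN layer kernel.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: clamp a requested priority into the device range.
// Lower numbers mean higher priority, so greatest <= least.
int clampPriority(int requested, int least, int greatest) {
    if (requested < greatest) return greatest;
    if (requested > least) return least;
    return requested;
}

int main() {
    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));
    std::printf("priority range: least=%d greatest=%d\n", least, greatest);

    // One stream for a time-critical task, one for a best-effort task.
    cudaStream_t urgent, background;
    CHECK(cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking,
                                       clampPriority(greatest, least, greatest)));
    CHECK(cudaStreamCreateWithPriority(&background, cudaStreamNonBlocking,
                                       clampPriority(least, least, greatest)));

    const int n = 1 << 20;  // illustrative buffer size
    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemsetAsync(a, 0, n * sizeof(float), urgent));
    CHECK(cudaMemsetAsync(b, 0, n * sizeof(float), background));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    // Launch the best-effort work first; pending blocks from the
    // high-priority stream may still be scheduled ahead of it.
    scaleKernel<<<blocks, threads, 0, background>>>(b, n, 2.0f);
    scaleKernel<<<blocks, threads, 0, urgent>>>(a, n, 2.0f);
    CHECK(cudaGetLastError());

    CHECK(cudaStreamSynchronize(urgent));
    CHECK(cudaStreamSynchronize(background));

    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    CHECK(cudaStreamDestroy(urgent));
    CHECK(cudaStreamDestroy(background));
    return 0;
}
```

The number of distinct priority levels is device-dependent, which is why the sketch queries the range at run time rather than hard-coding values.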