Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

2019 IEEE 37th International Conference on Computer Design (ICCD)(2019)

Cited by 23 | Views 33
Abstract
GPUs have been widely adopted to serve online deep learning-based services that have stringent QoS requirements. However, emerging deep learning serving systems often suffer long latency and low throughput on inference requests, which damages user experience and increases the number of GPUs required to host an online service. Our investigation shows that inefficient batching and the lack of data transfer-computation overlap are the root causes of the long latency and low throughput. To this end, we propose Ebird, a deep learning serving system comprising a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates unnecessary waiting during batching and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving GPU resource utilization. The batch scheduler organizes inference requests elastically. Our experimental results on an Nvidia Titan RTX GPU show that, compared with TensorFlow Serving, Ebird reduces inference response latency by up to 70.9% and improves throughput by up to 49.3% while guaranteeing the QoS target.
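The elastic batching idea can be illustrated with a minimal sketch. This is not the paper's implementation; the class and method names below are hypothetical. The key behavior is that the scheduler dispatches whatever requests are currently pending, up to a maximum batch size, instead of stalling until a fixed batch size is reached:

```python
from collections import deque

class ElasticBatchScheduler:
    """Hypothetical sketch of elastic batching: batches are formed
    from pending requests immediately, bounded only by a maximum
    batch size, rather than waiting for the queue to fill."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue = deque()

    def submit(self, request):
        self.queue.append(request)

    def next_batch(self):
        # Drain up to max_batch_size pending requests without waiting
        # for more to arrive, so early requests are not delayed.
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.popleft())
        return batch

scheduler = ElasticBatchScheduler(max_batch_size=8)
for i in range(5):
    scheduler.submit(f"req-{i}")
# Only 5 requests are pending; an elastic scheduler dispatches them
# now instead of stalling until 3 more arrive.
print(scheduler.next_batch())
```

A fixed-size batcher would hold these five requests until three more arrived, which is exactly the batching delay the abstract identifies as a root cause of long latency.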
Keywords
GPUs, DL Serving, Latency, Throughput, Responsiveness