InferLine: Prediction Pipeline Provisioning and Management for Tight Latency Objectives

semanticscholar (2019)

Abstract
Serving prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process. In this paper we introduce InferLine, a system which provisions and executes ML prediction pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configurations in a fine-grained fashion. InferLine leverages automated offline profiling and performance characterization to construct a cost-minimizing initial configuration and then introduces a reactive planner to adjust the configuration in response to changes in the query arrival process. We demonstrate that InferLine outperforms existing approaches by up to 7.6x in cost while achieving up to 34.5x lower latency SLO miss rate on realistic workloads and generalizes across state-of-the-art model serving frameworks.
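The abstract describes a proactive planner that uses offline per-model profiles to pick a cost-minimizing (hardware, batch size) configuration per stage subject to an end-to-end latency SLO. The sketch below illustrates that idea with a simple greedy heuristic; it is not InferLine's actual algorithm, and the `Config`/`plan` names and all profile numbers are hypothetical, made up for illustration.

```python
"""Hypothetical sketch of an SLO-aware, cost-minimizing pipeline planner.

NOT InferLine's actual planner; it only illustrates the abstract's idea:
choose a per-model (hardware, batch size) configuration that minimizes
cost while keeping end-to-end latency under the SLO. All numbers are
made up for illustration.
"""
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    hardware: str        # e.g. "cpu" or "gpu" (hypothetical labels)
    batch_size: int
    latency_ms: float    # profiled per-batch latency at this config
    cost_per_hr: float   # hourly price of the accelerator
    qps: float           # profiled sustainable throughput

# Offline profiles (hypothetical): model -> feasible configurations.
PROFILES = {
    "preprocess": [Config("cpu", 1, 5.0, 0.10, 200),
                   Config("cpu", 8, 12.0, 0.10, 600)],
    "resnet":     [Config("cpu", 1, 90.0, 0.10, 12),
                   Config("gpu", 4, 25.0, 0.90, 160),
                   Config("gpu", 16, 40.0, 0.90, 400)],
    "classifier": [Config("cpu", 1, 8.0, 0.10, 120),
                   Config("gpu", 8, 6.0, 0.90, 800)],
}

def plan(pipeline, slo_ms, arrival_qps):
    """Greedy heuristic: start from the cheapest throughput-feasible
    config per model, then repeatedly upgrade the model offering the
    best latency-saved-per-extra-dollar ratio until the end-to-end
    latency fits the SLO (or no upgrade helps)."""
    feasible = {m: sorted((c for c in PROFILES[m] if c.qps >= arrival_qps),
                          key=lambda c: c.cost_per_hr)
                for m in pipeline}
    if any(not cfgs for cfgs in feasible.values()):
        return None  # some model cannot sustain the arrival rate at all
    choice = {m: cfgs[0] for m, cfgs in feasible.items()}

    def e2e_latency():
        return sum(choice[m].latency_ms for m in pipeline)

    while e2e_latency() > slo_ms:
        best = None  # (latency saved per extra $/hr, model, new config)
        for m in pipeline:
            cur = choice[m]
            for c in feasible[m]:
                saved = cur.latency_ms - c.latency_ms
                extra = c.cost_per_hr - cur.cost_per_hr
                if saved > 0:
                    ratio = saved / max(extra, 1e-9)
                    if best is None or ratio > best[0]:
                        best = (ratio, m, c)
        if best is None:
            return None  # no upgrade can shrink latency: SLO infeasible
        choice[best[1]] = best[2]
    return choice

if __name__ == "__main__":
    # Example: 50 ms end-to-end SLO at 100 queries/second.
    print(plan(["preprocess", "resnet", "classifier"],
               slo_ms=50.0, arrival_qps=100.0))
```

The paper's reactive component could, in this framing, simply re-invoke the planner with an updated `arrival_qps` estimate when the observed query arrival process shifts; the actual system's fine-grained reactive controller is more sophisticated than this sketch.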
Keywords
Hardware acceleration, Latency (engineering), Network calculus, Discrete event simulation, Combinatorial search, Provisioning, Distributed computing, Profiling (computer programming), Pipeline transport, Computer science