SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
CoRR (2023)
Abstract
The increasing deployment of ML models on the critical path of production
applications, both in datacenters and at the edge, requires ML inference
serving systems to serve these models under unpredictable and bursty request
arrival rates. Serving models under such conditions requires these systems to
strike a careful balance between the latency and accuracy requirements of the
application and the efficient utilization of scarce resources.
State-of-the-art systems resolve this tension by either choosing a static point
in the latency-accuracy tradeoff space to serve all requests or load specific
models on the critical path of request serving. In this work, we instead
resolve this tension by simultaneously serving the entire range of models
spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct,
achieves this by carefully inserting specialized operators in weight-shared
SuperNetworks. These operators enable SubNetAct to dynamically route requests
through the network to meet a latency and accuracy target. SubNetAct requires
up to 2.6x lower memory to serve a vastly higher number of models than prior
state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of
models unlocks the design space of fine-grained, reactive scheduling policies.
We explore the design of one such extremely effective policy, SlackFit, and
instantiate both SubNetAct and SlackFit in a real system, SuperServe.
SuperServe achieves 4.67x higher SLO attainment for the same accuracy on a
trace derived from the real-world Microsoft Azure Functions workload and
automatically yields the best trade-offs on a wide range of extremely bursty
synthetic traces.
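The core idea of routing each request through a subnetwork that fits its latency budget can be illustrated with a small sketch. This is a hypothetical simplification, not the paper's actual SubNetAct or SlackFit implementation: the subnetwork profile numbers and the `pick_subnet` helper are invented for illustration, assuming each subnetwork of the weight-shared SuperNetwork has been profiled for latency and accuracy offline.

```python
# Hypothetical sketch (not the paper's implementation): pick the most
# accurate subnetwork of a weight-shared SuperNetwork whose profiled
# latency still fits within a request's remaining latency slack.

# Illustrative profile only; real systems would measure these offline.
SUBNET_PROFILE = [
    # (latency_ms, top1_accuracy)
    (5.0, 0.70),   # smallest subnetwork: fastest, least accurate
    (12.0, 0.75),
    (25.0, 0.78),
    (60.0, 0.80),  # full SuperNetwork: slowest, most accurate
]

def pick_subnet(slack_ms):
    """Return the (latency, accuracy) pair of the most accurate
    subnetwork that meets the request's slack, falling back to the
    fastest subnetwork if even it would miss the deadline."""
    feasible = [p for p in SUBNET_PROFILE if p[0] <= slack_ms]
    if not feasible:
        return min(SUBNET_PROFILE, key=lambda p: p[0])  # best effort
    return max(feasible, key=lambda p: p[1])

# A tight deadline routes to a small subnetwork; a loose deadline
# can afford the full, most accurate network.
assert pick_subnet(10.0) == (5.0, 0.70)
assert pick_subnet(100.0) == (60.0, 0.80)
```

Because switching subnetworks here is just an index lookup rather than loading a new model, a scheduler can make this choice per request, which is what makes fine-grained reactive policies like SlackFit feasible.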