FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
CoRR (2023)

Abstract
Serverless computing has become increasingly popular for machine learning
inference. However, current serverless platforms lack efficient support for
GPUs, limiting their ability to deliver low-latency inference. In this paper,
we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap
employs a holistic approach to system and algorithm design. It maintains models
in main memory and dynamically swaps them onto GPUs upon request arrivals
(i.e., late binding), thereby enabling a large number of inference functions to
efficiently share a node's GPUs. FaaSwap uses various techniques, including
asynchronous API redirection, GPU runtime sharing, pipelined model execution,
and efficient GPU memory management, to achieve optimal performance. We
also develop an interference-aware request scheduling algorithm that allows
FaaSwap to meet the latency SLOs for individual inference functions. We have
implemented FaaSwap as a prototype on a leading commercial serverless platform.
Experimental evaluations demonstrate that, with model swapping, FaaSwap can
concurrently serve hundreds of functions on a single worker node with 4 V100
GPUs, while achieving inference performance comparable to native execution
(where each function runs on a dedicated GPU). When deployed on a 6-node
production testbed, FaaSwap meets the latency SLOs for over 1k functions, the
maximum that the testbed can handle concurrently.
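The late-binding idea described above — keeping models resident in host memory and swapping them onto a GPU only when a request arrives — can be illustrated with a minimal, purely conceptual sketch. This is not FaaSwap's implementation: the `SwapScheduler` class, its LRU eviction policy, and the `gpu_slots` parameter are all illustrative assumptions standing in for the paper's actual GPU memory management and scheduling algorithms.

```python
from collections import OrderedDict

class SwapScheduler:
    """Illustrative sketch of late-binding model swapping (not FaaSwap's code).

    All models stay resident in host memory; a bounded number of GPU
    slots hold the currently-loaded working set. On each request, the
    target model is swapped onto a GPU if absent, evicting the
    least-recently-used resident model to make room.
    """

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots
        self.resident = OrderedDict()  # model_id -> True, in LRU order

    def serve(self, model_id):
        """Return True if a host-to-GPU swap was needed, False on a hit."""
        if model_id in self.resident:
            self.resident.move_to_end(model_id)  # hit: refresh LRU position
            return False
        if len(self.resident) >= self.gpu_slots:
            self.resident.popitem(last=False)    # evict least-recently-used
        self.resident[model_id] = True           # host -> GPU copy happens here
        return True
```

Under this sketch, a request for a GPU-resident model runs immediately, while a miss pays one swap-in; the paper's pipelined execution and asynchronous API redirection exist precisely to hide and overlap that swap cost.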