Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
arXiv (2024)
Abstract
Each LLM serving request goes through two phases. The first is prefill, which
processes the entire input prompt to produce the first output token, and the second is
decode, which generates the rest of the output tokens one at a time. Prefill
iterations have high latency but saturate GPU compute due to parallel
processing of the input prompt. In contrast, decode iterations have low latency
but also low compute utilization because a decode iteration processes only a
single token per request. This makes batching highly effective for decodes and
consequently for overall throughput. However, batching multiple requests leads
to an interleaving of prefill and decode iterations which makes it challenging
to achieve both high throughput and low latency.
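The two phases described above can be illustrated with a short sketch. This is not code from the paper; `ToyModel`, `forward`, and the token arithmetic are placeholders assumed only so the example runs, standing in for a real transformer forward pass and KV cache.

```python
from typing import List, Optional, Tuple

class ToyModel:
    """Toy stand-in so the sketch runs end to end; a real serving engine
    would run a transformer forward pass and keep a per-request KV cache."""

    def forward(self, tokens: List[int], kv_cache: Optional[List[int]]) -> Tuple[int, List[int]]:
        cache = (kv_cache or []) + tokens      # the "KV cache": everything seen so far
        next_token = sum(cache) % 100          # dummy stand-in for argmax over logits
        return next_token, cache

def generate(model: ToyModel, prompt: List[int], max_new_tokens: int) -> List[int]:
    # Prefill: the whole prompt is processed in one pass. This saturates GPU
    # compute because many prompt tokens are handled in parallel; it yields
    # the first output token.
    next_token, kv_cache = model.forward(prompt, kv_cache=None)
    output = [next_token]

    # Decode: the remaining tokens are produced one at a time. Each iteration
    # handles a single token per request, so per-request compute utilization
    # is low, which is why batching many decodes together is so effective.
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        output.append(next_token)
    return output

print(generate(ToyModel(), [1, 2, 3], max_new_tokens=4))
```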
We introduce Sarathi-Serve, an efficient LLM inference scheduler inspired by
the techniques we originally proposed for optimizing throughput in Sarathi.
Sarathi-Serve leverages chunked-prefills from Sarathi to create stall-free
schedules that can add new requests in a batch without pausing ongoing decodes.
Stall-free scheduling unlocks the opportunity to improve throughput with large
batch sizes while minimizing the effect of batching on latency. Our evaluation
shows that Sarathi-Serve improves serving throughput within desired latency
SLOs of Mistral-7B by up to 2.6x on a single A100 GPU and up to 6.9x for
Falcon-180B on 8 A100 GPUs over Orca and vLLM.
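The scheduling idea in the abstract can be sketched as follows. This is a hypothetical illustration, not Sarathi-Serve's actual implementation: `build_stall_free_batch`, `TOKEN_BUDGET`, and the request fields are assumed names, and the budget value is arbitrary. Each iteration first admits every ongoing decode (one token each), then fills the leftover token budget with a chunk of a pending prefill, so new requests enter the batch without stalling running decodes.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

TOKEN_BUDGET = 512  # assumed per-iteration token budget; the real knob and value differ

def build_stall_free_batch(decode_reqs: List[Dict],
                           prefill_queue: Deque[Dict]) -> List[Tuple[Dict, int]]:
    """Return (request, num_tokens) pairs for one iteration: all ongoing decodes
    first (one token each), then prefill chunks sized to the remaining budget."""
    batch: List[Tuple[Dict, int]] = [(req, 1) for req in decode_reqs]
    budget = TOKEN_BUDGET - len(decode_reqs)
    while budget > 0 and prefill_queue:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining_prompt_tokens"])
        batch.append((req, chunk))               # prefill chunk fills the leftover budget
        req["remaining_prompt_tokens"] -= chunk
        budget -= chunk
        if req["remaining_prompt_tokens"] == 0:
            prefill_queue.popleft()               # fully prefilled; joins the decodes next iteration
    return batch

# Example: 3 ongoing decodes plus two queued prompts of 600 and 200 tokens.
decodes = [{"id": i} for i in range(3)]
prefills = deque([{"id": "p0", "remaining_prompt_tokens": 600},
                  {"id": "p1", "remaining_prompt_tokens": 200}])
print(build_stall_free_batch(decodes, prefills))  # first prompt is chunked to fit the budget
```

Capping each iteration at a fixed token budget is what keeps decode latency bounded: a long prompt never occupies a whole iteration by itself, so ongoing decodes are never paused while it prefills.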