Stateful Large Language Model Serving with Pensieve
CoRR (2023)
Abstract
Large Language Models (LLMs) have recently experienced great success, as
evidenced by the widespread popularity of ChatGPT. Existing LLM serving systems
are stateless across requests. Consequently, when LLMs are used in the common
setting of multi-turn conversations, a growing log of the conversation history
must be processed alongside any request by the serving system at each turn,
resulting in repeated history processing. In this paper, we design $Pensieve$,
a system optimized for multi-turn conversation LLM serving. $Pensieve$
maintains the conversation state across requests by caching previously
processed history to avoid duplicate processing. $Pensieve$'s multi-tier
caching strategy can utilize both GPU and CPU memory to efficiently store and
retrieve cached data. $Pensieve$ also generalizes the recent PagedAttention
kernel to support attention between multiple input tokens with a GPU cache
spread over non-contiguous memory. Our evaluation shows that $Pensieve$ achieves
1.51-1.95x the throughput of vLLM and reduces latency by 60-75%.
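
The caching idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pensieve's implementation: the class name `TwoTierHistoryCache`, the token-count capacities, and the least-recently-used eviction policy are all hypothetical choices made for illustration, and plain integers stand in for the cached per-token state that a real system would hold in GPU or CPU memory.

```python
from collections import OrderedDict


class TwoTierHistoryCache:
    """Toy per-conversation cache over a fast ("GPU") and a slow ("CPU") tier.

    Each entry maps a conversation id to the number of history tokens whose
    state is cached. Capacities and the LRU policy are illustrative
    assumptions, not details taken from the paper.
    """

    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu_capacity = gpu_capacity  # max cached tokens in the fast tier
        self.cpu_capacity = cpu_capacity  # max cached tokens in the slow tier
        self.gpu = OrderedDict()          # oldest entry first
        self.cpu = OrderedDict()

    def serve_turn(self, conv_id, total_tokens):
        """Return how many tokens must be processed for this turn.

        total_tokens is the length of the whole conversation so far,
        including the new request.
        """
        cached = self.gpu.pop(conv_id, None)
        if cached is None:
            # Fall back to the slow tier; 0 means nothing is cached.
            cached = self.cpu.pop(conv_id, 0)
        to_process = total_tokens - cached
        # After the turn, the full history is cached in the fast tier.
        self.gpu[conv_id] = total_tokens
        self._evict()
        return to_process

    def _evict(self):
        # Demote least-recently-used conversations from GPU to CPU.
        while sum(self.gpu.values()) > self.gpu_capacity and len(self.gpu) > 1:
            victim, tokens = self.gpu.popitem(last=False)
            self.cpu[victim] = tokens
        # Drop least-recently-used conversations from CPU entirely.
        while sum(self.cpu.values()) > self.cpu_capacity and self.cpu:
            self.cpu.popitem(last=False)


cache = TwoTierHistoryCache(gpu_capacity=100, cpu_capacity=100)
first = cache.serve_turn("a", 40)   # nothing cached yet
second = cache.serve_turn("a", 60)  # only the 20 new tokens
```

In this sketch a stateless server would process all 60 tokens on the second turn, whereas the cached variant processes only the tokens added since the previous turn; a conversation evicted from both tiers falls back to full reprocessing.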