MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
CoRR (2024)
Abstract
Large language models (LLMs) have demonstrated remarkable performance, and
organizations are racing to serve LLMs of varying sizes as endpoints for
use-cases like chat, programming and search. However, efficiently serving
multiple LLMs poses significant challenges for existing approaches due to
the varying popularity of LLMs. In this paper, we present MuxServe, a flexible
spatial-temporal multiplexing system for efficient multiple LLM serving. The
key insight is to colocate LLMs according to their popularity to
multiplex memory resources, and leverage the characteristics of prefill and
decoding phases to separate and flexibly colocate them to multiplex computation
resources. MuxServe formally formulates the multiplexing problem and
proposes a novel placement algorithm and adaptive batch scheduling strategy to
identify optimal colocations and maximize utilization. MuxServe designs a
unified resource manager to enable flexible and efficient multiplexing.
Evaluation results show that MuxServe can achieve up to 1.8× higher
throughput or process 2.9× more requests within 99% SLO attainment.