MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

CoRR(2024)

引用 0|浏览8
暂无评分
摘要
Large language models (LLMs) have demon- strated remarkable performance, and organiza- tions are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind is to colocate LLMs considering their popularity to multiplex memory resources, and leverage the characteristics of prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally for- mulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal coloca- tions and maximize utilization. MuxServe de- signs a unified resource manager to enable flexi- ble and efficient multiplexing. Evaluation results show that MuxServe can achieves up to 1.8× higher throughput or processes 2.9× more requests within 99% SLO attainment.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要