BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
CoRR (2024)
Abstract
The growing demand for Large Language Models (LLMs) across diverse
applications has prompted a paradigm shift in the design of deep learning
serving systems. Deploying LLMs, especially in multi-tenant environments,
presents considerable challenges due to their high computational and memory
demands. We present BlockLLM, a serving system that exploits the potential of
sharing components among fine-tuned LLMs to offer an efficient and flexible
solution for LLM workloads. BlockLLM partitions models into finer-grained
blocks, enabling the reuse of model components and independent provisioning to
improve computation efficiency. BlockLLM consists of an offline block zoo for
storing blocks and an online system that serves requests through chains of
blocks. It offers multi-fold flexibility: (1) Block chains are adaptively
assembled on the fly with the help of equivalence evaluation among blocks in
the zoo. (2) Batch sizes are set per block, and KV cache coordination is
handled on a best-effort basis at the individual-block level. (3) Speculative
execution and locality-aware block placement mitigate the communication costs
of dynamic block resource allocation. Our evaluation demonstrates that
BlockLLM reduces memory and storage footprints and improves computation
efficiency, outperforming existing serving approaches in 95th-percentile
latency and GPU utilization by 33.5% and 20.1%, respectively.
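To make the block-zoo and block-chain design concrete, the following Python
sketch illustrates how fine-tuned models might share base-model blocks. It is
a minimal illustration under assumptions: the names `Block`, `BlockZoo`,
`register`, and `assemble_chain`, and the fields `layers`, `variant`, and
`batch_size`, are hypothetical and invented here, and equivalence evaluation
is reduced to exact key matching; none of this is BlockLLM's actual API.

```python
# Hypothetical sketch of the block zoo / block chain idea from the abstract.
# All class, method, and field names are invented for illustration only.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """A finer-grained model partition: a span of layers plus the
    fine-tuned variant it belongs to ('base' means shared weights)."""
    layers: Tuple[int, int]   # inclusive layer range, e.g. (0, 11)
    variant: str              # 'base' or a fine-tune identifier
    batch_size: int = 8       # per-block batch size (hypothetical knob)


class BlockZoo:
    """Offline store of blocks; equivalent blocks are deduplicated so
    fine-tuned models can reuse untouched base-model partitions."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[Tuple[int, int], str], Block] = {}

    def register(self, block: Block) -> Block:
        # The paper describes a richer equivalence evaluation among
        # blocks; here it is reduced to exact key matching.
        return self._blocks.setdefault((block.layers, block.variant), block)

    def assemble_chain(
        self, spans: List[Tuple[Tuple[int, int], str]]
    ) -> List[Block]:
        """Assemble a serving chain on the fly from stored blocks."""
        return [self._blocks[key] for key in spans]


if __name__ == "__main__":
    zoo = BlockZoo()
    # Two fine-tuned models share the same base lower layers (0-11)
    # and differ only in their upper layers (12-23).
    zoo.register(Block((0, 11), "base"))
    zoo.register(Block((12, 23), "tenant-a"))
    zoo.register(Block((12, 23), "tenant-b"))

    chain_a = zoo.assemble_chain([((0, 11), "base"), ((12, 23), "tenant-a")])
    chain_b = zoo.assemble_chain([((0, 11), "base"), ((12, 23), "tenant-b")])
    # The shared base block is the same object in both chains,
    # so its weights are stored and provisioned only once.
    assert chain_a[0] is chain_b[0]
```

In this toy setup, deduplicating the base block is what reduces the memory and
storage footprints; per-block batch sizes and KV cache coordination would be
layered onto each `Block` independently at serving time.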