MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
CoRR (2024)
Abstract
This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE)
serving system that realizes activation-aware expert offloading. MoE-Infinity
features sequence-level expert activation tracing, a new approach adept at
identifying sparse activations and capturing the temporal locality of MoE
inference. By analyzing these traces, MoE-Infinity performs novel
activation-aware expert prefetching and caching, substantially reducing the
latency overheads usually associated with offloading experts for improved cost
performance. Extensive experiments in a cluster show that MoE-Infinity
outperforms numerous existing systems and approaches, reducing latency by
4-20X and decreasing deployment costs by over 8X for various MoEs. MoE-Infinity's
source code is publicly available at https://github.com/TorchMoE/MoE-Infinity