CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
CoRR (2024)
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have focused
primarily on scaling up text-image pair data and enlarging the underlying LLMs to
improve performance on multimodal tasks. These scaling approaches, however, are
computationally expensive and overlook the importance of improving model
capabilities on the vision side. Inspired by the successful application of
Mixture-of-Experts (MoE) to LLMs, which improves scalability during training
while keeping inference costs close to those of smaller models, we propose
CuMo. CuMo incorporates co-upcycled Top-K sparsely-gated Mixture-of-Experts
blocks into both the vision encoder and the MLP connector, enhancing the
multimodal LLM with minimal additional activated parameters at inference.
CuMo first pre-trains the MLP blocks and then, during the visual instruction
tuning stage, initializes each expert in an MoE block from the corresponding
pre-trained MLP block. Auxiliary losses are used to keep the load balanced
across experts. Trained exclusively on open-source datasets, CuMo outperforms
state-of-the-art multimodal LLMs within each model size group across various
VQA and visual-instruction-following benchmarks. The code and model weights
for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
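
To make the co-upcycling recipe concrete, below is a minimal PyTorch sketch of the idea the abstract describes: every expert in a Top-K sparsely-gated MoE block is initialized as a copy of a pre-trained dense MLP, and a Switch-Transformer-style auxiliary loss encourages balanced expert load. All class and parameter names here (`MLP`, `SparseMoEBlock`, `num_experts`, `top_k`) are illustrative assumptions, not CuMo's actual implementation.

```python
# A minimal, self-contained PyTorch sketch of co-upcycling: every expert in a
# Top-K sparsely-gated MoE block starts as a copy of a pre-trained dense MLP,
# and a Switch-Transformer-style auxiliary loss keeps expert load balanced.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Dense two-layer MLP, standing in for a pre-trained block in the
    vision encoder or the vision-language connector."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class SparseMoEBlock(nn.Module):
    """Top-K sparsely-gated MoE whose experts are all upcycled (deep-copied)
    from one pre-trained dense MLP."""

    def __init__(self, pretrained_mlp: MLP, dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Co-upcycling: each expert is initialized from the pre-trained MLP.
        self.experts = nn.ModuleList(
            copy.deepcopy(pretrained_mlp) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim). The router scores every token against each expert.
        probs = self.router(x).softmax(dim=-1)              # (T, E)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)     # (T, K) each
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize

        # Sparse dispatch: only the selected experts process each token.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = topk_i == e                               # (T, K)
            rows = mask.any(dim=-1)
            if rows.any():
                w = (topk_p * mask).sum(dim=-1, keepdim=True)[rows]
                out[rows] = out[rows] + w * expert(x[rows])

        # Auxiliary load-balancing loss: fraction of tokens routed to each
        # expert times that expert's mean router probability, scaled by the
        # number of experts (a perfectly uniform router gives loss == top_k).
        frac = F.one_hot(topk_i, probs.size(-1)).float().sum(dim=1).mean(dim=0)
        aux_loss = probs.size(-1) * (frac * probs.mean(dim=0)).sum()
        return out, aux_loss


if __name__ == "__main__":
    dense = MLP(dim=64, hidden=256)       # stands in for a pre-trained MLP
    moe = SparseMoEBlock(dense, dim=64)   # upcycle it into a 4-expert block
    y, aux = moe(torch.randn(10, 64))
    print(y.shape, aux.item())
```

One property of this sketch worth noting: because the Top-K gate weights are renormalized and all experts start identical, the upcycled block reproduces the dense MLP's output exactly at initialization, so nothing pre-trained is discarded; the router then specializes the experts during visual instruction tuning.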