Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
CoRR (2024)
Abstract
Large language models equipped with retrieval-augmented generation (RAG)
represent a burgeoning field aimed at enhancing answering capabilities by
leveraging external knowledge bases. Although the application of RAG with
language-only models has been extensively explored, its adaptation into
multimodal vision-language models remains nascent. Going beyond mere answer
generation, the primary goal of multimodal RAG is to cultivate the models'
ability to reason in response to relevant queries. To this end, we introduce a
novel multimodal RAG framework named RMR (Retrieval Meets Reasoning). The RMR
framework employs a bi-modal retrieval module to identify the most relevant
question-answer pairs, which then serve as scaffolds for the multimodal
reasoning process. This training-free approach not only encourages the model to
engage deeply with the reasoning processes inherent in the retrieved content
but also facilitates the generation of answers that are precise and richly
interpretable. Surprisingly, utilizing solely the ScienceQA dataset, collected
from elementary and high school science curricula, RMR significantly boosts the
performance of various vision-language models across a spectrum of benchmark
datasets, including A-OKVQA, MMBench, and SEED. These outcomes highlight the
substantial potential of our multimodal retrieval and reasoning mechanism to
improve the reasoning capabilities of vision-language models.
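The core mechanism the abstract describes — retrieving the most relevant question-answer pairs and prepending them as reasoning scaffolds for the model — can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the toy 4-dimensional vectors stand in for the bi-modal (image + text) embeddings the RMR framework would actually use, and the `retrieve` / `build_prompt` helpers are hypothetical names.

```python
import numpy as np

# Toy knowledge base of question-answer pairs with precomputed embeddings.
# In RMR these would come from a bi-modal encoder over image + question;
# the 4-d vectors here are placeholders for illustration only.
kb = [
    {"q": "Which layer of Earth is liquid?", "a": "The outer core.",
     "emb": np.array([0.9, 0.1, 0.0, 0.1])},
    {"q": "What force pulls objects toward Earth?", "a": "Gravity.",
     "emb": np.array([0.1, 0.9, 0.1, 0.0])},
    {"q": "What gas do plants absorb?", "a": "Carbon dioxide.",
     "emb": np.array([0.0, 0.1, 0.9, 0.1])},
]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve(query_emb, kb, k=2):
    """Return the top-k QA pairs most similar to the query embedding."""
    scored = sorted(kb, key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
    return scored[:k]

def build_prompt(query_text, retrieved):
    """Prepend retrieved QA pairs as reasoning scaffolds for the VLM."""
    scaffold = "\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in retrieved)
    return f"{scaffold}\n\nNow answer:\nQ: {query_text}\nA:"

# Embedding of a new, unseen question (placeholder vector).
query_emb = np.array([0.85, 0.15, 0.05, 0.1])
prompt = build_prompt("Which layer of Earth is molten?",
                      retrieve(query_emb, kb))
print(prompt)
```

Because the approach is training-free, all of the work happens at prompt-construction time: the retrieved exemplars guide the vision-language model's reasoning in-context rather than through fine-tuning.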