Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
CoRR(2024)
摘要
Knowledge-based visual question answering (KVQA) has been extensively studied
to answer visual questions with external knowledge, e.g., knowledge graphs
(KGs). While several attempts have been proposed to leverage large language
models (LLMs) as an implicit knowledge source, it remains challenging since
LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g.,
images, KGs and LLMs, cannot be readily aligned for complex scenarios. To
tackle these, we present a novel modality-aware integration with LLMs for KVQA
(MAIL). It carefully leverages multimodal knowledge for both image
understanding and knowledge reasoning. Specifically, (i) we propose a two-stage
prompting strategy with LLMs to densely embody the image into a scene graph
with detailed visual features; (ii) We construct a coupled concept graph by
linking the mentioned entities with external facts. (iii) A tailored
pseudo-siamese graph medium fusion is designed for sufficient multimodal
fusion. We utilize the shared mentioned entities in two graphs as mediums to
bridge a tight inter-modal exchange, while maximally preserving insightful
intra-modal learning by constraining the fusion within mediums. Extensive
experiments on two benchmark datasets show the superiority of MAIL with 24x
less resources.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要