Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
arXiv (2024)
Abstract
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like
vision and audio. However, the progress in these directions has been mostly
focused on tasks that only require a coarse-grained understanding of the
audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a
fine-grained understanding of image and audio both spatially and temporally.
With a new modality alignment module based on optimal transport and a
cross-attention module that enforces audio-visual consistency, Meerkat can
tackle challenging tasks such as audio referred image grounding, image guided
audio temporal localization, and audio-visual fact-checking. Moreover, we
carefully curate a large dataset AVFIT that comprises 3M instruction tuning
samples collected from open-source datasets, and introduce MeerkatBench that
unifies five challenging audio-visual tasks. We achieve state-of-the-art
performance on all these downstream tasks with a relative improvement of up to
37.12%.
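The abstract mentions a modality alignment module based on optimal transport. The paper's actual module is not reproduced here, but the general idea of entropic optimal-transport alignment between two token sets can be sketched as below; the cosine-distance cost, uniform marginals, and function name `sinkhorn_alignment` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn_alignment(audio_feats, image_feats, eps=0.1, n_iters=50):
    """Toy entropic-OT alignment between audio and image token features.

    Illustrative sketch only: the cost function (cosine distance) and
    uniform marginals are assumptions, not Meerkat's actual design.
    """
    # Cosine-distance cost matrix between the two token sets.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    C = 1.0 - a @ v.T                      # shape (n_audio, n_image)

    # Uniform marginals: every token carries equal mass.
    n, m = C.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

    # Standard Sinkhorn scaling iterations.
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        w = nu / (K.T @ u)
        u = mu / (K @ w)
    P = u[:, None] * K * w[None, :]        # transport plan

    # Alignment cost: expected matching distance under the plan.
    return P, float((P * C).sum())
```

The resulting cost could serve as an auxiliary loss that encourages audio tokens to be transportable onto semantically matching image patches; minimizing it pulls the two embedding spaces together.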