Instruction-Guided Visual Masking
CoRR(2024)
摘要
Instruction following is crucial in contemporary LLM. However, when extended
to multimodal setting, it often suffers from misalignment between specific
textual instruction and targeted local region of an image. To achieve more
accurate and nuanced multimodal instruction following, we introduce
Instruction-guided Visual Masking (IVM), a new versatile visual grounding model
that is compatible with diverse multimodal models, such as LMM and robot model.
By constructing visual masks for instruction-irrelevant regions, IVM-enhanced
multimodal models can effectively focus on task-relevant image regions to
better align with complex instructions. Specifically, we design a visual
masking data generation pipeline and create an IVM-Mix-1M dataset with 1
million image-instruction pairs. We further introduce a new learning technique,
Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training
that prioritizes high-quality data samples. Experimental results on generic
multimodal tasks such as VQA and embodied robotic control demonstrate the
versatility of IVM, which as a plug-and-play tool, significantly boosts the
performance of diverse multimodal models, yielding new state-of-the-art results
across challenging multimodal benchmarks. Code is available at
https://github.com/2toinf/IVM.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要