Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
CoRR (2024)
Abstract
Large vision-language models (LVLMs) have achieved impressive results in
various visual question-answering and reasoning tasks through vision
instruction tuning on specific datasets. However, there is still significant
room for improvement in the alignment between visual and language modalities.
Previous methods to enhance this alignment typically require external models or
data, making them heavily dependent on those models' capabilities and quality,
which inevitably sets an upper bound on performance. In this paper, we propose
SIMA, a framework that enhances visual and language modality alignment through
self-improvement, eliminating the need for external models or data. SIMA
leverages prompts from
existing vision instruction tuning datasets to self-generate responses and
employs an in-context self-critic mechanism to select response pairs for
preference tuning. The key innovation is the introduction of three vision
metrics in the in-context self-critic process, which guide the LVLM toward
selecting responses that improve image comprehension. Through experiments
across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA
not only improves model performance across all benchmarks but also achieves
superior modality alignment, outperforming previous approaches.
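
To make the pipeline concrete, below is a minimal Python sketch of one SIMA-style self-improvement round, based only on the description in the abstract: the model self-generates candidate responses to prompts from an existing vision instruction tuning dataset, critiques them in context, and the resulting chosen/rejected pairs feed a preference-tuning objective. The `lvlm.generate` interface, the function names, and the critic criteria listed in the prompt are all illustrative assumptions; the paper's actual three vision metrics and its training objective are not specified here.

```python
from dataclasses import dataclass

# Hypothetical critic prompt. The three criteria below are placeholders for
# the paper's three vision metrics, which the abstract does not enumerate.
CRITIC_PROMPT = (
    "You are comparing two candidate answers to the same visual question.\n"
    "Judge them on: (1) accuracy of the objects described, (2) correctness of "
    "their attributes, (3) correctness of relations between objects.\n"
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Which answer better reflects the image? Reply with 'A' or 'B'."
)

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def self_generate(lvlm, image, prompt, n=2, temperature=1.0):
    """Sample n candidate responses from the model itself (no external model)."""
    return [lvlm.generate(image, prompt, temperature=temperature) for _ in range(n)]

def self_critic(lvlm, image, prompt, cand_a, cand_b):
    """Ask the same LVLM, in context, which candidate better matches the image."""
    verdict = lvlm.generate(
        image, CRITIC_PROMPT.format(question=prompt, a=cand_a, b=cand_b)
    )
    if verdict.strip().upper().startswith("A"):
        return cand_a, cand_b
    return cand_b, cand_a

def build_preference_data(lvlm, dataset):
    """One self-improvement round: self-generate, self-critique, collect pairs."""
    pairs = []
    for image, prompt in dataset:
        a, b = self_generate(lvlm, image, prompt, n=2)
        chosen, rejected = self_critic(lvlm, image, prompt, a, b)
        pairs.append(PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected))
    return pairs  # feed to a preference-tuning objective such as DPO
```

Because generation, critique, and training all use the same model, no external teacher caps the achievable performance, which is the motivation the abstract gives for the self-improvement design.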