Vision-Language Instruction Tuning: A Review and Analysis.
CoRR (2023)
Abstract
Instruction tuning is an essential supervised training phase for Large
Language Models (LLMs), with the goal of enhancing LLMs' capacity to generalize
instruction execution and adapt to user preferences. As multi-modal data is
increasingly incorporated into LLMs, interest is growing in the performance of
vision-language instruction tuning, which presents more complex features than
text-only instruction tuning. In this paper, we
systematically review the latest vision-language instruction tuning settings
and datasets in multi-modal LLMs and summarize the characteristics that
high-quality vision-language tuning data should have. We consider these
characteristics as the foundational principles for constructing vision-language
instruction data and propose a complete construction pipeline consisting of
data collection, instruction generation, and quality control modules that
incorporate meticulously designed instruction property evaluation indicators.
We perform vision-language instruction tuning on three widely used multi-modal
LLMs using the instruction data we constructed and conduct extensive
experiments on the corresponding metrics to demonstrate the soundness of the
proposed construction principles. The code and dataset related to
this paper have been open-sourced at
https://github.com/palchenli/VL-Instruction-Tuning.