FlexCap: Generating Rich, Localized, and Flexible Captions in Images
arXiv (2024)
Abstract
We introduce a versatile flexible-captioning vision-language model
(VLM) capable of generating region-specific descriptions of varying lengths.
The model, FlexCap, is trained to produce length-conditioned captions for input
bounding boxes, and this allows control over the information density of its
output, with descriptions ranging from concise object labels to detailed
captions. To achieve this, we create large-scale training datasets of image
region descriptions of varying lengths, starting from captioned images. This
flexible-captioning capability has several valuable applications.
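
Since FlexCap's weights and interface are not exposed through a standard library, the following Python sketch only illustrates the core idea of box- and length-conditioned captioning; the model object, its `generate` method, and all parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
    target_length: int              # desired caption length, e.g. in words

def caption_regions(model, image, regions: List[Region]) -> List[str]:
    """Generate one caption per box, conditioned on the requested length.

    Small target lengths yield terse object labels; large ones yield
    detailed descriptions. This is the controllable information density
    described above.
    """
    return [
        model.generate(image, box=r.box, length=r.target_length)  # assumed API
        for r in regions
    ]
```

Captioning the same box at lengths of, say, 2, 8, and 32 words would move the output from a bare label ("a dog") to an attribute-rich sentence.
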
First, FlexCap demonstrates superior performance in dense captioning tasks on
the Visual Genome dataset. Second, a visual question answering (VQA) system can
be built by employing FlexCap to generate localized descriptions as inputs to a
large language model (a minimal sketch of this pipeline appears after the
abstract). The resulting system achieves state-of-the-art zero-shot performance
on a number of VQA datasets. We also demonstrate that a localize-then-describe
approach with FlexCap can be better at open-ended object detection than a
describe-then-localize approach with other VLMs. We highlight a novel
characteristic of FlexCap: its ability to extract diverse visual information
through prefix conditioning.
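
As a concrete illustration of prefix conditioning, the sketch below seeds the caption decoder with a text prefix so that the completion describes a particular attribute of the region; the `prefix` keyword and the `model.generate` call are assumptions for illustration, not a published API.

```python
# Assumed interface: model.generate completes a caption for the given box,
# continuing from the supplied text prefix.
ATTRIBUTE_PREFIXES = {
    "color":    "The color of the object is",
    "material": "The object is made of",
    "action":   "The person in the image is",
}

def extract_attributes(model, image, box):
    """Query one region with several prefixes to pull out distinct attributes."""
    return {
        name: model.generate(image, box=box, prefix=prefix)
        for name, prefix in ATTRIBUTE_PREFIXES.items()
    }
```
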
Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks
such as image labeling, object attribute recognition, and visual dialog.
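
Below is a minimal sketch of the caption-then-reason VQA pipeline mentioned above, assuming hypothetical `detector`, `flexcap`, and `llm` objects; none of these names or methods come from the paper's released code.

```python
def answer_question(detector, flexcap, llm, image, question: str) -> str:
    """Zero-shot VQA: describe localized regions, then reason over text."""
    boxes = detector.propose_boxes(image)  # candidate regions (assumed API)
    captions = [
        flexcap.generate(image, box=b, length=12)  # localized descriptions
        for b in boxes
    ]
    prompt = (
        "Image regions are described below:\n"
        + "\n".join(f"- {c}" for c in captions)
        + f"\nQuestion: {question}\nAnswer:"
    )
    return llm.complete(prompt)  # an off-the-shelf LLM answers from text alone
```
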
Project webpage: https://flex-cap.github.io .