VeCLIP: Improving CLIP Training via Visual-enriched Captions
arXiv (2023)
Abstract
Large-scale web-crawled datasets are fundamental for the success of
pre-training vision-language models, such as CLIP. However, the inherent noise
and potential irrelevance of web-crawled AltTexts pose challenges in achieving
precise image-text alignment. Existing methods utilizing large language models
(LLMs) for caption rewriting have shown promise on small, curated datasets like
CC3M and CC12M. This study introduces a scalable pipeline for noisy caption
rewriting. Unlike recent LLM rewriting techniques, we emphasize the
incorporation of visual concepts into captions, termed Visual-enriched
Captions (VeCap). To ensure data diversity, we propose a novel mixed training
scheme that optimizes the utilization of AltTexts alongside newly generated
VeCap. We showcase the adaptation of this method for training CLIP on
large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective
pipeline, we effortlessly scale our dataset up to 300 million samples, named
the VeCap dataset. Our results show significant advantages in image-text alignment
and overall model performance. For example, VeCLIP achieves up to a +25.2%
gain in COCO and Flickr30k retrieval tasks under the 12M setting. For data
efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed
in the vanilla CLIP and 11% in ALIGN. We also note the VeCap data is
complementary with other well-curated datasets good for zero-shot
classification tasks. When combining VeCap and DFN, our model can achieve
strong performance on both image-text retrieval and zero-shot classification
tasks, e.g. 83.1% accuracy@ImageNet zero-shot for an H/14 model. We release
the pre-trained models at https://github.com/apple/ml-veclip.
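The abstract's mixed training scheme pairs each image with both its original AltText and its rewritten VeCap caption. A minimal sketch of one plausible reading, where one caption source is sampled per image at each training step, is below; the function name, the per-source sampling probability, and the uniform sampling rule are all assumptions, not details from the paper.

```python
import random

def sample_caption(alt_text: str, vecap: str, p_vecap: float = 0.5) -> str:
    """Hypothetical mixing rule: pick one caption source per image per step.

    With probability p_vecap the LLM-rewritten VeCap caption is used;
    otherwise the original web-crawled AltText is kept, preserving the
    diversity of the raw data alongside the cleaner rewritten captions.
    """
    return vecap if random.random() < p_vecap else alt_text

# Example usage during batch construction:
caption = sample_caption(
    alt_text="photo img_001.jpg",
    vecap="a brown dog catching a frisbee on a grassy field",
)
```

In practice the sampling probability would be a tunable hyperparameter; the paper itself should be consulted for the exact mixing ratio and schedule.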