VeCLIP: Improving CLIP Training via Visual-enriched Captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao

arXiv (2023)

Abstract

Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts make precise image-text alignment difficult. Existing methods that use large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective pipeline, we scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to +25.2 in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves +3 [...] in the vanilla CLIP and 11 [...] complementary with other well-curated datasets good for zero-shot classification tasks. When combining VeCap and DFN, our model can achieve strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1 [...] the pre-trained models at https://github.com/apple/ml-veclip.
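
The abstract does not spell out how the mixed training scheme combines AltTexts with VeCap. As a rough illustration only, the sketch below assumes one simple possibility: for each image, pick either caption at random with a fixed probability. The names `pick_caption`, `mixed_batch`, and `p_vecap`, and the dictionary keys, are hypothetical and do not come from the paper or its released code.

```python
import random


def pick_caption(alt_text, vecap, p_vecap=0.5, rng=random):
    """Choose one caption for an image in the current training step.

    Assumption (not from the paper): captions are mixed by a per-sample
    coin flip, using the VeCap caption with probability `p_vecap`.
    """
    # Fall back to whichever caption exists if one of them is missing.
    if not vecap:
        return alt_text
    if not alt_text:
        return vecap
    return vecap if rng.random() < p_vecap else alt_text


def mixed_batch(samples, p_vecap=0.5, seed=None):
    """Build (image, caption) pairs from samples carrying both caption types.

    Each sample is a dict with the hypothetical keys
    "image", "alt_text", and "vecap".
    """
    rng = random.Random(seed)
    return [
        (s["image"], pick_caption(s["alt_text"], s["vecap"], p_vecap, rng))
        for s in samples
    ]


if __name__ == "__main__":
    data = [
        {"image": "img0.jpg", "alt_text": "IMG_0042", "vecap": "a dog on a beach"},
        {"image": "img1.jpg", "alt_text": "red bike for sale", "vecap": ""},
    ]
    for image, caption in mixed_batch(data, p_vecap=0.5, seed=0):
        print(image, "->", caption)
```

Sampling per step rather than fixing one caption per image means that, over many epochs, the model would see both the original AltText and the enriched caption for the same image, which is one way to preserve the diversity the abstract mentions.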
