CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
arXiv (2023)
Abstract
Contrastive vision-language models, such as CLIP, have garnered considerable
attention for various downstream tasks, mainly due to the remarkable
generalization ability of their learned features. However, the features they
learn often blend content and style information, which limits their
generalization capabilities under distribution shifts. To address this
limitation, we adopt a causal generative perspective on multimodal data and
propose contrastive learning with data augmentation to disentangle content
features from the original representations. To achieve this, we begin by
exploring image augmentation techniques and develop a method to seamlessly
integrate them into pre-trained CLIP-like models to extract pure content
features. Taking a step further, recognizing the inherent semantic richness and
logical structure of text data, we explore the use of text augmentation to
isolate latent content from style features. This enables the encoders of
CLIP-like models to concentrate on latent content information, refining the
representations learned by pre-trained CLIP-like models. Our extensive experiments
across diverse datasets demonstrate significant improvements in zero-shot and
few-shot classification tasks, alongside enhanced robustness to various
perturbations. These results underscore the effectiveness of our proposed
methods in refining vision-language representations and advancing the
state-of-the-art in multimodal learning.
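The abstract does not include implementation details, but the core objective it describes, pulling together the representations of two augmented views of the same sample while pushing apart different samples, is commonly realized with a symmetric InfoNCE loss. Below is a minimal NumPy sketch of that loss; the function name, the symmetric two-direction form, and the temperature value are illustrative assumptions, not CLAP's actual code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project embeddings onto the unit sphere so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z_a, z_b: (N, D) embeddings of two augmented views of the same N samples.
    Row i of z_a and row i of z_b form a positive pair; all other rows in the
    batch serve as negatives. (Hypothetical sketch, not the paper's code.)
    """
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)

    def one_direction(q, k):
        logits = q @ k.T / temperature                      # (N, N) similarities
        logits -= logits.max(axis=1, keepdims=True)         # stabilize exp
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                  # positives on diagonal

    # average both contrast directions (a vs. b and b vs. a)
    return 0.5 * (one_direction(z_a, z_b) + one_direction(z_b, z_a))
```

In a setup like the one the abstract describes, `z_a` and `z_b` would come from a lightweight head on top of frozen pre-trained CLIP encoders, with the two views produced by image augmentations or style-varied text prompts; minimizing this loss encourages the head to keep the content shared across views and discard the style that differs between them.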