Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss
CoRR (2024)
Abstract
The fusion of vision and language has brought about a transformative shift in
computer vision through the emergence of Vision-Language Models (VLMs).
However, the resource-intensive nature of existing VLMs poses a significant
challenge. We need an accessible method for developing the next generation of
VLMs. To address this issue, we propose Zoom-shot, a novel method for
transferring the zero-shot capabilities of CLIP to any pre-trained vision
encoder. We do this by exploiting the multimodal information (i.e. text and
image) present in the CLIP latent space through the use of specifically
designed multimodal loss functions. These loss functions are (1)
cycle-consistency loss and (2) our novel prompt-guided knowledge distillation
loss (PG-KD). PG-KD combines the concept of knowledge distillation with CLIP's
zero-shot classification, to capture the interactions between text and image
features. With our multimodal losses, we train a linear mapping between the CLIP latent space and the latent space of a pre-trained vision encoder, for only a single epoch. Furthermore, Zoom-shot is entirely unsupervised and is trained using unpaired data.
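
For intuition, this training setup can be sketched in a few lines of PyTorch. The sketch below is based only on the description above: the feature dimensions, loss formulations, softmax temperature, optimiser, and the use of a linear map in each direction (implied by the cycle-consistency term) are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

# Assumed feature dimensions and number of text prompts.
d_enc, d_clip, n_prompts = 768, 512, 100

# The linear mapping between the two latent spaces; the cycle-consistency
# term suggests a map in each direction.
map_to_clip = torch.nn.Linear(d_enc, d_clip)
map_from_clip = torch.nn.Linear(d_clip, d_enc)
opt = torch.optim.Adam(
    list(map_to_clip.parameters()) + list(map_from_clip.parameters()),
    lr=1e-4)  # assumed learning rate

def cycle_consistency_loss(z_enc):
    # Encoder features -> CLIP space -> back again; penalise the
    # reconstruction error (an L2 formulation, assumed here).
    return F.mse_loss(map_from_clip(map_to_clip(z_enc)), z_enc)

def pg_kd_loss(z_enc, z_clip, text_feats, tau=0.07):
    # Prompt-guided knowledge distillation, sketched: CLIP's zero-shot
    # distribution over a set of text prompts acts as the teacher, and
    # the mapped encoder features act as the student.
    text_feats = F.normalize(text_feats, dim=-1)
    teacher = F.normalize(z_clip, dim=-1) @ text_feats.T / tau
    student = F.normalize(map_to_clip(z_enc), dim=-1) @ text_feats.T / tau
    return F.kl_div(F.log_softmax(student, dim=-1),
                    F.softmax(teacher, dim=-1), reduction='batchmean')

# Stand-ins for one batch of features from the frozen target encoder,
# the frozen CLIP image encoder, and CLIP text features for the prompts.
z_enc, z_clip = torch.randn(32, d_enc), torch.randn(32, d_clip)
text_feats = torch.randn(n_prompts, d_clip)

loss = cycle_consistency_loss(z_enc) + pg_kd_loss(z_enc, z_clip, text_feats)
opt.zero_grad()
loss.backward()
opt.step()

Note that no labels or image-text pairs appear anywhere above, matching the unsupervised, unpaired training described in the abstract; looping this step once over a dataset corresponds to the single epoch.
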
We test the zero-shot capabilities of a range of vision encoders augmented as new VLMs, on coarse- and fine-grained classification datasets, outperforming the previous state-of-the-art in this problem domain. In our ablations, we find Zoom-shot allows for a trade-off between data and compute during training; our state-of-the-art results can be obtained by reducing training from 20% to 1% of the ImageNet training data with 20 epochs. All code and models are available on GitHub.
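
Once trained, the mapping turns the vision encoder into a new VLM. Zero-shot classification with it could look roughly like this, continuing the sketch above (vision_encoder, image, and class_text_feats are hypothetical stand-ins for a frozen encoder, an input batch, and CLIP text features for one prompt per class):

with torch.no_grad():
    z = map_to_clip(vision_encoder(image))  # encoder features into CLIP space
    z = F.normalize(z, dim=-1)
    logits = z @ F.normalize(class_text_feats, dim=-1).T
    prediction = logits.argmax(dim=-1)      # index of the best-matching prompt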