Animating Images to Transfer CLIP for Video-Text Retrieval

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022)

Abstract
Recent works show the possibility of transferring the CLIP (Contrastive Language-Image Pretraining) model to video-text retrieval with promising performance. However, due to the domain gap between static images and videos, CLIP-based video-text retrieval models with interaction-based matching perform far worse than models with representation-based matching. In this paper, we propose a novel image animation strategy that transfers the image-text CLIP model to video-text retrieval effectively. By imitating video shooting components, we convert widely used image-text corpora into synthesized video-text data for pretraining. To reduce the time complexity of interaction-based matching, we further propose a coarse-to-fine framework consisting of dual encoders for fast candidate search and a cross-modality interaction module for fine-grained re-ranking. The coarse-to-fine framework, together with the synthesized video-text pretraining, provides significant gains in retrieval accuracy while preserving efficiency. Comprehensive experiments on the MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of our approach.
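To make the image animation idea concrete, the sketch below turns a static image into a short pseudo-video clip by simulating simple camera motion (a linear pan-and-zoom crop schedule), so that an existing image-caption pair can serve as a synthesized video-text training sample. The function name, frame count, and crop schedule are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from PIL import Image

def animate_image(image: Image.Image,
                  num_frames: int = 12,
                  out_size: tuple = (224, 224),
                  zoom: float = 0.7) -> list:
    """Return `num_frames` crops that pan across the image while zooming in.

    Hypothetical helper illustrating one way to 'animate' a static image
    into video-like frames; not the paper's exact implementation.
    """
    w, h = image.size
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        # Crop size shrinks from the full image down to `zoom` of it (zoom-in).
        scale = 1.0 - (1.0 - zoom) * t
        cw, ch = int(w * scale), int(h * scale)
        # Crop origin slides from the top-left toward the bottom-right (pan).
        left = int((w - cw) * t)
        top = int((h - ch) * t)
        crop = image.crop((left, top, left + cw, top + ch))
        frames.append(crop.resize(out_size, Image.BILINEAR))
    return frames

# Usage: each (frames, caption) pair acts as one synthesized video-text example
# for pretraining a video-text retrieval model.
# clip_frames = animate_image(Image.open("example.jpg"))
```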
Keywords
Video-Text Retrieval, CLIP, Image Animation, Coarse-to-Fine Retrieval