VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

EMNLP 2021

Abstract
We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest-neighbor retrieval. Our experiments on a diverse set of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
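
As an illustrative sketch only (not the authors' released implementation), the symmetric video-text contrastive objective described in the abstract can be written roughly as below in PyTorch. The function name, temperature value, and batch layout are assumptions for illustration; VideoCLIP's retrieval-augmented hard negatives are indicated only in the comments.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) tensors; row i of each is a
    # temporally overlapping (positive) video-text pair. All other rows in
    # the batch act as negatives; in VideoCLIP these in-batch negatives are
    # made "hard" by drawing the batch from nearest-neighbor-retrieved videos.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE: video-to-text and text-to-video cross-entropy.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)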
Keywords
VideoCLIP, understanding, pre-training, zero-shot, video-text