
Attentive Snippet Prompting for Video Retrieval

IEEE Transactions on Multimedia (2024)

Abstract
Recent advances in video retrieval have been driven by large-scale visual-language pretraining models. In particular, state-of-the-art approaches are mainly based on temporal extensions of the well-known CLIP model. However, they ignore a critical problem in video retrieval: the text often refers to only a small snippet of the corresponding video. Blindly aggregating all frames inevitably reduces the discriminative capacity of the final video token when it is matched against the text token. Hence, these approaches are limited in their ability to retrieve complex videos with diverse content. To tackle this problem, we propose a concise and novel Attentive Snippet Prompting (ASP) framework, which dynamically exploits the text-relevant video snippet to boost retrieval. Specifically, ASP consists of two simple but effective modules: snippet prompting and video aggregating. Given a text-video pair, snippet prompting uses cross-modal attention to construct a text-driven visual prompt, namely the attentive snippet token, which adaptively describes the video snippet relevant to the text query. In parallel, video aggregating summarizes all frame tokens into a video token that provides the global context. Through the cooperation of the attentive snippet token and the global video token, ASP effectively learns a robust, text-relevant visual representation for video retrieval. Finally, we evaluate the ASP framework on widely used benchmarks, where it outperforms a number of recent approaches by a large margin.
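The two modules described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-head scaled dot-product attention, the mean pooling used for video aggregating, the `alpha` fusion weight, and all function names are assumptions chosen only to make the idea concrete, and the vectors are hand-made rather than CLIP features.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def attentive_snippet_token(text_token, frame_tokens):
    """Assumed form of snippet prompting: the text token is the query and
    the frame tokens are both keys and values (scaled dot-product attention)."""
    scale = math.sqrt(len(text_token))
    weights = softmax([dot(text_token, f) / scale for f in frame_tokens])
    return weighted_sum(weights, frame_tokens), weights

def global_video_token(frame_tokens):
    """Assumed form of video aggregating: plain mean pooling over frames."""
    n = len(frame_tokens)
    return weighted_sum([1.0 / n] * n, frame_tokens)

def fused_similarity(text_token, frame_tokens, alpha=0.5):
    """Hypothetical fusion: convex combination of the snippet token and the
    global token, scored against the text with cosine similarity."""
    snippet, _ = attentive_snippet_token(text_token, frame_tokens)
    video = global_video_token(frame_tokens)
    fused = [alpha * s + (1 - alpha) * v for s, v in zip(snippet, video)]
    return cosine(text_token, fused)

# Toy video: only the third frame matches the text query.
text = [1.0, 0.0]
frames = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

_, attn = attentive_snippet_token(text, frames)
print("attention weights:", attn)
print("mean-pooled similarity:", cosine(text, global_video_token(frames)))
print("fused similarity:", fused_similarity(text, frames))
```

In this toy case the attention weights concentrate on the one matching frame, so the fused representation sits closer to the text than the mean-pooled token alone, which is the effect the abstract attributes to combining the attentive snippet token with the global video token.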
Keywords
Video retrieval, prompting, attention, multi-modal learning