Adaptive Token Excitation with Negative Selection for Video-Text Retrieval

Artificial Neural Networks and Machine Learning, ICANN 2023, Part VII (2023)

Abstract
Video-text retrieval aims to efficiently retrieve videos from large collections given a text query; methods built on large-scale pre-trained models have drawn sustained attention recently. However, existing methods neglect detailed information in video and text, and thus fail to align cross-modal semantic features well, leading to performance bottlenecks. Meanwhile, the common training strategy often treats semantically similar pairs as negatives, providing the model with incorrect supervision. To address these issues, an adaptive token excitation (ATE) model with negative selection is proposed to adaptively refine the features encoded by a large-scale pre-trained model into more informative features without introducing additional complexity. Specifically, ATE adaptively aggregates and aligns the different events described in text and video using multiple non-linear event blocks. A negative selection strategy is then exploited to mitigate the effect of false negatives, which stabilizes training. Extensive experiments on several datasets demonstrate the feasibility and superiority of the proposed ATE over other state-of-the-art methods. The source code of this work can be found at https://mic.tongji.edu.cn.
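The abstract does not give the exact form of the negative selection strategy, but its stated goal (excluding semantically similar pairs from the set of negatives so they do not provide incorrect supervision) can be illustrated with a minimal sketch. The following is an illustrative InfoNCE-style loss over a batch similarity matrix, in which any negative whose similarity comes within an assumed `margin` of the matched positive is treated as a likely false negative and dropped from the denominator. The function name, the `margin` parameter, and the thresholding rule are all assumptions for illustration, not the paper's actual formulation.

```python
import math

def contrastive_loss_with_negative_selection(sim, margin=0.05):
    """InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j; diagonal
    entries are the matched (positive) pairs. Negatives whose similarity
    comes within `margin` of the positive are treated as likely false
    negatives and excluded from the denominator (hypothetical rule,
    standing in for the paper's negative selection strategy).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Keep the positive plus only "safe" negatives, i.e. those
        # clearly less similar than the matched pair.
        kept = [pos] + [s for j, s in enumerate(sim[i])
                        if j != i and s < pos - margin]
        denom = sum(math.exp(s) for s in kept)
        total += -math.log(math.exp(pos) / denom)
    return total / n

# A near-duplicate caption (0.85 vs. the positive's 0.9) is excluded,
# so the selected loss is lower than when all negatives are kept.
sim = [[0.9, 0.85, 0.1],
       [0.2, 0.8, 0.1],
       [0.0, 0.1, 0.7]]
loss_selected = contrastive_loss_with_negative_selection(sim, margin=0.05)
loss_all = contrastive_loss_with_negative_selection(sim, margin=-1.0)
```

Using a negative `margin` keeps every negative, which recovers the ordinary contrastive loss and makes the effect of the selection easy to compare.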
Key words
Video-text retrieval, Adaptive token excitation, Negative selection