Adaptive Token Excitation with Negative Selection for Video-Text Retrieval

Artificial Neural Networks and Machine Learning, ICANN 2023, Part VII (2023)

Abstract
Video-text retrieval aims to efficiently retrieve videos from large collections given a text query; methods built on large-scale pre-trained models have drawn sustained attention recently. However, existing methods neglect detailed information in video and text, and thus fail to align cross-modal semantic features well, leading to performance bottlenecks. Meanwhile, the common training strategy often treats semantically similar pairs as negatives, providing the model with incorrect supervision. To address these issues, an adaptive token excitation (ATE) model with negative selection is proposed to adaptively refine the features encoded by a large-scale pre-trained model into more informative features without introducing additional complexity. Specifically, ATE adaptively aggregates and aligns the different events described in text and video using multiple non-linear event blocks. A negative selection strategy is then exploited to mitigate the effect of false negatives, which stabilizes training. Extensive experiments on several datasets demonstrate the feasibility and superiority of the proposed ATE over other state-of-the-art methods. The source code of this work can be found at https://mic.tongji.edu.cn.
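The abstract does not give the exact form of the negative selection strategy, but its stated goal (excluding semantically similar pairs from the set of negatives so they do not provide incorrect supervision) can be illustrated with a minimal sketch. The following is an illustrative InfoNCE-style loss over a batch similarity matrix, in which any negative whose similarity comes within an assumed `margin` of the matched positive is treated as a likely false negative and dropped from the denominator. The function name, the `margin` parameter, and the thresholding rule are all assumptions for illustration, not the paper's actual formulation.

```python
import math

def contrastive_loss_with_negative_selection(sim, margin=0.05):
    """InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j; diagonal
    entries are the matched (positive) pairs. Negatives whose similarity
    comes within `margin` of the positive are treated as likely false
    negatives and excluded from the denominator (hypothetical rule,
    standing in for the paper's negative selection strategy).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Keep the positive plus only "safe" negatives, i.e. those
        # clearly less similar than the matched pair.
        kept = [pos] + [s for j, s in enumerate(sim[i])
                        if j != i and s < pos - margin]
        denom = sum(math.exp(s) for s in kept)
        total += -math.log(math.exp(pos) / denom)
    return total / n

# A near-duplicate caption (0.85 vs. the positive's 0.9) is excluded,
# so the selected loss is lower than when all negatives are kept.
sim = [[0.9, 0.85, 0.1],
       [0.2, 0.8, 0.1],
       [0.0, 0.1, 0.7]]
loss_selected = contrastive_loss_with_negative_selection(sim, margin=0.05)
loss_all = contrastive_loss_with_negative_selection(sim, margin=-1.0)
```

Using a negative `margin` keeps every negative, which recovers the ordinary contrastive loss and makes the effect of the selection easy to compare.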
Key words
Video-text retrieval, Adaptive token excitation, Negative selection