Learning Multi-scale Representations with Single-stream Network for Video Retrieval.

CVPR Workshops (2023)

Abstract
With the explosive growth of video content on the Internet, video retrieval has become an important problem whose solution can benefit video recommendation and copyright detection. Since the key features of a lengthy video may be distributed across distant regions of it, several works have succeeded by exploiting multi-stream, multi-scale architectures to learn and merge distant features. However, a multi-stream network is costly in terms of memory and computing overhead. Both the number of scales and the scales themselves are handcrafted and fixed once a model is finalized. Further, being more complicated, multi-stream networks are more prone to overfitting and tend to generalize more poorly. This paper proposes a single-stream network with built-in dilated spatial and temporal learning capability. Combined with modern techniques, including Denoising Autoencoder, Squeeze-and-Excitation Attention, and Triplet Comparative Mechanism, our model achieves state-of-the-art performance in several video retrieval tasks on the FIVR200K, CC WEB VIDEO, and EVVE datasets.
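
The abstract does not describe the architecture in detail, so the following is only a minimal sketch of the general idea behind dilated temporal learning, not the paper's actual model. It assumes a 1-D sequence of per-frame scalar features; the helper names `dilated_conv1d` and `receptive_field` are hypothetical and chosen for illustration. It shows how stacking dilated convolutions in a single stream lets one path cover distant frames without a separate stream per scale.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated cross-correlation over a sequence.

    Each output sums kernel taps spaced `dilation` steps apart, so a
    small kernel can relate frames that are far apart in time.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    out = []
    for t in range(len(signal) - span):
        out.append(sum(k * signal[t + i * dilation] for i, k in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical per-frame features for a 7-frame clip (illustrative values only).
frames = [1, 2, 3, 4, 5, 6, 7]
print(dilated_conv1d(frames, [1, 1, 1], dilation=2))
print(receptive_field(3, [1, 2, 4]))
```

With a kernel of size 3 and dilations 1, 2, and 4, three stacked layers already span 15 frames, which is the kind of coverage a multi-stream design would otherwise obtain with separately handcrafted scales.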
Keywords
CC WEB VIDEO datasets, copyright detection, denoising autoencoder, dilated spatial learning capability, EVVE datasets, FIVR200K datasets, Internet, multiscale architectures, multiscale representations, multistream network, single-stream network, squeeze-and-excitation attention, temporal learning capability, triplet comparative mechanism, video contents, video recommendation, video retrieval tasks