Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval.

ICME (2021)

Abstract
The video-text retrieval task has attracted increasing attention due to the rapid growth of videos on the Internet. Existing works adopt various networks to encode videos and texts into a common latent space and compute their similarities. However, most of them neither mine the significant frames of a video nor account for differences among the dimensions of word representations, which leads to unsatisfactory retrieval results. In this paper, we propose a Multi-Dimensional Attentive Hierarchical Graph Pooling Network (MAGP) to learn improved representations for video-text retrieval. Specifically, we design a novel hierarchical graph pooling method that extracts significant frames from videos and discards unrelated ones, so that the model learns hierarchical and discriminative video representations. Moreover, a multi-dimensional attention mechanism is used in the text encoder to strengthen its representation ability through dimension-level attention. Experimental results on three video-text datasets demonstrate that our MAGP model outperforms state-of-the-art models.
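The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give MAGP's actual formulation, so everything below is an assumption: `topk_frame_pool` is a generic score-and-keep-top-k pooling step over a frame graph, and `multi_dim_attention` is a generic dimension-level attention that normalises over words separately for each feature dimension. The names `p`, `ratio`, and `score_fn` are hypothetical stand-ins for learned parameters.

```python
import math

def softmax(values):
    # Numerically stable softmax over a plain list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def topk_frame_pool(frames, adj, p, ratio=0.5):
    """One pooling step: score frames, keep the top fraction, gate their features.

    frames: n frame vectors of length d; adj: n x n adjacency; p: projection
    vector of length d (hypothetical stand-in for a learned parameter).
    """
    norm = math.sqrt(sum(x * x for x in p))
    scores = [sum(f * q for f, q in zip(frame, p)) / norm for frame in frames]
    k = max(1, math.ceil(ratio * len(frames)))
    ranked = sorted(range(len(frames)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])  # restore temporal order of the kept frames
    pooled = [[math.tanh(scores[i]) * x for x in frames[i]] for i in keep]
    pooled_adj = [[adj[i][j] for j in keep] for i in keep]
    return pooled, pooled_adj, keep

def multi_dim_attention(words, score_fn):
    """Dimension-level attention: each feature dimension gets its own
    softmax over the words, instead of one scalar weight per word.

    score_fn maps a word vector to d per-dimension scores (hypothetical
    stand-in for a learned layer).
    """
    scores = [score_fn(w) for w in words]  # n x d
    n, d = len(words), len(words[0])
    columns = [softmax([scores[i][j] for i in range(n)]) for j in range(d)]  # d x n
    sentence = [sum(columns[j][i] * words[i][j] for i in range(n)) for j in range(d)]
    return sentence, columns

# Toy inputs, chosen only for illustration.
frames = [[1.0, 0.0], [0.1, 0.0], [2.0, 0.0], [-1.0, 0.0]]
adj = [[1] * 4 for _ in range(4)]
pooled, pooled_adj, keep = topk_frame_pool(frames, adj, p=[1.0, 0.0], ratio=0.5)

words = [[1.0, 0.0], [0.0, 1.0]]
sentence, columns = multi_dim_attention(words, lambda w: [2 * x for x in w])
```

Stacking several `topk_frame_pool` calls with a shrinking frame set would give a hierarchy of progressively coarser video representations, which is one plausible reading of "hierarchical" here.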
Keywords
Cross-modal retrieval, video-text retrieval, graph neural network