Visual Relation-Aware Unsupervised Video Captioning.

International Conference on Artificial Neural Networks and Machine Learning (ICANN), 2022

Abstract
Unsupervised video captioning aims to describe videos using only unlabeled videos and a sentence corpus, without relying on human-annotated video-sentence pairs. A straightforward approach is to borrow from unsupervised image captioning methods, which resort to pseudo captions retrieved via visual concepts detected in the image. However, directly applying this methodology to the video domain leads to sub-optimal performance, since visual concepts cannot represent the main video content accurately and completely. Moreover, these methods do not account for the noise introduced by words in the pseudo captions that are unrelated to the visual concepts. In this paper, we propose a visual relation-aware unsupervised video captioning method that retrieves pseudo captions using visual relations. Based on these pseudo captions, we train the proposed visual relation-aware captioning model. Specifically, the model is designed to focus on learning from dependable words, i.e., those corresponding to the detected relation triplets. Extensive experimental results on two public benchmarks show the effectiveness and significance of our method.
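
To make the pipeline concrete, the Python sketch below illustrates one way the two ideas in the abstract could be realized: retrieving pseudo captions by matching detected relation triplets against a sentence corpus, and down-weighting caption words that are not grounded in any detected relation. All function names, the word-overlap scoring, and the 0.2 weight for non-relation words are our own illustrative assumptions, not the paper's actual retrieval or weighting scheme.

    # Illustrative sketch only; the paper's actual retrieval and
    # word-weighting schemes may differ.
    from typing import List, Tuple

    Triplet = Tuple[str, str, str]  # (subject, predicate, object)

    def triplet_score(sentence: str, triplets: List[Triplet]) -> int:
        """Count how many words from the detected relation triplets
        appear in the candidate sentence (a simple overlap heuristic)."""
        tokens = set(sentence.lower().split())
        return sum(word in tokens for t in triplets for word in t)

    def retrieve_pseudo_captions(corpus: List[str],
                                 triplets: List[Triplet],
                                 k: int = 2) -> List[str]:
        """Rank corpus sentences by triplet-word overlap and keep the
        top-k as pseudo captions for the video."""
        ranked = sorted(corpus,
                        key=lambda s: triplet_score(s, triplets),
                        reverse=True)
        return ranked[:k]

    def word_weights(caption: str, triplets: List[Triplet]) -> List[float]:
        """Give full weight to 'dependable' words that occur in a detected
        relation triplet and a small weight (0.2, an assumed value) to the
        rest, so training focuses on relation-grounded words."""
        relation_words = {word for t in triplets for word in t}
        return [1.0 if w.lower() in relation_words else 0.2
                for w in caption.split()]

    if __name__ == "__main__":
        detected = [("man", "riding", "horse")]
        corpus = [
            "a man is riding a horse across a field",
            "two dogs play in the park",
            "a woman rides a bicycle downtown",
        ]
        pseudo = retrieve_pseudo_captions(corpus, detected, k=1)
        print(pseudo)                             # best-matching pseudo caption
        print(word_weights(pseudo[0], detected))  # per-word loss weights

In this toy run, the first corpus sentence scores highest because it contains all three triplet words, and the resulting weight vector emphasizes "man", "riding", and "horse" while suppressing function words, mirroring the noise-reduction motivation stated in the abstract.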
Keywords
Video captioning, Visual relation, Unsupervised learning