Towards accurate unsupervised video captioning with implicit visual feature injection and explicit

Yunjie Zhang, Tianyang Xu, Xiaoning Song, Xue-Feng Zhu, Zhenghua Feng, Xiao-Jun Wu

Pattern Recognition Letters (2024)

Abstract
In the realm of video captioning, acquiring large amounts of high-quality aligned video-text pairs remains laborious, impeding practical applications. We therefore explore modelling techniques for unsupervised video captioning. Generating captions from text inputs that resemble the video representation has been a successful unsupervised video captioning strategy in the past. However, this setting relies solely on textual data for training, neglecting vital visual cues related to the spatio-temporal appearance of the video. The absence of visual information increases the risk of generating erroneous video captions. In view of this, we propose a novel unsupervised video captioning method that introduces visual information related to text-feature keywords to implicitly enhance training for the text generation task. Simultaneously, our method incorporates sentence keywords to explicitly augment the training process. By injecting additional implicit visual features and explicit keywords into the model, the generated captions acquire more accurate semantics. Experimental analysis demonstrates the merit of the proposed formulation, achieving superior performance against state-of-the-art unsupervised studies.
Keywords
Unsupervised video captioning, Text generation, Visual information, Sentence keywords