Bidirectional transformer with knowledge graph for video captioning

Multimedia Tools and Applications (2023)

Abstract
Models based on the transformer architecture have risen to prominence for video captioning. However, most models improve only the encoder or the decoder, because improving both simultaneously may amplify the shortcomings of either side. Based on the transformer architecture, we connect a bidirectional decoder to an encoder that integrates fine-grained spatio-temporal features, objects, and the relationships between objects in the video. Experiments show that improvements in the encoder amplify the information leakage of the bidirectional decoder and further degrade the results. To tackle this problem, we generate pseudo reverse captions and propose the Bidirectional Transformer with Knowledge Graph (BTKG), which feeds the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements inside the different encoders according to four modal features of the video. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance on significant metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers that are more in line with human language habits. Code is available at https://github.com/nickchen121/BTKG .
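To make the described arrangement concrete, the sketch below shows, in a minimal PyTorch style, how two separate encoders (one for spatio-temporal features, one for object/relationship features) could feed the forward and backward decoders of a bidirectional transformer, with the reverse decoder trained on a reversed (pseudo reverse) caption. This is an illustrative assumption of the structure outlined in the abstract, not the authors' released implementation; all module names, dimensions, and the way the reverse caption is formed are placeholders.

# Minimal sketch of a bidirectional transformer captioner with two encoders.
# All names, sizes, and the routing of encoder outputs to the two decoders
# are illustrative assumptions, not the BTKG reference code.
import torch
import torch.nn as nn


class BidirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        # Encoder A: fine-grained spatio-temporal (appearance/motion) features.
        self.spatio_temporal_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Encoder B: object features and object-relationship (graph) features.
        self.object_relation_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        # Forward decoder attends to encoder A; backward decoder attends to
        # encoder B, mirroring the idea of routing the two encoder outputs
        # into the two directions of the bidirectional decoder separately.
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, object_feats, caption_fwd, caption_bwd):
        # caption_bwd stands in for the pseudo reverse caption
        # (here simply the forward caption with token order flipped).
        mem_a = self.spatio_temporal_enc(video_feats)
        mem_b = self.object_relation_enc(object_feats)
        T = caption_fwd.size(1)
        # Standard causal mask so each position sees only earlier tokens.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        fwd = self.fwd_decoder(self.embed(caption_fwd), mem_a, tgt_mask=causal)
        bwd = self.bwd_decoder(self.embed(caption_bwd), mem_b, tgt_mask=causal)
        return self.out(fwd), self.out(bwd)


if __name__ == "__main__":
    model = BidirectionalCaptioner()
    video = torch.randn(2, 20, 512)    # e.g. 20 clip-level features per video
    objects = torch.randn(2, 36, 512)  # e.g. 36 detected-object features
    cap = torch.randint(0, 10000, (2, 15))
    logits_fwd, logits_bwd = model(video, objects, cap, cap.flip(1))
    print(logits_fwd.shape, logits_bwd.shape)  # both (2, 15, 10000)

In this sketch the two directions are trained jointly with cross-entropy losses on the forward and reversed targets; how the paper actually fuses or regularizes the two decoders is not specified in the abstract and is left out here.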
Key words
Video captioning, Bidirectional transformer, Knowledge graph, Multimodal video features