Bidirectional transformer with knowledge graph for video captioning

Multimedia Tools and Applications (2023)

Abstract
Models based on the transformer architecture have risen to prominence for video captioning. However, most models improve only the encoder or the decoder, because improving both simultaneously may amplify the shortcomings of either side. Based on the transformer architecture, we connect a bidirectional decoder to an encoder that integrates fine-grained spatio-temporal features, objects, and the relationships between objects in the video. Experiments show that improvements in the encoder amplify the information leakage of the bidirectional decoder and further degrade the results. To tackle this problem, we generate pseudo reverse captions and propose the Bidirectional Transformer with Knowledge Graph (BTKG), which feeds the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements inside the different encoders according to four modal features of the video. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance on significant metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers that are more in line with human language habits. Code is available at https://github.com/nickchen121/BTKG .
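To make the described arrangement concrete, the sketch below shows, in a minimal PyTorch style, how two separate encoders (one for spatio-temporal features, one for object/relationship features) could feed the forward and backward decoders of a bidirectional transformer, with the reverse decoder trained on a reversed (pseudo reverse) caption. This is an illustrative assumption of the structure outlined in the abstract, not the authors' released implementation; all module names, dimensions, and the way the reverse caption is formed are placeholders.

# Minimal sketch of a bidirectional transformer captioner with two encoders.
# All names, sizes, and the routing of encoder outputs to the two decoders
# are illustrative assumptions, not the BTKG reference code.
import torch
import torch.nn as nn


class BidirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        # Encoder A: fine-grained spatio-temporal (appearance/motion) features.
        self.spatio_temporal_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Encoder B: object features and object-relationship (graph) features.
        self.object_relation_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        # Forward decoder attends to encoder A; backward decoder attends to
        # encoder B, mirroring the idea of routing the two encoder outputs
        # into the two directions of the bidirectional decoder separately.
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, object_feats, caption_fwd, caption_bwd):
        # caption_bwd stands in for the pseudo reverse caption
        # (here simply the forward caption with token order flipped).
        mem_a = self.spatio_temporal_enc(video_feats)
        mem_b = self.object_relation_enc(object_feats)
        T = caption_fwd.size(1)
        # Standard causal mask so each position sees only earlier tokens.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        fwd = self.fwd_decoder(self.embed(caption_fwd), mem_a, tgt_mask=causal)
        bwd = self.bwd_decoder(self.embed(caption_bwd), mem_b, tgt_mask=causal)
        return self.out(fwd), self.out(bwd)


if __name__ == "__main__":
    model = BidirectionalCaptioner()
    video = torch.randn(2, 20, 512)    # e.g. 20 clip-level features per video
    objects = torch.randn(2, 36, 512)  # e.g. 36 detected-object features
    cap = torch.randint(0, 10000, (2, 15))
    logits_fwd, logits_bwd = model(video, objects, cap, cap.flip(1))
    print(logits_fwd.shape, logits_bwd.shape)  # both (2, 15, 10000)

In this sketch the two directions are trained jointly with cross-entropy losses on the forward and reversed targets; how the paper actually fuses or regularizes the two decoders is not specified in the abstract and is left out here.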
Key words
Video captioning, Bidirectional transformer, Knowledge graph, Multimodal video features