A comparative study of language transformers for video question answering.

Neurocomputing (2021)

Cited 9 | Viewed 51
Abstract
With the goal of correctly answering questions about images or videos, visual question answering (VQA) has developed rapidly in recent years. However, current VQA systems focus mainly on answering questions about a single image and face many challenges when answering video-based questions. Video QA must not only understand the evolution across video frames but also requires some understanding of the corresponding subtitles. In this paper, we propose a language-Transformer-based video question answering model to encode the complex semantics of video clips. Unlike previous models, which represent visual features with recurrent neural networks, our model encodes visual concept sequences with a pre-trained language Transformer. We evaluate our model with four language Transformers on two different datasets. The results demonstrate substantial improvements over previous work.
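The core idea above — treating a video's visual concepts and subtitles as one token sequence, encoding it with self-attention, and scoring candidate answers against the pooled representation — can be illustrated with a toy sketch. This is not the authors' implementation: the hash-seeded embeddings, the single untrained attention layer standing in for a pre-trained language Transformer, and all function names (`embed`, `attend`, `encode`, `answer_video_question`) are illustrative assumptions.

```python
import math
import random

DIM = 16  # toy embedding dimension (assumption, not from the paper)

def embed(token, dim=DIM):
    """Deterministic toy embedding: seed an RNG with the token string.
    A real system would use a pre-trained Transformer's embeddings."""
    rng = random.Random(token)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(seq):
    """Single-head scaled dot-product self-attention over a sequence
    of vectors (queries, keys, and values are all the embeddings)."""
    scale = math.sqrt(DIM)
    out = []
    for q in seq:
        weights = softmax([dot(q, k) / scale for k in seq])
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(DIM)])
    return out

def encode(tokens):
    """Embed tokens, apply self-attention, then mean-pool to one vector."""
    hidden = attend([embed(t) for t in tokens])
    n = len(hidden)
    return [sum(h[i] for h in hidden) / n for i in range(DIM)]

def answer_video_question(concepts, subtitles, question, candidates):
    """Fuse visual concept tokens, subtitle tokens, and the question
    into one sequence; pick the candidate answer whose encoding is
    closest (by dot product) to the pooled representation."""
    video_vec = encode(concepts + subtitles + question)
    scores = {a: dot(video_vec, encode([a])) for a in candidates}
    return max(scores, key=scores.get)

answer = answer_video_question(
    concepts=["person", "kitchen", "knife"],
    subtitles=["i", "will", "cut", "the", "bread"],
    question=["what", "is", "being", "cut"],
    candidates=["bread", "car"],
)
```

The design point the sketch mirrors is the one the abstract makes: visual concepts are handled as a token sequence by the same attention machinery used for language, rather than being summarized frame-by-frame with a recurrent network.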
Keywords
Video question answering, Language representation, Language transformers