Generalized pyramid co-attention with learnable aggregation net for video question answering

Pattern Recognition (2021)

Abstract
• To handle the complexity of videos in V-VQA, we propose a generalized pyramid co-attention mechanism with diversity learning that explicitly encourages accurate and diverse attention maps. We explore two instantiations of this generalized module: Multi-path Pyramid Co-attention with diversity learning (MPC) and Cascaded Pyramid Transformer Co-attention with diversity learning (CPTC). This strategy helps the model capture distinct, complementary and informative features.
• To aggregate sequential features without destroying the feature distributions and temporal information, we propose a new learnable aggregation component. It imitates the Bags-of-Words (BoW) quantization mechanism to automatically aggregate adaptively weighted frame-level (or word-level) features.
• We extensively evaluate the effectiveness of the overall model on two publicly available datasets (i.e., TGIF-QA and TVQA) for the V-VQA task. The experimental results demonstrate that our model outperforms the existing state of the art by a large margin and that our extended CPTC performs better than MPC. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
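The abstract summarizes two reusable ideas: parallel attention paths that are explicitly pushed to stay diverse, and a BoW-style learnable aggregation of frame-level features. The PyTorch sketch below illustrates both under assumed shapes and hyperparameters; the class names, the cosine-similarity diversity penalty, and values such as num_paths and num_words are illustrative choices, not taken from the paper. For the authors' actual model, see the linked repository.

```python
# Minimal sketch of diversity-regularized multi-path co-attention and a
# learnable BoW-style aggregation; shapes and hyperparameters are assumptions,
# not the released LAD-Net implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPathCoAttention(nn.Module):
    """Parallel attention 'paths' over frame features, conditioned on a
    question vector, with a penalty that discourages the paths from
    producing near-identical attention maps."""

    def __init__(self, dim: int, num_paths: int = 3):
        super().__init__()
        self.paths = nn.ModuleList(
            [nn.Linear(2 * dim, 1) for _ in range(num_paths)]
        )

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames:   (B, T, D) frame-level features
        # question: (B, D)    pooled question feature
        B, T, D = frames.shape
        q = question.unsqueeze(1).expand(B, T, D)
        joint = torch.cat([frames, q], dim=-1)             # (B, T, 2D)

        attn_maps, attended = [], []
        for path in self.paths:
            a = F.softmax(path(joint).squeeze(-1), dim=1)  # (B, T)
            attn_maps.append(a)
            attended.append(torch.bmm(a.unsqueeze(1), frames).squeeze(1))

        A = torch.stack(attn_maps, dim=1)                  # (B, P, T)
        # Diversity penalty: mean pairwise cosine similarity between the
        # attention maps of different paths (one simple regularizer choice).
        A_norm = F.normalize(A, dim=-1)
        sim = torch.bmm(A_norm, A_norm.transpose(1, 2))    # (B, P, P)
        P = sim.size(1)
        off_diag = sim - torch.eye(P, device=sim.device)
        diversity_loss = off_diag.abs().sum(dim=(1, 2)).mean() / (P * (P - 1))

        fused = torch.stack(attended, dim=1).mean(dim=1)   # (B, D)
        return fused, diversity_loss


class LearnableBoWAggregation(nn.Module):
    """BoW-style aggregation: frames are softly assigned to learnable
    'codewords', and the assignment weights pool frame features into a
    fixed-size video descriptor without hard temporal pooling."""

    def __init__(self, dim: int, num_words: int = 16):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_words, dim) * 0.02)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) -> descriptor: (B, K * D)
        assign = F.softmax(frames @ self.codewords.t(), dim=-1)  # (B, T, K)
        pooled = assign.transpose(1, 2) @ frames                 # (B, K, D)
        return pooled.flatten(1)


if __name__ == "__main__":
    frames = torch.randn(2, 20, 256)   # 2 clips, 20 frames, 256-d features
    question = torch.randn(2, 256)
    fused, div_loss = MultiPathCoAttention(dim=256)(frames, question)
    video_vec = LearnableBoWAggregation(dim=256)(frames)
    print(fused.shape, div_loss.item(), video_vec.shape)
```

In a full model, diversity_loss would be added to the question-answering objective with a small weight; the cascaded transformer variant (CPTC) described in the abstract would stack co-attention blocks hierarchically rather than averaging parallel paths as done here.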
Keywords
Video question answering, Diversity learning, Learnable aggregation, Cascaded pyramid transformer co-attention