Heterogeneous Interactive Graph Network for Audio–Visual Question Answering

Knowledge-Based Systems (2024)

Abstract
Audio–visual question answering (AVQA) is an emerging task that aims to provide answers by integrating visual content, audio streams, and their associations within given videos. The major challenge lies in effectively fusing heterogeneous multi-modal data to comprehend complex scenes while capturing question-related clues to infer correct answers. Current AVQA models primarily employ attention mechanisms to extract question-related clues separately from the visual and audio modalities before combining them. However, these approaches have two limitations: (1) they neglect the association and complementarity between the audio and visual modalities; (2) encoding the visual or audio stream holistically limits the capacity to capture cross-modal and cross-temporal dynamic events. In this paper, we introduce the Heterogeneous Interactive Graph Network, a novel solution designed to address these limitations. Specifically, we construct heterogeneous multi-modal graphs that enable unified integration of multiple modalities, including visual, audio, and question. This approach effectively explores the associations and complementarity among the modalities and investigates local temporal interactions between visual and audio, enabling the effective capture of cross-modal and cross-temporal dynamic events. Additionally, we present a cross-modal feature alignment module, which acts as a bridge over the semantic gap among heterogeneous multi-modal data. It encourages the multi-modal data distributions to converge in a shared feature space, facilitating more effective and efficient processing. Extensive experimental results demonstrate the superiority of our method over state-of-the-art models across various question types on the challenging MUSIC-AVQA and AVQA benchmarks.
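The abstract describes two components: a cross-modal feature alignment module that projects audio, visual, and question features into a shared space, and heterogeneous multi-modal graphs with local temporal interactions across modalities. The sketch below is only an illustration of these two ideas under assumed dimensions, a simple attention-based message-passing layer, and an assumed windowed graph topology; it is not the authors' implementation.

```python
# Illustrative sketch (assumptions: feature dimensions, a single attention-based
# graph layer, and a windowed cross-modal/cross-temporal adjacency). Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Projects audio, visual, and question features into a shared feature space."""

    def __init__(self, audio_dim, visual_dim, question_dim, shared_dim):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.visual_proj = nn.Linear(visual_dim, shared_dim)
        self.question_proj = nn.Linear(question_dim, shared_dim)

    def forward(self, audio, visual, question):
        # audio/visual: (batch, timesteps, dim); question: (batch, dim)
        a = F.normalize(self.audio_proj(audio), dim=-1)
        v = F.normalize(self.visual_proj(visual), dim=-1)
        q = F.normalize(self.question_proj(question), dim=-1)
        return a, v, q


class HeterogeneousGraphLayer(nn.Module):
    """One round of attention-weighted message passing over a graph whose nodes are
    per-timestep audio segments, per-timestep visual segments, and the question."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (batch, num_nodes, dim); adj: (num_nodes, num_nodes) 0/1 edge mask
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        scores = torch.matmul(q, k.transpose(-2, -1)) / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        return nodes + self.out(torch.matmul(attn, v))


def build_local_temporal_adj(t, window=1):
    """Connects audio/visual nodes within a small temporal window (cross-modal and
    cross-temporal edges) and links the question node to every other node."""
    n = 2 * t + 1  # t audio nodes, t visual nodes, 1 question node
    adj = torch.zeros(n, n)
    for i in range(t):
        for j in range(max(0, i - window), min(t, i + window + 1)):
            adj[i, j] = adj[j, i] = 1                  # audio-audio
            adj[t + i, t + j] = adj[t + j, t + i] = 1  # visual-visual
            adj[i, t + j] = adj[t + j, i] = 1          # audio-visual
    adj[-1, :] = adj[:, -1] = 1                        # question to all nodes
    adj.fill_diagonal_(1)
    return adj


if __name__ == "__main__":
    batch, t = 2, 10
    align = CrossModalAlignment(audio_dim=128, visual_dim=512, question_dim=768, shared_dim=256)
    layer = HeterogeneousGraphLayer(dim=256)
    a, v, q = align(torch.randn(batch, t, 128), torch.randn(batch, t, 512), torch.randn(batch, 768))
    nodes = torch.cat([a, v, q.unsqueeze(1)], dim=1)   # (batch, 2t+1, 256)
    out = layer(nodes, build_local_temporal_adj(t))
    print(out.shape)  # torch.Size([2, 21, 256])
```

The windowed adjacency restricts message passing to nearby audio and visual segments, which is one plausible way to realize the "local temporal interactions" the abstract mentions; the actual graph construction and layer design in the paper may differ.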
Keywords
Audio–visual question answering, Graph convolutional network, Feature alignment