UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
arXiv (2024)
Abstract
Video localization tasks aim to temporally locate specific instances in
videos, including temporal action localization (TAL), sound event detection
(SED) and audio-visual event localization (AVEL). Existing methods
over-specialize on each task, overlooking the fact that these instances often
occur in the same video to form the complete video content. In this work, we
present UniAV, a Unified Audio-Visual perception network, to achieve joint
learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage
diverse data available in task-specific datasets, allowing the model to learn
and share mutually beneficial knowledge across tasks and modalities. To tackle
the challenges posed by substantial variations in datasets
(size/domain/duration) and distinct task characteristics, we propose to
uniformly encode visual and audio modalities of all videos to derive generic
representations, while also designing task-specific experts to capture unique
knowledge for each task. In addition, we develop a unified language-aware
classifier by utilizing a pre-trained text encoder, enabling the model to
flexibly detect various types of instances and previously unseen ones by simply
changing prompts during inference. UniAV outperforms its single-task
counterparts by a large margin with fewer parameters, achieving on-par or
superior performance compared to state-of-the-art task-specific methods across
the ActivityNet 1.3, DESED, and UnAV-100 benchmarks.
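
The abstract describes task-specific experts layered on top of shared, uniformly encoded audio-visual representations. Below is a minimal sketch of that idea in PyTorch; the module structure, feature dimensions, and expert design are illustrative assumptions, not the authors' implementation.

```python
# Sketch: per-task experts refining shared audio-visual features.
# All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class TaskExpert(nn.Module):
    """Small residual feed-forward expert applied on top of the
    shared audio-visual features to capture task-specific knowledge."""
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)  # residual: experts refine, not replace

# One expert per task (TAL, SED, AVEL); the backbone features are shared.
experts = nn.ModuleDict({t: TaskExpert() for t in ("tal", "sed", "avel")})
shared = torch.randn(2, 128, 512)    # (batch, time, dim) shared features
tal_feats = experts["tal"](shared)   # task-specific refinement for TAL
```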
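The abstract also mentions a unified language-aware classifier built on a pre-trained text encoder, so that new instance types can be detected at inference simply by changing the prompts. The sketch below shows one plausible form of such a classifier, scoring temporal features against prompt embeddings by scaled cosine similarity; the random prompt embeddings stand in for a real text encoder and everything here is an assumption for illustration.

```python
# Sketch: prompt-based classification via scaled cosine similarity.
# Random tensors stand in for real visual/audio features and for the
# output of a pre-trained text encoder (an assumption, not the paper's code).
import torch
import torch.nn.functional as F

def language_aware_logits(feats: torch.Tensor,
                          prompt_emb: torch.Tensor,
                          scale: float = 10.0) -> torch.Tensor:
    """feats: (B, T, D) temporal features; prompt_emb: (C, D) class
    prompt embeddings. Returns (B, T, C) per-snippet class logits."""
    feats = F.normalize(feats, dim=-1)
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    return scale * feats @ prompt_emb.t()

# Stand-in for embeddings of prompts like "a video of {class name}";
# swapping in new prompts changes which classes are detected.
prompts = torch.randn(20, 512)                     # 20 candidate classes
logits = language_aware_logits(torch.randn(2, 128, 512), prompts)
print(logits.shape)                                # torch.Size([2, 128, 20])
```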