BAT: Learning to Reason about Spatial Sounds with Large Language Models
CoRR (2024)
Abstract
Spatial sound reasoning is a fundamental human skill, enabling us to navigate
and interpret our surroundings based on sound. In this paper, we present BAT,
which combines the spatial sound perception ability of a binaural acoustic
scene analysis model with the natural language reasoning capabilities of a
large language model (LLM) to replicate this innate ability. To address the
lack of existing datasets of in-the-wild spatial sounds, we synthesized a
binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed
SpatialSoundQA, a spatial sound-based question-answering dataset, offering a
range of QA tasks that train BAT in various aspects of spatial sound perception
and reasoning. BAT's acoustic front end is a novel spatial audio encoder, the
Spatial Audio Spectrogram Transformer (Spatial-AST), which by itself achieves
strong performance across sound event detection, spatial localization, and
distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT
transcends standard Sound Event Localization and Detection (SELD) tasks,
tasks, enabling the model to reason about the relationships between the sounds
in its environment. Our experiments demonstrate BAT's superior performance on
both spatial sound perception and reasoning, showcasing the immense potential
of LLMs in navigating and interpreting complex spatial audio environments.
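The abstract describes coupling a spatial audio encoder with an LLM. A common pattern for such integrations is to project the encoder's audio features into the LLM's token-embedding space and prefix them to the text tokens. The sketch below illustrates this pattern only; the dimensions, the stand-in encoder, and the helper names (`encode_spatial_audio`, `build_llm_input`, `W_proj`) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper): the audio
# encoder emits D_AUDIO-dim frame features; the LLM expects D_LLM-dim
# token embeddings.
D_AUDIO, D_LLM = 768, 4096

def encode_spatial_audio(binaural_waveform: np.ndarray, n_frames: int = 8) -> np.ndarray:
    """Stand-in for a spatial audio encoder: map a (2, T) binaural
    signal to (n_frames, D_AUDIO) frame embeddings. A real encoder
    would compute spectrograms and run a transformer; here we just
    return placeholder features of the right shape."""
    assert binaural_waveform.shape[0] == 2, "expect two-channel (binaural) input"
    return rng.standard_normal((n_frames, D_AUDIO))

# A learned linear projection bridges the encoder and LLM embedding spaces.
W_proj = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02

def build_llm_input(binaural: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prefix projected audio frames to the embedded text tokens,
    yielding one sequence the LLM can attend over."""
    audio_tokens = encode_spatial_audio(binaural) @ W_proj   # (n_frames, D_LLM)
    return np.concatenate([audio_tokens, text_embeds], axis=0)

binaural = rng.standard_normal((2, 16000))    # 1 s of two-channel audio
question = rng.standard_normal((12, D_LLM))   # 12 embedded question tokens
seq = build_llm_input(binaural, question)
print(seq.shape)                              # (20, 4096): 8 audio + 12 text tokens
```

In this scheme only the projection (and optionally the encoder) needs training against a frozen or fine-tuned LLM, which is why the question-answering dataset described above can supervise the whole pipeline end to end.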