Investigating the Emergent Audio Classification Ability of ASR Foundation Models
arXiv (2023)
Abstract
Text and vision foundation models can perform many tasks in a zero-shot
setting, a desirable property that enables these systems to be applied in
general and low-resource settings. There has been far less work, however, on
the zero-shot abilities of ASR foundation models, with these systems typically
fine-tuned to specific tasks or constrained to applications that match their
training criterion and data annotation. In this work we investigate the ability
of Whisper and MMS, ASR foundation models trained primarily for speech
recognition, to perform zero-shot audio classification. We use simple
template-based text prompts at the decoder and use the resulting decoding
probabilities to generate zero-shot predictions. Without training the model on
extra data or adding any new parameters, we demonstrate that Whisper shows
promising zero-shot classification performance on a range of 8
audio-classification datasets, outperforming the accuracy of existing
state-of-the-art zero-shot baselines by an average of 9%. One important step to
unlock the emergent ability is debiasing, where a simple unsupervised
reweighting method of the class probabilities yields consistent significant
performance gains. We further show that performance increases with model size,
implying that as ASR foundation models scale up, they may exhibit improved
zero-shot performance.
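
To make the debiasing idea concrete, the following is a minimal sketch of one possible unsupervised reweighting of class probabilities. It is an illustration, not the authors' implementation. It assumes that each audio clip has already been scored by the ASR decoder, giving one log-probability per class prompt (for example, a hypothetical template such as "This is a sound of {class}."). It then fits one additive log-offset per class so that the average predicted class distribution over an unlabeled batch is uniform. The function names `softmax`, `fit_offsets`, and `classify`, the uniform-prior target, and the toy scores are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of log-scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_offsets(batch_log_probs, steps=200):
    """Fit per-class log-offsets on UNLABELED data.

    Assumption for this sketch: the true class prior is uniform, so the
    offsets are adjusted until the mean posterior over the batch is
    (approximately) uniform. No labels and no new model parameters are used.
    """
    num_classes = len(batch_log_probs[0])
    offsets = [0.0] * num_classes
    target = 1.0 / num_classes
    n = len(batch_log_probs)
    for _ in range(steps):
        mean = [0.0] * num_classes
        for row in batch_log_probs:
            post = softmax([s + o for s, o in zip(row, offsets)])
            for j, p in enumerate(post):
                mean[j] += p / n
        # Push down classes that are over-predicted on average, and vice versa.
        offsets = [o - math.log(max(m, 1e-12) / target)
                   for o, m in zip(offsets, mean)]
    return offsets


def classify(log_probs, offsets=None):
    """Return the index of the highest-scoring class, optionally debiased."""
    if offsets is None:
        offsets = [0.0] * len(log_probs)
    adjusted = [s + o for s, o in zip(log_probs, offsets)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy decoder log-probabilities for 4 clips and 2 class prompts (made-up values).
# The decoder is biased: class 0 always scores higher before debiasing.
toy_scores = [[-1.0, -4.0], [-1.0, -3.5], [-1.0, -1.5], [-1.0, -1.2]]
toy_offsets = fit_offsets(toy_scores)
raw_predictions = [classify(row) for row in toy_scores]
debiased_predictions = [classify(row, toy_offsets) for row in toy_scores]
```

On these toy scores the raw predictions all fall on class 0, while the debiased predictions split between the two classes. The uniform-prior target is a simplifying assumption; with strongly imbalanced data a different target distribution would be more appropriate.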