On-the-fly ASR Corrections with Audio Exemplars

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
On-device end-to-end (E2E) models are required to handle long-tail vocabulary and a large number of acoustic conditions. With a finite amount of training data, some of these conditions and vocabulary words are unseen during training, which often leads to recognition errors. Text-based contextual biasing is intended to mitigate this problem, yet it works well only when sufficient textual context is provided and when the speech signal is well modeled by the ASR system. In this work, we propose to extend biasing to operate directly in the audio domain. We address a scenario where audio samples and their associated transcriptions are available, as is the case with manually corrected voice typing. We propose to directly compare incoming audio embeddings against a list of Audio Exemplars (AEs), each associated with a text correction. We demonstrate the effectiveness of our approach by correcting the outputs of a production-quality RNNT model, which yields relative WER reductions of 21.7% (one-shot) and 33.7% (multi-shot) on the Wiki-Names data set.
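The core matching step described in the abstract can be sketched as a nearest-neighbor lookup over exemplar embeddings. The sketch below is a minimal illustration, not the paper's implementation: the class name, the cosine-similarity metric, and the acceptance threshold are all assumptions for the example.

```python
import numpy as np

class AudioExemplarCorrector:
    """Hypothetical sketch: match an incoming audio embedding against
    stored Audio Exemplars, each paired with a text correction."""

    def __init__(self, threshold=0.8):
        # Similarity cutoff for accepting a correction (assumed value).
        self.threshold = threshold
        # List of (unit-normalized embedding, correction text) pairs.
        self.exemplars = []

    def add_exemplar(self, embedding, correction):
        e = np.asarray(embedding, dtype=float)
        self.exemplars.append((e / np.linalg.norm(e), correction))

    def correct(self, embedding, hypothesis):
        """Return the best exemplar's correction if the incoming audio
        embedding is close enough; otherwise keep the ASR hypothesis."""
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        best_sim, best_corr = -1.0, None
        for e, corr in self.exemplars:
            sim = float(np.dot(q, e))  # cosine similarity of unit vectors
            if sim > best_sim:
                best_sim, best_corr = sim, corr
        return best_corr if best_sim >= self.threshold else hypothesis
```

In the one-shot setting of the paper, each correction would be backed by a single exemplar; the multi-shot setting corresponds to storing several exemplars per correction, which this sketch supports by simply adding more entries.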
Keywords
audio exemplars,corrections,on-the-fly