Using Phoneme Representations to Build Predictive Models Robust to ASR Errors

International ACM SIGIR Conference on Research and Development in Information Retrieval (2020)

Abstract
Even though Automatic Speech Recognition (ASR) systems have improved significantly over the last decade, they still introduce many errors when transcribing speech to text. One of the most common causes of these errors is phonetic confusion between similar-sounding expressions. As a result, ASR transcriptions often contain "quasi-oronyms", i.e., words or phrases that sound similar to the source ones but have completely different semantics (e.g., "win" instead of "when", or "accessible on defecting" instead of "accessible and affecting"). These errors significantly degrade the performance of downstream Natural Language Understanding (NLU) models (e.g., intent classification, slot filling) and impair the user experience. To make NLU models more robust to such errors, we propose novel phonetic-aware text representations. Specifically, we represent ASR transcriptions at the phoneme level, aiming to capture pronunciation similarities that are typically neglected in word-level representations (e.g., word embeddings). To train and evaluate our phoneme representations, we generate noisy ASR transcriptions of four existing datasets - Stanford Sentiment Treebank, SQuAD, TREC Question Classification, and Subjectivity Analysis - and show that common neural network architectures exploiting the proposed phoneme representations can effectively handle noisy transcriptions and significantly outperform state-of-the-art baselines. Finally, we confirm these results by testing our models on real utterances spoken to the Alexa virtual assistant.
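To make the phoneme-level idea concrete, the sketch below maps a transcription to an ARPAbet phoneme sequence using NLTK's CMU Pronouncing Dictionary. This is a minimal illustration, not the paper's actual pipeline: the dictionary lookup and the `to_phonemes` helper are stand-ins for whatever grapheme-to-phoneme step the authors used.

```python
# Minimal sketch (assumption: NLTK's CMU Pronouncing Dictionary as a
# stand-in grapheme-to-phoneme step; not the paper's exact method).
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)  # one-time download of the lexicon
PRON = cmudict.dict()                 # word -> list of pronunciations

def to_phonemes(text):
    """Return the text as a flat ARPAbet phoneme sequence.

    Out-of-vocabulary words are kept as-is; a real system would fall
    back to a trained grapheme-to-phoneme model instead.
    """
    phones = []
    for word in text.lower().split():
        if word in PRON:
            phones.extend(PRON[word][0])  # take the first pronunciation
        else:
            phones.append(word)
    return phones

# The quasi-oronyms from the abstract map to nearly identical sequences:
print(to_phonemes("win"))   # ['W', 'IH1', 'N']
print(to_phonemes("when"))  # ['W', 'EH1', 'N']
```

At the phoneme level, "win" and "when" differ in a single vowel, whereas word-level embeddings treat them as unrelated tokens; this pronunciation-level similarity is what the proposed representations let downstream NLU models exploit.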
Keywords
Phoneme Embeddings, Deep Learning, Natural Language Understanding, ASR Errors, Virtual Assistant