Joint Phoneme-Grapheme Model For End-To-End Speech Recognition

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

Abstract
This paper proposes methods to improve a commonly used end-to-end speech recognition model, Listen-Attend-Spell (LAS). The proposed methods use multi-task learning to improve the generalization of the model by leveraging information from multiple labels. The focus of this paper is on multi-task models for simultaneous signal-to-grapheme and signal-to-phoneme conversion with shared encoder parameters. Since phonemes are designed to be a precise description of the linguistic aspects of the speech signal, using phoneme recognition as an auxiliary task can help stabilize the early stages of training. In addition to conventional multi-task learning, we obtain further improvements by introducing a method that exploits dependencies between labels in different tasks; specifically, the dependencies between phoneme and grapheme sequences are considered. In conventional multi-task learning these sequences are assumed to be independent. Instead, this paper proposes a joint model based on "iterative refinement", in which dependency modeling is achieved by a multi-pass strategy. The proposed method is evaluated on a 28,000-hour corpus of Japanese speech data, and the performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.
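The following is a minimal PyTorch-style sketch of the shared-encoder multi-task idea summarized above, not the paper's actual architecture: a single encoder feeds a grapheme head and an auxiliary phoneme head, and the two losses are mixed with a weight lam. The class and function names, layer sizes, framewise cross-entropy losses, and the value of lam are illustrative assumptions; the paper uses attention-based LAS decoders and additionally an iterative-refinement multi-pass scheme that is not shown here.

```python
# Sketch only: shared encoder + two task heads with a weighted multi-task loss.
# All names, shapes, and hyperparameters below are hypothetical.
import torch
import torch.nn as nn

class JointPhonemeGraphemeModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_graphemes=100, n_phonemes=50):
        super().__init__()
        # Shared encoder over acoustic features (stand-in for the LAS listener).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Task-specific output heads (stand-ins for the two LAS decoders).
        self.grapheme_head = nn.Linear(2 * hidden, n_graphemes)
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, feats):
        enc, _ = self.encoder(feats)          # (B, T, 2*hidden)
        return self.grapheme_head(enc), self.phoneme_head(enc)

def multitask_loss(g_logits, p_logits, g_tgt, p_tgt, lam=0.3):
    # Weighted sum of the two task losses; the phoneme branch is the
    # auxiliary task intended to stabilize early training.
    ce = nn.CrossEntropyLoss()
    loss_g = ce(g_logits.flatten(0, 1), g_tgt.flatten())
    loss_p = ce(p_logits.flatten(0, 1), p_tgt.flatten())
    return (1 - lam) * loss_g + lam * loss_p

# Toy usage with random tensors.
model = JointPhonemeGraphemeModel()
feats = torch.randn(4, 120, 80)               # 4 utterances, 120 frames each
g_tgt = torch.randint(0, 100, (4, 120))
p_tgt = torch.randint(0, 50, (4, 120))
g_logits, p_logits = model(feats)
loss = multitask_loss(g_logits, p_logits, g_tgt, p_tgt)
loss.backward()
```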
Keywords
Automatic speech recognition, Listen-Attend-Spell, multi-task learning, iterative refinement