Using Noisy Word-Level Labels to Train a Phoneme Recognizer based on Neural Networks by Expectation Maximization.

ICCPR (2019)

Abstract
The Connectionist Temporal Classification (CTC) technique can be used to train a neural-network-based speech recognizer. When the technique is used to train a phoneme recognizer, the training data must be annotated with phoneme-level labels. This is not feasible for large speech databases. One approach to make use of such speech data is to convert the word-level transcriptions into phoneme-level labels, followed by CTC training. The problem with this approach is that the converted phoneme-level labels may mismatch the audio content of the speech data. This paper uses a probabilistic model to describe the probability of observing the noisy phoneme-level labels given an utterance. The model consists of a neural network which predicts the probability of any phoneme sequence, and a so-called mismatch model which describes the probability of corrupting one phoneme sequence into another. Based on the Expectation-Maximization (EM) framework, we propose a training algorithm which simultaneously learns the parameters of the neural network and the mismatch model. The effectiveness of our method is verified by comparing its recognition performance with that of a conventional training method on the TIMIT corpus.
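To make the abstract's two-component model concrete: it factors the probability of the noisy labels y given an utterance x as a sum over hidden phoneme sequences s, roughly P(y | x) = Σ_s P_θ(s | x) · P_φ(y | s), where θ parameterizes the neural network and φ the mismatch model, and EM alternates between inferring s and updating (θ, φ). The sketch below is only an illustration of that EM pattern, not the paper's method: it replaces the sequence-level CTC model with a per-frame linear-softmax classifier, assumes the noisy label at each frame is drawn from the true phoneme through a confusion (mismatch) matrix, and runs on synthetic data. All names, shapes, and hyperparameters are hypothetical.

```python
# Minimal per-frame EM sketch for learning from noisy phoneme labels.
# Simplifications vs. the paper: frame-synchronous labels instead of CTC
# sequences, a linear-softmax "network", and synthetic data. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

num_phones = 5      # phoneme inventory size (hypothetical)
feat_dim   = 8      # acoustic feature dimension (hypothetical)
num_frames = 2000
lr         = 0.5
em_iters   = 20

# --- Synthetic data: hidden true phonemes Z, features X, noisy labels Y ---
Z = rng.integers(num_phones, size=num_frames)            # true phonemes (hidden)
means = rng.normal(scale=2.0, size=(num_phones, feat_dim))
X = means[Z] + rng.normal(size=(num_frames, feat_dim))   # acoustic features
corrupt = rng.random(num_frames) < 0.3                   # 30% of labels corrupted
Y = np.where(corrupt, rng.integers(num_phones, size=num_frames), Z)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# --- Parameters: linear-softmax "network" W and mismatch matrix C[z, y] = p(y|z) ---
W = np.zeros((feat_dim, num_phones))
C = np.full((num_phones, num_phones), 0.3 / (num_phones - 1))
np.fill_diagonal(C, 0.7)   # prior belief: labels are mostly correct (breaks symmetry)

for it in range(em_iters):
    # E-step: posterior over the hidden phoneme at each frame,
    # p(z_t | x_t, y_t) ∝ p_net(z_t | x_t) * C[z_t, y_t]
    p_net = softmax(X @ W)                    # (T, P)
    joint = p_net * C[:, Y].T                 # (T, P)
    post = joint / joint.sum(axis=1, keepdims=True)

    # M-step (mismatch model): expected counts of (true phoneme z, observed label y)
    counts = np.zeros_like(C)
    for y in range(num_phones):
        counts[:, y] = post[Y == y].sum(axis=0)
    C = (counts + 1e-6) / (counts + 1e-6).sum(axis=1, keepdims=True)

    # M-step (network): one gradient-ascent step on the expected log-likelihood,
    # i.e. cross-entropy with the E-step posterior as soft targets
    grad = X.T @ (post - p_net) / num_frames
    W += lr * grad

    acc = (p_net.argmax(axis=1) == Z).mean()
    print(f"iter {it:2d}  frame accuracy vs. hidden truth: {acc:.3f}")
```

In this toy setting the classifier's accuracy against the hidden true phonemes typically rises above the 70% label-correctness rate, because the learned confusion matrix down-weights frames whose noisy label disagrees with the network's prediction. The paper's actual algorithm performs the analogous E-step over phoneme sequences under a CTC-style model rather than per frame.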
Keywords
phoneme recognizer, neural networks, expectation maximization, word-level