Word Order does not Matter for Speech Recognition.
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)(2022)
摘要
In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using Connectionist Temporal Classification loss. Our system achieves 2.4%/5.3% on test-clean/test-other subsets of LibriSpeech, which is competitive with the supervised baseline's performance.
更多查看译文
关键词
word order,speech recognition
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要