CRIM's Speech Recognition System for OpenASR21 Evaluation with Conformer and Voice Activity Detector Embeddings.

SPECOM (2022)

Abstract
CRIM participated in all 15 low-resource languages and the three languages with case-sensitive scoring in OpenASR21 under the constrained condition. For acoustic modeling, we developed both hybrid DNN-HMM systems and a conformer-based system. We trained three different multi-stream acoustic models for decoding: one with MFCC + i-vector features, one combining MFCC, i-vector, and conformer embeddings, and one combining MFCC, i-vector, and VAD (voice activity detector) embeddings. For the final submission, we used two different VADs for segmenting the evaluation audio: one GMM-HMM based and one TDNN based. For language model text, we used the training text from LDC corpora when available. We also found a significant amount of text on the internet. In the past, using this downloaded text for language modeling had significantly increased the word error rate on development sets containing conversational speech, so we applied sentence selection to filter the text and use it effectively to reduce word error rates (WER). For most languages, this strongly filtered text reduced the WER. Our best results combined six decodes: two different VAD-based segmentations and three different acoustic models. In the final evaluation, we ranked second in Tamil, third in Farsi and Javanese, and fourth in seven other languages. Since then, we have reduced the WER significantly for all languages. The major contributing factors for this additional WER reduction were the intelligent use of MUSAN noise for data augmentation and further tuning of the acoustic models.
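The abstract does not describe how the multi-stream inputs are assembled; the sketch below is only one plausible illustration, assuming per-frame MFCCs are concatenated with an utterance-level i-vector (tiled across frames) and frame-aligned auxiliary embeddings (conformer or VAD). All function names and dimensions are hypothetical.

```python
import numpy as np

def combine_streams(mfcc, ivector, embedding):
    """Frame-wise concatenation of MFCCs, an utterance-level i-vector,
    and auxiliary embeddings (e.g. conformer or VAD embeddings).

    mfcc:      (num_frames, mfcc_dim) acoustic features
    ivector:   (ivector_dim,) utterance-level vector, tiled per frame
    embedding: (num_frames, emb_dim) embeddings aligned to the frames
    """
    num_frames = mfcc.shape[0]
    # Repeat the utterance-level i-vector for every frame.
    ivec_tiled = np.tile(ivector, (num_frames, 1))
    # The concatenated matrix is the multi-stream input to the acoustic model.
    return np.concatenate([mfcc, ivec_tiled, embedding], axis=1)

# Hypothetical dimensions, for illustration only.
mfcc = np.random.randn(300, 40)        # 300 frames of 40-dim MFCCs
ivector = np.random.randn(100)         # 100-dim i-vector
conf_emb = np.random.randn(300, 256)   # 256-dim per-frame conformer embeddings
features = combine_streams(mfcc, ivector, conf_emb)
print(features.shape)                  # (300, 396)
```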
Keywords
OpenASR21, Low resource, Speech recognition, Conformer embedding, Voice activity detector embedding