A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

IEEE/ACM Trans. Audio Speech Lang. Process. (2023)

Abstract
Though speech enhancement (SE) can improve speech quality in noisy environments, it may also introduce distortions that degrade the performance of automatic speech recognition (ASR) models. Self-supervised pre-training, on the other hand, has been shown to improve the noise robustness of ASR models. However, how to optimally integrate SE and self-supervised pre-training remains unclear. In this paper, we propose a novel self-supervised pre-training framework that incorporates SE to improve ASR performance in noisy environments. First, in the pre-training phase, either the original noisy waveform or the waveform produced by SE is fed into the self-supervised model to learn contextual representations, with the quantized clean speech serving as the target. Second, we propose a dual-attention fusion method that fuses the features of the noisy and enhanced speech, which compensates for the information loss incurred when either module is used in isolation. Owing to its flexible use of the clean/noisy/enhanced branches, the proposed method generalizes several existing noise-robust ASR models, e.g., enhanced wav2vec2.0. Finally, experimental results on both synthetic and real noisy datasets show that the proposed joint training approach improves ASR performance under various noise conditions, yielding stronger noise robustness.
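The abstract does not give the exact form of the dual-attention fusion, so the following is only an illustrative sketch of the general idea: each branch (noisy and enhanced) is scored per time step, the scores are normalized across branches, and the two feature streams are combined as a convex mixture. All names (`dual_attention_fusion`, the projection weights `w_n`, `w_e`) are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_fusion(noisy, enhanced, w_n, w_e):
    """Fuse noisy and enhanced feature sequences (T, D) with
    per-time-step attention weights over the two branches.
    w_n, w_e are (D, 1) scoring projections (hypothetical)."""
    s_n = noisy @ w_n                                   # (T, 1) branch scores
    s_e = enhanced @ w_e                                # (T, 1)
    alpha = softmax(np.concatenate([s_n, s_e], -1), -1)  # (T, 2), rows sum to 1
    fused = alpha[:, :1] * noisy + alpha[:, 1:] * enhanced
    return fused, alpha
```

Because the weights sum to one at every time step, the fused representation can fall back to the noisy branch wherever SE distorts the signal, and to the enhanced branch wherever it helps, which is consistent with the compensation role described above.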
Keywords
Noise measurement, speech enhancement, noise robustness, training, feature extraction, speech recognition, data models, wav2vec2.0, self-supervised pre-training