t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

Abstract

Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). t-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single- and multi-talker datasets through text-only adaptation.
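
The single-stream representation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes two virtual channels, per-token timestamps, a stream that starts on channel 0, and a plain-string spelling `<cc>` for the $\langle \text{cc}\rangle$ symbol. The function names `serialize_tsot` and `deserialize_tsot` are hypothetical.

```python
# Illustrative sketch only; names and the "<cc>" spelling are assumptions,
# not taken from the paper or any library.
CC = "<cc>"  # channel-change symbol


def serialize_tsot(channels):
    """Merge per-channel (token, time) lists into one token stream.

    Tokens are ordered by time, and CC is inserted whenever two
    adjacent tokens come from different virtual channels.
    """
    merged = sorted(
        ((time, token, ch)
         for ch, tokens in enumerate(channels)
         for token, time in tokens),
        key=lambda item: item[0],  # stable sort on time only
    )
    stream = []
    current = 0  # assumption: the stream starts on channel 0
    for _, token, ch in merged:
        if ch != current:
            stream.append(CC)
            current = ch
        stream.append(token)
    return stream


def deserialize_tsot(stream):
    """Split a two-channel stream back into per-channel token lists."""
    channels = [[], []]
    current = 0
    for token in stream:
        if token == CC:
            current = 1 - current  # toggle between the two channels
        else:
            channels[current].append(token)
    return channels


if __name__ == "__main__":
    speaker_a = [("hello", 0.5), ("how", 1.0), ("are", 1.6), ("you", 2.0)]
    speaker_b = [("i", 1.3), ("am", 1.8), ("fine", 2.4)]
    stream = serialize_tsot([speaker_a, speaker_b])
    print(" ".join(stream))
    print(deserialize_tsot(stream))
```

In overlapped regions the stream alternates between speakers, so consecutive tokens no longer form natural sentences. This is the "unnatural token order" that, according to the abstract, the LM in t-SOT FNT handles by keeping multiple hidden states and treating $\langle \text{cc}\rangle$ specially.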

Keywords

factorized neural transducer, multi-talker speech recognition, token-level serialized output training, text-only adaptation
