Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation

Weitai Zhang, Hanyi Zhang, Chenxuan Liu, Zhongyi Ye, Xinyuan Zhou, Chao Lin, Lirong Dai

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
The end-to-end paradigm has recently attracted increasing interest for improving speech-to-text translation (ST). Existing end-to-end models mainly attempt to address the problems of modeling burden and data scarcity, yet often fail to maintain both the cross-modal and cross-lingual mappings well at the same time. In this work, we investigate methods for improving end-to-end ST with pre-trained acoustic-and-textual models. Our acoustic encoder and decoder begin by processing the source speech sequence as usual. A textual encoder and an adaptor module then obtain source acoustic and textual information respectively, alleviating the representation inconsistency through attentive interactions in the textual decoder. In addition, we utilize pre-trained models and develop an adaptation fine-tuning method to preserve the pre-training knowledge. Experimental results on the IWSLT2023 offline ST task from English to German, Japanese and Chinese show that our method achieves state-of-the-art BLEU scores and surpasses strong cascaded ST counterparts in the unrestricted setting.
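The data flow the abstract describes (acoustic encoder/decoder, then a textual encoder and an adaptor feeding a textual decoder) can be sketched as below. This is a minimal illustrative sketch of the wiring only: every function name and operation here is a hypothetical stand-in, not the paper's actual architecture or pre-trained components.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All components are toy stand-ins; the real model uses pre-trained
# acoustic-and-textual networks with attention, not these placeholders.
from typing import List

Vector = List[float]


def acoustic_encoder(speech: List[Vector]) -> List[Vector]:
    # Stand-in: produce frame-level acoustic representations (identity here).
    return speech


def acoustic_decoder(acoustic: List[Vector]) -> List[str]:
    # Stand-in: decode a source-language transcription from acoustic features.
    return [f"src{i}" for i in range(len(acoustic))]


def textual_encoder(tokens: List[str]) -> List[Vector]:
    # Stand-in: encode the decoded source text into textual representations.
    return [[float(len(t))] for t in tokens]


def adaptor(acoustic: List[Vector]) -> List[Vector]:
    # Stand-in: map acoustic representations toward the textual space.
    return [[sum(v) / len(v)] for v in acoustic]


def textual_decoder(textual: List[Vector], adapted: List[Vector]) -> List[str]:
    # Stand-in: consume both streams (the paper uses attentive interactions
    # here to alleviate the acoustic/textual representation inconsistency).
    fused = [[t[0] + a[0]] for t, a in zip(textual, adapted)]
    return [f"tgt{i}" for i in range(len(fused))]


def translate(speech: List[Vector]) -> List[str]:
    # End-to-end pass: speech -> acoustic features -> source text +
    # adapted acoustic features -> target-language tokens.
    acoustic = acoustic_encoder(speech)
    tokens = acoustic_decoder(acoustic)
    textual = textual_encoder(tokens)
    adapted = adaptor(acoustic)
    return textual_decoder(textual, adapted)
```

For example, `translate([[0.1, 0.2], [0.3, 0.4]])` runs two speech frames through the full chain and yields two placeholder target tokens, showing how the textual decoder receives both the textual-encoder and adaptor streams.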
Keywords
end-to-end, pre-training, speech-to-text translation, cross-modal, cross-lingual