Mandarin Text-to-Speech Front-End With Lightweight Distilled Convolution Network

IEEE Signal Process. Lett. (2023)

Abstract
Mandarin text-to-speech (TTS) systems heavily depend on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model such as the bidirectional encoder representations from Transformers (BERT) can significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight constraints on memory footprint and inference latency. Although a distilled BERT alleviates this problem, a considerable efficiency barrier may still remain because of the quadratic complexity of the self-attention module and the heavy computation of the feed-forward module. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly use the same self-attention architecture for the teacher and student models, we transfer knowledge from a self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks show that our distilled convolution model achieves results comparable to those of various distilled BERT variants while drastically reducing the model size and inference latency.
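The abstract describes distilling a self-attention (BERT-like) teacher into a lightweight convolution student for token-level front-end tasks. Below is a minimal, hypothetical sketch of that setup for a task like prosodic boundary prediction; the model names, layer sizes, distillation temperature, and loss weighting are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch: knowledge distillation from a self-attention teacher
# into a lightweight 1-D convolution student for a token-level TTS front-end
# task (e.g., prosodic boundary prediction). All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionTeacher(nn.Module):
    """Stand-in for a (distilled) BERT teacher: embeddings + Transformer encoder."""
    def __init__(self, vocab=6000, dim=256, heads=4, layers=4, n_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, n_tags)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))  # (B, T, n_tags)

class ConvStudent(nn.Module):
    """Lightweight student: stacked 1-D convolutions replace self-attention."""
    def __init__(self, vocab=6000, dim=128, kernel=5, layers=3, n_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2) for _ in range(layers)
        )
        self.head = nn.Linear(dim, n_tags)

    def forward(self, tokens):
        x = self.embed(tokens).transpose(1, 2)        # (B, dim, T) for Conv1d
        for conv in self.convs:
            x = F.relu(conv(x))
        return self.head(x.transpose(1, 2))           # (B, T, n_tags)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term from the teacher plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1)
    )
    return alpha * soft + (1 - alpha) * hard

# Toy usage: one distillation step on random data.
teacher, student = AttentionTeacher(), ConvStudent()
teacher.eval()
tokens = torch.randint(0, 6000, (2, 20))   # batch of 2 sentences, 20 characters
labels = torch.randint(0, 4, (2, 20))      # token-level boundary tags
with torch.no_grad():
    t_logits = teacher(tokens)
loss = distillation_loss(student(tokens), t_logits, labels)
loss.backward()
```

The convolution student has no quadratic-in-length attention and no wide feed-forward blocks, which is what gives the reported reduction in model size and inference latency; the distillation objective sketched here is a generic soft-plus-hard-label formulation rather than the paper's exact loss.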
Keywords
Mandarin TTS front-end, knowledge distillation, BERT, lightweight, convolution network