U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
arXiv (2024)
Abstract
Scale has opened new frontiers in natural language processing, but at a high
cost. In response, Mixture-of-Experts (MoE) models, which learn to activate
only a subset of parameters during training and inference, have been proposed
as an energy-efficient path to even larger and more capable language models,
and this shift toward a new generation of foundation models is gaining
momentum, particularly within the field of Automatic Speech Recognition (ASR).
Recent works that incorporate MoE into ASR models rely on complex designs, such
as routing frames via a supplementary embedding network, improving the
multilingual ability of the experts, and utilizing dedicated auxiliary losses
for either expert load balancing or specific language handling. We find that
such delicate designs are not necessary; an embarrassingly simple substitution
of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the
ASR task. Specifically, we benchmark our proposed model on a large-scale
in-house dataset (160k hours). The results show that we can scale our baseline
Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve
Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real
Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with
bidirectional attention decoders (U2++), we obtain both streaming and
non-streaming decoding modes in a single MoE-based model, which we call U2++
MoE. We hope that our study can facilitate research on scaling speech
foundation models without sacrificing deployment efficiency.
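To illustrate the "simple substitution" the abstract describes, below is a minimal PyTorch sketch of a top-k routed MoE layer used as a drop-in replacement for a Conformer FFN. The class name, expert count, and routing details (MoEFeedForward, num_experts, top_k) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: replace a Conformer FFN with a top-k routed MoE layer.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Drop-in replacement for an FFN: a router picks top-k experts per frame."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); flatten so routing decisions are per frame.
        b, t, d = x.shape
        flat = x.reshape(-1, d)
        gate_logits = self.router(flat)                      # (b*t, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # top-k experts per frame
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(flat)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(flat[mask])
        return out.reshape(b, t, d)


# Usage: substitute every FFN in the encoder block with the MoE layer.
moe_ffn = MoEFeedForward(d_model=512, d_ff=2048, num_experts=8, top_k=2)
frames = torch.randn(4, 100, 512)   # (batch, time, d_model)
print(moe_ffn(frames).shape)        # torch.Size([4, 100, 512])
```

Because only top_k experts run per frame, total parameters grow with num_experts while per-frame compute stays close to a single dense FFN, which is how a larger parameter count can coexist with a roughly unchanged RTF.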