Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

CoRR (2023)

Abstract
Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks, e.g., in a multilingual setting, languages based on their resource levels might require different capacities. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on two multilingual machine translation benchmarks, where it outperforms multiple state-of-the-art MoE models. On a diverse 15-language dataset, SMoE improves the translation quality over vanilla MoE by +0.93 BLEU points on average. Additionally, SMoE is parameter-efficient, matching vanilla MoE performance with around 50% fewer parameters.
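
The stratified routing idea described above can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch toy, not the paper's actual SMoE implementation: experts are arranged in strata, and a per-stratum router lets each token either exit after the current stratum (low capacity) or continue into deeper strata (more capacity). All class, parameter, and method names here are assumptions for illustration only.

import torch
import torch.nn as nn

class StratifiedMoESketch(nn.Module):
    """Toy stratified MoE layer: tokens routed into deeper strata receive more capacity."""

    def __init__(self, d_model=64, num_strata=3, experts_per_stratum=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.strata = nn.ModuleList(
            nn.ModuleList(make_expert() for _ in range(experts_per_stratum))
            for _ in range(num_strata))
        # One router per stratum: a logit per expert plus one extra "exit" logit.
        self.routers = nn.ModuleList(
            nn.Linear(d_model, experts_per_stratum + 1) for _ in range(num_strata))

    def forward(self, x):
        # x: (num_tokens, d_model). Every token starts active at the first stratum.
        out = x.clone()
        active = torch.ones(x.size(0), dtype=torch.bool, device=x.device)
        for experts, router in zip(self.strata, self.routers):
            if not active.any():
                break
            h = out[active]
            choice = router(h).argmax(dim=-1)             # top-1 routing for simplicity
            new_h = h.clone()
            for e, expert in enumerate(experts):
                sel = choice == e
                if sel.any():
                    new_h[sel] = h[sel] + expert(h[sel])  # residual expert update
            out[active] = new_h
            # Tokens that chose the "exit" slot stop here (low capacity);
            # the rest continue into the next stratum (more capacity).
            idx = active.nonzero(as_tuple=True)[0]
            active = torch.zeros_like(active)
            active[idx[choice != len(experts)]] = True
        return out

tokens = torch.randn(10, 64)
layer = StratifiedMoESketch()
print(layer(tokens).shape)  # torch.Size([10, 64])

In this sketch, the per-token compute grows with the number of strata a token traverses, which is one plausible way to realize "dynamic capacity"; the paper's actual routing and training objectives may differ.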
Keywords
transformer, stratified, parameter-efficient