Layer-wise Regularized Dropout for Neural Language Models
CoRR(2024)
摘要
Among the various pre-trained neural language models that are popular today,
dropout is already an indispensable regularization technique. To solve the
inconsistency between training and inference caused by the randomness of
dropout, some studies use consistency training to regularize dropout at the
output layer. In this paper, we propose a novel Layer-wise Regularized Dropout
(LR-Drop), which is specially designed for Transformer-based Language models.
Specifically, LR-Drop layer-wise regularizes each Transformer layer using the
consistency training strategy. Each training sample passes through the two
siamese sub-models sampled by dropout, and then LR-Drop forces the hidden
states, multi-head attention matrices, and output distribution of the two
siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a
"self-distillation" framework, in which each sub-model generated by dropout is
the other's "teacher" model and "student" model. Through extensive experiments
on 8 natural language understanding datasets, 6 neural machine translation
datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we
show that LR-Drop achieves superior performances, including state-of-the-art
results.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要