Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models
arXiv (2023)
Abstract
Text generation models are notoriously vulnerable to errors in the training
data. With massive amounts of web-crawled data becoming more widely
available, how can we enhance the robustness of models trained on large
quantities of noisy web-crawled text? In our work, we propose Error Norm
Truncation (ENT), a robust enhancement to the standard training objective
that truncates noisy data. Compared to methods that use only the negative
log-likelihood loss to estimate data quality, our method provides a more
accurate estimation by considering the distribution of non-target tokens,
which is often overlooked by previous work. Through comprehensive experiments
across language modeling, machine translation, and text summarization, we show
that equipping text generation models with ENT improves generation quality over
standard training and previous soft and hard truncation methods. Furthermore,
we show that our method improves the robustness of models against two of the
most detrimental types of noise in machine translation, yielding an increase
of more than 2 BLEU points over the MLE baseline when up to 50% noise is
added to the data.
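The abstract's key idea, estimating data quality from the full predicted distribution rather than from the target token's negative log-likelihood alone, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: it scores each token by the l2 distance between the model's distribution and the one-hot target (which accounts for probability mass on non-target tokens) and drops tokens whose score exceeds a threshold. The `threshold` value and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def error_norm_truncated_loss(logits, targets, threshold=1.0):
    """Sketch of an ENT-style truncated loss.

    logits:  (num_tokens, vocab_size) model outputs
    targets: (num_tokens,) gold token indices
    threshold: illustrative cutoff on the per-token error norm
    """
    probs = softmax(logits)
    one_hot = np.eye(probs.shape[-1])[targets]
    # l2 error norm per token: unlike NLL, this also reflects how the
    # probability mass is spread over non-target tokens
    err_norm = np.linalg.norm(probs - one_hot, axis=-1)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    # truncate (mask out) tokens whose error norm is too large
    mask = err_norm <= threshold
    return (nll * mask).sum() / max(mask.sum(), 1)
```

A confidently correct token has an error norm near 0, while a confidently wrong token approaches the maximum of sqrt(2), so a threshold between these separates clean from likely-noisy tokens.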