Disentangling Length from Quality in Direct Preference Optimization
arXiv (2024)
Abstract
Reinforcement Learning from Human Feedback (RLHF) has been a crucial
component in the recent success of Large Language Models. However, RLHF is known
to exploit biases in human preferences, such as verbosity. A well-formatted and
eloquent answer is often more highly rated by users, even when it is less
helpful and objective. A number of approaches have been developed to control
those biases in the classical RLHF literature, but the problem remains
relatively under-explored for Direct Alignment Algorithms such as Direct
Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a
separate reward model or use reinforcement learning directly, so previous
approaches developed to control verbosity cannot be directly applied to this
setting. Our work makes several contributions. For the first time, we study the
length problem in the DPO setting, showing significant exploitation in DPO and
linking it to out-of-distribution bootstrapping. We then develop a principled
but simple regularization strategy that prevents length exploitation, while
still maintaining improvements in model quality. We demonstrate these effects
across datasets on summarization and dialogue, where we achieve up to 20%
improvement in win rates when controlling for length, despite the GPT-4 judge's
well-known verbosity bias.
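The abstract describes the regularization only at a high level. One natural reading, consistent with penalizing verbosity inside DPO's implicit reward, is to subtract a scaled response-length difference from the DPO logits. The sketch below is a minimal PyTorch illustration under that assumption; the function name, the hyperparameter `alpha`, and the tensor interface are hypothetical, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    chosen_lengths: torch.Tensor,         # |y_w| in tokens
    rejected_lengths: torch.Tensor,       # |y_l| in tokens
    beta: float = 0.1,
    alpha: float = 0.01,                  # hypothetical length-penalty weight
) -> torch.Tensor:
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    # Length penalty: a longer chosen response must win by more than
    # whatever credit it gets purely for its extra tokens.
    length_penalty = alpha * (chosen_lengths - rejected_lengths).float()
    return -F.logsigmoid(margin - length_penalty).mean()
```

With `alpha = 0` this reduces to the standard DPO loss, which makes the length term easy to ablate when checking that quality gains survive the regularization.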