Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
arxiv(2024)
摘要
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm
used to align large language models (LLMs) with human preferences. Yet existing
RLHF heavily relies on accurate and informative reward models, which are
vulnerable and sensitive to noise from various sources, e.g. human labeling
errors, making the pipeline fragile. In this work, we improve the effectiveness
of the reward model by introducing a penalty term on the reward, named as
contrastive rewards.
steps: (1) an offline sampling step to obtain responses to prompts that serve
as baseline calculation and (2) a contrastive reward calculated using the
baseline responses and used in the Proximal Policy Optimization (PPO) step. We
show that contrastive rewards enable the LLM to penalize reward uncertainty,
improve robustness, encourage improvement over baselines, calibrate according
to task difficulty, and reduce variance in PPO. We show empirically contrastive
rewards can improve RLHF substantially, evaluated by both GPTs and humans, and
our method consistently outperforms strong baselines.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要