Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
CoRR(2024)
摘要
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted
approach for aligning large language models with human values. However, RLHF
relies on a reward model that is trained with a limited amount of human
preference data, which could lead to inaccurate predictions. As a result, RLHF
may produce outputs that are misaligned with human values. To mitigate this
issue, we contribute a reward ensemble method that allows the reward model to
make more accurate predictions. As using an ensemble of large language
model-based reward models can be computationally and resource-expensive, we
explore efficient ensemble methods including linear-layer ensemble and
LoRA-based ensemble. Empirically, we run Best-of-n and Proximal Policy
Optimization with our ensembled reward models, and verify that our ensemble
methods help improve the alignment performance of RLHF outputs.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要