RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
CoRR (2024)
Abstract
State-of-the-art large language models (LLMs) have become indispensable tools
for various tasks. However, training LLMs to serve as effective assistants for
humans requires careful consideration. A promising approach is reinforcement
learning from human feedback (RLHF), which leverages human feedback to update
the model in accordance with human preferences and mitigate issues like
toxicity and hallucinations. Yet the understanding of RLHF for LLMs remains
largely entangled with the initial design choices that popularized the method,
and current research focuses on augmenting those choices rather than
fundamentally improving the framework. In this paper, we analyze RLHF through the lens of
reinforcement learning principles to develop an understanding of its
fundamentals, dedicating substantial focus to the core component of RLHF – the
reward model. Our study investigates modeling choices, caveats of function
approximation, and their implications for RLHF training algorithms, highlighting
the underlying assumptions made about the expressivity of reward. Our analysis
improves the understanding of the role of reward models and methods for their
training, concurrently revealing limitations of the current methodology. We
characterize these limitations, including incorrect generalization, model
misspecification, and the sparsity of feedback, along with their impact on the
performance of a language model. The discussion and analysis are substantiated
by a categorical review of current literature, serving as a reference for
researchers and practitioners to understand the challenges of RLHF and build
upon existing efforts.