Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
CoRR (2024)
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a central
tool for language model alignment. We consider online exploration in RLHF,
which exploits interactive access to human or AI feedback by deliberately
encouraging the model to produce diverse, maximally informative responses. By
allowing RLHF to confidently stray from the pre-trained model, online
exploration offers the possibility of novel, potentially super-human
capabilities, but its full potential as a paradigm for language model training
has yet to be realized, owing to computational and statistical bottlenecks in
directly adapting existing reinforcement learning techniques. We propose a new
algorithm for online exploration in RLHF, Exploratory Preference Optimization
(XPO), which is simple and practical – a one-line change to (online) Direct
Preference Optimization (DPO; Rafailov et al., 2023) – yet enjoys the
strongest known provable guarantees and promising empirical performance. XPO
augments the DPO objective with a novel and principled exploration bonus,
empowering the algorithm to explore outside the support of the initial model
and human feedback data. In theory, we show that XPO is provably
sample-efficient and converges to a near-optimal language model policy under
natural exploration conditions, irrespective of whether the initial model has
good coverage. Our analysis, which builds on the observation that DPO
implicitly performs a form of Q^⋆-approximation (or, Bellman error
minimization), combines previously disparate techniques from language modeling
and theoretical reinforcement learning in a serendipitous fashion through the
perspective of KL-regularized Markov decision processes. Empirically, we find
that XPO is more sample-efficient than non-exploratory DPO variants in a
preliminary evaluation.
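To make the "one-line change" concrete, the following is a minimal sketch of a per-example loss: the standard DPO objective plus an added exploration term on a freshly sampled response. The helper names, the `alpha` weighting, and the sign convention of the bonus are illustrative assumptions, not the paper's exact formulation; only the DPO term follows the standard definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen (w) and rejected (l)
    responses under the current policy and the reference (initial) model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def xpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             logp_fresh, beta=0.1, alpha=0.01):
    """DPO loss plus an exploration bonus: the 'one-line change'.

    logp_fresh is the policy's log-probability of a response it sampled
    online. Adding alpha * logp_fresh to the minimized objective is one
    way to push probability mass away from what the current policy
    already produces (the exact form and sign in XPO are an assumption
    here; consult the paper for the precise objective).
    """
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return base + alpha * logp_fresh
```

With `alpha = 0`, `xpo_loss` reduces exactly to `dpo_loss`, which is why the modification is a one-line change to an online DPO training loop.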