BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
arXiv (2024)
Abstract
This paper concerns the problem of aligning samples from large language
models to human preferences using best-of-n sampling, where we draw n
samples, rank them, and return the best one. We consider two fundamental
problems. First: what is the relationship between best-of-n and approaches to
alignment that train LLMs to output samples with a high expected reward (e.g.,
RLHF or DPO)? To answer this, we embed both the best-of-n distribution and
the sampling distributions learned by alignment procedures in a common class of
tiltings of the base LLM distribution. We then show that, within this class,
best-of-n is essentially optimal in terms of the trade-off between win rate
against the base model and KL distance from the base model. That is, best-of-n
is the best choice of alignment distribution if the goal is to maximize win
rate. However, best-of-n requires drawing n samples for each inference, a
substantial cost. To avoid this, the second problem we consider is how to
fine-tune an LLM to mimic the best-of-n sampling distribution. We derive
BoNBoN alignment to achieve this by exploiting the special structure of the
best-of-n distribution. Experiments show that BoNBoN alignment yields
substantial improvements in producing a model that is preferred to the base
policy while minimally affecting off-target aspects.
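Concretely, best-of-n is a thin wrapper around any sampler and reward model: draw n i.i.d. samples and keep the one the reward model ranks highest. (By a standard order-statistics fact, when reward ties have probability zero this tilts the base density p toward high-reward outputs, with density n * p(y) * F(y)^(n-1), where F is the CDF of the reward under p.) The sketch below is illustrative only, not the paper's implementation; the `generate` and `reward` callables are hypothetical stand-ins for a base LLM sampler and a scalar preference scorer.

```python
# Minimal sketch of best-of-n sampling. `generate` and `reward` are
# hypothetical stand-ins for a base LLM sampler and a scalar reward
# (preference) model; this is not the paper's code.

def best_of_n(prompt, generate, reward, n=8):
    """Draw n i.i.d. samples from the base model, score each with the
    reward model, and return the highest-scoring sample."""
    samples = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: reward(prompt, s))
```

Since every query costs n full generations, the BoNBoN procedure described above instead fine-tunes the model to imitate this distribution directly, amortizing the sampling cost at inference time.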