Aligning Large Language Models by On-Policy Self-Judgment
CoRR (2024)
Abstract
To align large language models with human preferences, existing research
either utilizes a separate reward model (RM) to perform on-policy learning or
simplifies the training procedure by discarding both the on-policy learning
and the need for a separate RM. In this paper, we present a novel alignment
framework, SELF-JUDGE, that (1) performs on-policy learning and (2) is
parameter efficient, as it does not require an additional RM to evaluate
samples for on-policy learning. To this end, we propose Judge-augmented
Supervised Fine-Tuning (JSFT) to train a single model that acts as both a
policy and a judge. Specifically, we view the pairwise judgment task, choosing
the better response from a response pair, as a special case of the
instruction-following task. The resulting model can thus judge preferences
over on-the-fly responses from the current policy, which is initialized from
the judge itself. Experimental results show the efficacy of SELF-JUDGE, which
outperforms baselines on preference benchmarks. We also show that
self-rejection with oversampling can further improve performance without an
additional evaluator. Our code is available at
https://github.com/oddqueue/self-judge.