Differentially Private Reward Estimation with Preference Feedback
arXiv (Cornell University), 2023
Abstract
Learning from preference-based feedback has recently gained considerable
traction as a promising approach to align generative models with human
interests. Instead of relying on numerical rewards, generative models are
trained via reinforcement learning from human feedback (RLHF). These
approaches first solicit feedback from human labelers, typically in the form of
pairwise comparisons between two possible actions, then estimate a reward model
from these comparisons, and finally optimize a policy based on the estimated
reward model. An adversarial attack at any step of this pipeline might
reveal private and sensitive information about the human labelers. In this work, we
adopt the notion of label differential privacy (DP) and focus on the problem of
reward estimation from preference-based feedback while protecting the privacy of
each individual labeler. Specifically, we consider the parametric
Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback involving
a latent reward parameter $\theta^* \in \mathbb{R}^d$. Within a standard
minimax estimation framework, we provide tight upper and lower bounds on the
error in estimating $\theta^*$ under both local and central models of DP. We
show, for a given privacy budget $\epsilon$ and number of samples $n$, that the
additional cost to ensure label-DP under local model is $\Theta \big(\frac{1}{
e^\epsilon-1}\sqrt{\frac{d}{n}}\big)$, while it is
$\Theta\big(\frac{\text{poly}(d)}{\epsilon n} \big)$ under the weaker central
model. We perform simulations on synthetic data that corroborate these
theoretical results.
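For illustration, the following is a minimal synthetic sketch (not the authors' exact experimental setup): pairwise-comparison labels are drawn from a BTL model with a latent parameter $\theta^*$, each label is privatized by randomized response, which guarantees $\epsilon$-label-DP in the local model, and $\theta^*$ is recovered by maximizing a flip-corrected log-likelihood. The dimension, sample size, and use of SciPy's L-BFGS optimizer are illustrative choices.

```python
# Sketch only: synthetic BTL comparisons, local label-DP via randomized
# response, and flip-corrected maximum-likelihood estimation of theta.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
d, n, eps = 5, 5000, 1.0
theta_star = rng.normal(size=d) / np.sqrt(d)

# Feature differences x = phi(action_1) - phi(action_2) for each comparison.
X = rng.normal(size=(n, d))
# BTL / logistic preference probabilities and raw (non-private) labels.
y = rng.binomial(1, expit(X @ theta_star))

# Local label-DP: flip each label with probability 1 / (e^eps + 1).
rho = 1.0 / (np.exp(eps) + 1.0)
y_priv = np.where(rng.random(n) < rho, 1 - y, y)

def neg_log_lik(theta, X, y, rho):
    """Negative log-likelihood of the flipped labels under the BTL model."""
    p = expit(X @ theta)               # P(prefer action_1) before flipping
    q = (1 - rho) * p + rho * (1 - p)  # P(observed label = 1) after flipping
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

theta_hat = minimize(neg_log_lik, np.zeros(d), args=(X, y_priv, rho),
                     method="L-BFGS-B").x
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```

Re-running the sketch with larger $\epsilon$ (less label noise) or larger $n$ shrinks the reported error, which is the qualitative behavior the bounds above describe.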
Key words
private reward estimation