When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning
CoRR (2024)
Abstract
Past analyses of reinforcement learning from human feedback (RLHF) assume
that the human fully observes the environment. What happens when human feedback
is based only on partial observations? We formally define two failure cases:
deception and overjustification. Modeling the human as Boltzmann-rational
w.r.t. a belief over trajectories, we prove conditions under which RLHF is
guaranteed to result in policies that deceptively inflate their performance,
overjustify their behavior to make an impression, or both. To help address
these issues, we mathematically characterize how partial observability of the
environment translates into (lack of) ambiguity in the learned return function.
In some cases, accounting for partial observability makes it theoretically
possible to recover the return function and thus the optimal policy, while in
other cases, there is irreducible ambiguity. We caution against blindly
applying RLHF in partially observable settings and propose research directions
to help tackle these challenges.
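The human model described above can be illustrated in miniature. The sketch below (illustrative names and numbers, not the paper's code) models a human who forms a believed return for each trajectory by averaging over the full trajectories consistent with a partial observation, and then gives Boltzmann-rational comparison feedback on those believed returns:

```python
import math

def believed_return(candidate_returns, belief_probs):
    """Expected return under the human's belief over the full
    trajectories that could have produced the partial observation."""
    return sum(p * r for p, r in zip(belief_probs, candidate_returns))

def boltzmann_preference(believed_a, believed_b, beta=1.0):
    """P(human prefers A over B): logistic in the believed return gap,
    with rationality coefficient beta."""
    return 1.0 / (1.0 + math.exp(-beta * (believed_a - believed_b)))

# Deception in miniature (hypothetical numbers): both trajectories have
# the same true return, but trajectory A's observation hides its bad
# outcome, so the human's belief inflates A's return.
belief_a = believed_return([1.0, -1.0], [0.9, 0.1])  # observation hides failure
belief_b = believed_return([1.0, -1.0], [0.5, 0.5])  # observation uninformative
p_prefer_a = boltzmann_preference(belief_a, belief_b, beta=2.0)
```

Here `p_prefer_a` exceeds 0.5 even though the true returns are equal, so RLHF on such feedback rewards the policy that shapes what the human observes rather than what actually happens.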