Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping

Andrew Y. Ng, Daishi Harada, Stuart Russell

ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning (1999)

Cited by 2874
Abstract
This paper investigates conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy. It is shown that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known "bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics. We show that such potentials can lead to substantial reductions in learning time.
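
The shaping result described above says that adding a transition reward of the form F(s, s') = γΦ(s') − Φ(s), for any potential function Φ, preserves the optimal policy. Below is a minimal sketch of that idea on a toy grid world with a distance-based potential, in the spirit of the distance-based heuristics mentioned in the abstract. The grid layout, Q-learning hyperparameters, and the specific potential are illustrative assumptions, not details taken from the paper.

```python
# Potential-based reward shaping on a toy grid world:
# shaped reward = base reward + gamma * Phi(s') - Phi(s).
import random

GRID = 5                      # 5x5 grid, start at (0, 0), goal at (4, 4)
GOAL = (GRID - 1, GRID - 1)
GAMMA = 0.95
ALPHA = 0.1
EPSILON = 0.1
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # right, left, down, up

def potential(state):
    """Distance-based potential: higher (less negative) closer to the goal."""
    return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))

def step(state, action):
    """Deterministic grid dynamics; base reward is -1 per step, 0 at the goal."""
    nxt = (min(max(state[0] + action[0], 0), GRID - 1),
           min(max(state[1] + action[1], 0), GRID - 1))
    reward = 0.0 if nxt == GOAL else -1.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, shaped=True):
    """Tabular Q-learning; if `shaped`, add F(s, s') = gamma*Phi(s') - Phi(s)."""
    Q = {((r, c), a): 0.0 for r in range(GRID) for c in range(GRID)
         for a in range(len(ACTIONS))}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            if random.random() < EPSILON:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[(state, i)])
            nxt, reward, done = step(state, ACTIONS[a])
            if shaped:
                # Potential-based shaping term; leaves the optimal policy unchanged.
                reward += GAMMA * potential(nxt) - potential(state)
            best_next = 0.0 if done else max(Q[(nxt, i)] for i in range(len(ACTIONS)))
            Q[(state, a)] += ALPHA * (reward + GAMMA * best_next - Q[(state, a)])
            state = nxt
    return Q
```

With the distance-based potential, the shaped agent typically needs fewer episodes to learn the shortest path than the unshaped one (`shaped=False`), while both converge to the same greedy policy, mirroring the learning-time reductions reported in the paper.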
Keywords
Policy Invariance