Exploration with Expert Policy Advice

semanticscholar(2019)

Abstract
Exploration for Reinforcement Learning is a challenging problem. Random exploration is often highly inefficient and, in sparse-reward environments, may fail completely. In this work, we develop a novel method that incorporates expert advice for exploration in sparse-reward environments. In our formulation, the agent has access to a set of expert policies and learns to bias its exploration based on the experts' suggested actions. By incorporating expert suggestions, the agent is able to quickly learn a policy that reaches rewarding states. Our method can mix and match experts' advice during an episode to reach goal states. Moreover, our formulation does not restrict the agent to any policy set, which allows us to aim for a globally optimal solution. In our experiments, we show that using expert advice indeed leads to faster exploration in challenging grid-world environments.

The field of Reinforcement Learning (RL) has made a number of major breakthroughs in recent years. Mnih et al. (2013) introduced Deep Q-Networks that successfully learned to play Atari games. More recently, RL techniques have been applied to learn to play Go at human-level performance (Silver et al. 2016; 2017). There have also been successes in high-dimensional control applications (Heess et al. 2017). Despite these many breakthroughs, how to efficiently explore a domain is still an open problem. During the learning process an agent must try different actions in order to learn both how actions affect the world and which actions should be included in the final policy. An agent that efficiently samples interesting and unique parts of the state-action space can converge to a good policy with fewer samples, an important metric when considering applications such as robotics, where sample collection has a real cost.

One potential way to reduce the number of samples that must be collected is to transfer knowledge from similar tasks that the agent has already learned. However, overly biasing learning with previous experience can lead an agent to miss better policies for the specific task, and if a task is sufficiently different, the bias may hurt performance. In this work we propose an algorithm for transferring knowledge from agents trained on similar tasks to efficiently learn a new task. The goal is to collect samples more efficiently, while still being able to learn a good policy in the case of bad advice.

Our method works by biasing the exploration strategy based on the previous policies. This exploration strategy draws parallels with the Randomized Weighted Majority Algorithm (RWMA) (Littlestone and Warmuth 1994) for prediction in sequential trials with expert advice. RWMA maintains a set of weights that capture the utility of each expert's advice. Our strategy for using expert policies is similar, but differs in two ways: first, the weights we maintain are state-dependent; second, we use long-term returns instead of immediate feedback when updating them. We show that our algorithm speeds up policy learning on the grid-world environments Frozen Lake and Four Rooms, compared to baselines.
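The abstract describes the approach only at a high level, so the following is a minimal sketch of what state-dependent, RWMA-style exploration weighting over expert policies could look like in a tabular grid-world. The class name ExpertAdvisedExplorer, the hyperparameters eta and epsilon, and the exponential return-based weight update are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Discounted return G_t for every step of an episode (long-term feedback)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

class ExpertAdvisedExplorer:
    """Illustrative sketch: bias exploration toward expert-suggested actions,
    with per-state weights over experts updated from long-term returns."""

    def __init__(self, n_states, n_actions, experts, eta=0.5, gamma=0.99):
        self.experts = experts                      # callables: state -> action
        self.gamma = gamma
        self.eta = eta                              # weight-update step size (assumed)
        self.w = np.ones((n_states, len(experts)))  # state-dependent expert weights
        self.Q = np.zeros((n_states, n_actions))    # the agent's own value estimates

    def act(self, state, epsilon=0.3):
        """With probability epsilon, follow an expert sampled from this state's
        weights; otherwise act greedily on the agent's own Q-values."""
        if np.random.rand() < epsilon:
            probs = self.w[state] / self.w[state].sum()
            k = int(np.random.choice(len(self.experts), p=probs))
            return self.experts[k](state), k
        return int(np.argmax(self.Q[state])), None

    def update_weights(self, visited, rewards):
        """RWMA-inspired multiplicative update, driven by the discounted return
        observed from each state rather than immediate feedback (assumption)."""
        returns = discounted_returns(rewards, self.gamma)
        for (state, expert_idx), G in zip(visited, returns):
            if expert_idx is not None:
                self.w[state, expert_idx] *= np.exp(self.eta * G)
```

Because the weights are indexed by state, different experts can dominate in different regions of the grid, which is what would let the agent mix and match advice within a single episode; in sparse-reward settings, a return-based update only reinforces an expert once its advice actually leads to a rewarding state.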