Actor-critic with familiarity-based trajectory experience replay

Information Sciences (2022)

Abstract
This paper aims to address the sample inefficiency of Asynchronous Advantage Actor-Critic (A3C). First, we design a new off-policy actor-critic algorithm that combines actor-critic with experience replay to improve sample efficiency. Next, we study how trajectory experience should be sampled from the replay buffer and propose a familiarity-based replay mechanism, which uses the number of times an experience has been replayed as its sampling probability weight. Finally, we use the GAE-V method to correct the bias introduced by off-policy learning. We also achieve better performance by adopting a mechanism that combines off-policy and on-policy learning to update the network. Our results on the Atari and MuJoCo benchmarks show that each of these innovations contributes to improvements in both data efficiency and final performance. Furthermore, our approach retains A3C's fast convergence speed and parallelism, and also performs better at exploration.
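As a rough illustration of the familiarity-based replay mechanism described in the abstract, the Python sketch below keeps a per-trajectory replay count and uses it as the sampling weight. The class name, the buffer capacity, and the particular weighting 1 / (1 + replay count), which favours less-replayed trajectories, are illustrative assumptions rather than the paper's exact formulation.

import random

class FamiliarityReplayBuffer:
    """Minimal sketch of familiarity-weighted trajectory replay (assumed form)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.trajectories = []   # stored trajectories (lists of transitions)
        self.replay_counts = []  # how many times each trajectory was replayed

    def add(self, trajectory):
        # Drop the oldest trajectory when the buffer is full.
        if len(self.trajectories) >= self.capacity:
            self.trajectories.pop(0)
            self.replay_counts.pop(0)
        self.trajectories.append(trajectory)
        self.replay_counts.append(0)

    def sample(self):
        # Familiarity-based weighting: trajectories replayed fewer times get a
        # higher sampling probability (assumed weighting for illustration).
        weights = [1.0 / (1 + c) for c in self.replay_counts]
        idx = random.choices(range(len(self.trajectories)), weights=weights, k=1)[0]
        self.replay_counts[idx] += 1
        return self.trajectories[idx]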
Keywords
Reinforcement learning, Sample efficiency, Actor-critic, Off-policy, Asynchronous Advantage Actor-Critic