Trajectory-Based Prioritized Double Experience Buffer for Sample-Efficient Policy Optimization

IEEE Access (2021)

Abstract
Reinforcement learning has recently made great progress in challenging domains such as the board game Go and the real-time strategy game StarCraft II. Policy gradient methods have become the mainstream approach owing to their effectiveness and simplicity in both discrete and continuous action spaces. However, policy gradient methods commonly involve function approximation and operate on-policy, which leads to high gradient variance and low sample efficiency. This paper introduces a novel policy gradient method that improves sample efficiency via a pair of trajectory-based prioritized replay buffers and reduces training variance with a target network whose weights are updated in a "soft" manner. We evaluate our method on the OpenAI Gym suite of reinforcement learning tasks, and the results show that the proposed method learns more stably and achieves higher performance than existing methods.
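The abstract only outlines the two mechanisms, so the sketch below is a rough illustration of one plausible reading, not the paper's actual design. The class and function names (TrajectoryPrioritizedBuffer, soft_update), the return-based priority rule, the eviction policy, and the tau value are all assumptions made for this example.

```python
import numpy as np

class TrajectoryPrioritizedBuffer:
    """Replay buffer that stores whole trajectories and samples them with
    probability proportional to a scalar priority. The priority rule used
    here (absolute trajectory return) is an assumption for illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.trajectories = []  # each entry: a list of (state, action, reward) steps
        self.priorities = []    # one positive priority per stored trajectory

    def add(self, trajectory):
        # Assumed priority: |sum of rewards| plus a small constant to keep it positive.
        priority = abs(sum(step[2] for step in trajectory)) + 1e-6
        if len(self.trajectories) >= self.capacity:
            # Evict the lowest-priority trajectory to make room (assumed policy).
            idx = int(np.argmin(self.priorities))
            self.trajectories.pop(idx)
            self.priorities.pop(idx)
        self.trajectories.append(trajectory)
        self.priorities.append(priority)

    def sample(self):
        # Draw one trajectory with probability proportional to its priority.
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.trajectories), p=probs)
        return self.trajectories[idx]


def soft_update(target_params, online_params, tau=0.005):
    """Polyak ("soft") target update: theta_target <- tau*theta + (1-tau)*theta_target.
    Parameters are plain dicts of NumPy arrays; tau=0.005 is a common default,
    not a value taken from the paper."""
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]


# One plausible use of the *pair* of buffers: keep recent trajectories in one
# buffer and high-return ones in another, then mix samples from both.
recent_buf = TrajectoryPrioritizedBuffer(capacity=100)
elite_buf = TrajectoryPrioritizedBuffer(capacity=100)

traj = [(np.zeros(4), 0, 1.0), (np.ones(4), 1, 0.5)]  # toy (state, action, reward) steps
recent_buf.add(traj)
sampled = recent_buf.sample()
```

Under this reading, sampling stored trajectories off-policy raises sample efficiency, while the slowly moving soft-updated target network damps the variance of the policy gradient updates; how the two buffers are actually populated and mixed is specified only in the full paper.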
Key words
Trajectory, Training, Reinforcement learning, Optimization, Linear programming, Gradient methods, Games, policy gradient, replay buffer, distributed RL