Variance Reduction based Experience Replay for Policy Optimization
arXiv (2021)
Abstract
For reinforcement learning on complex stochastic systems, it is desirable to
effectively leverage the information from historical samples collected in
previous iterations to accelerate policy optimization. Classical experience
replay, while effective, treats all observations uniformly, neglecting their
relative importance. To address this limitation, we introduce a novel Variance
Reduction Experience Replay (VRER) framework, enabling the selective reuse of
relevant samples to improve policy gradient estimation. VRER, as an adaptable
method that can seamlessly integrate with different policy optimization
algorithms, forms the foundation of our sample-efficient off-policy learning
algorithm known as Policy Gradient with VRER (PG-VRER). Furthermore, the lack
of a rigorous understanding of the experience replay approach in the literature
motivates us to introduce a novel theoretical framework that accounts for
sample dependencies induced by Markovian noise and behavior policy
interdependencies. This framework is then employed to analyze the finite-time
convergence of the proposed PG-VRER algorithm, revealing a crucial
bias-variance trade-off in policy gradient estimation: the reuse of older
experience tends to introduce a larger bias while simultaneously reducing
gradient estimation variance. Extensive experiments have shown that VRER offers
a notable and consistent acceleration in learning optimal policies and enhances
the performance of state-of-the-art (SOTA) policy optimization approaches.
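The abstract gives no pseudocode, so the following minimal Python sketch only illustrates the kind of selection rule it describes: reuse a historical batch when its likelihood-ratio-weighted gradient estimate is not much noisier than the on-policy estimate. Everything here is an assumption for illustration, not the paper's actual method: the Gaussian policy class, the helper names (score, log_prob, batch_gradients, pg_vrer_estimate), and the reuse threshold c are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(theta, s, a):
    """Score function d/dtheta log pi_theta(a|s) for a unit-variance
    Gaussian policy with mean theta @ s (an illustrative policy class)."""
    return np.outer(a - theta @ s, s).ravel()

def log_prob(theta, s, a):
    # Log-density of the same Gaussian policy, up to an additive constant.
    return -0.5 * np.sum((a - theta @ s) ** 2)

def batch_gradients(theta_k, theta_i, batch):
    """Per-sample likelihood-ratio policy-gradient terms w * G * score,
    where w = pi_{theta_k}(a|s) / pi_{theta_i}(a|s) reweights samples
    collected under the historical behavior policy theta_i."""
    grads = []
    for s, a, G in batch:
        w = np.exp(log_prob(theta_k, s, a) - log_prob(theta_i, s, a))
        grads.append(w * G * score(theta_k, s, a))
    return np.asarray(grads)

def pg_vrer_estimate(theta_k, buffer, c=1.0):
    """Hypothetical variance-based selection rule suggested by the
    abstract: reuse a historical batch only if its importance-weighted
    estimator variance is within (1 + c) times the on-policy variance."""
    # The newest buffer entry was collected under theta_k itself (w = 1).
    on_policy = batch_gradients(theta_k, buffer[-1][0], buffer[-1][1])
    var_now = on_policy.var(axis=0).sum()
    selected = [on_policy]
    for theta_i, batch in buffer[:-1]:
        g = batch_gradients(theta_k, theta_i, batch)
        if g.var(axis=0).sum() <= (1.0 + c) * var_now:
            selected.append(g)
    # Average over every sample in the selected batches.
    return np.concatenate(selected).mean(axis=0)

# Toy usage: two historical batches plus the current on-policy batch.
dim_s, dim_a = 3, 2
def make_batch(theta, n=64):
    out = []
    for _ in range(n):
        s = rng.normal(size=dim_s)
        a = theta @ s + rng.normal(size=dim_a)
        out.append((s, a, float(rng.normal())))  # (state, action, return)
    return out

thetas = [rng.normal(scale=0.1, size=(dim_a, dim_s)) for _ in range(3)]
buffer = [(th, make_batch(th)) for th in thetas]
print(pg_vrer_estimate(thetas[-1], buffer, c=1.0))
```

Raising c reuses more history, which (per the abstract's bias-variance trade-off) lowers gradient variance at the cost of more bias from older behavior policies.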