Guided deterministic policy optimization with gradient-free policy parameters information.

Expert Syst. Appl. (2023)

Abstract
Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) are two classical deterministic policy gradient algorithms. Notably, the policies of both DDPG and TD3 depend entirely on the gradient of the critics. This makes the policy unstable and prone to converging to local optima during learning. Although maximum-entropy learning can provide more effective exploration, it applies only to algorithms with stochastic policies, not to DDPG or TD3. In this paper, we propose a deterministic policy optimization method that incorporates gradient-free policy parameter information (GFPPI). Specifically, we obtain a new set of policies by injecting Gaussian noise into the policy parameters, and then weight these perturbed parameter sets according to the critics to obtain GFPPI. Finally, GFPPI is used as a regularization term in the policy optimization objective to guide the policy update. GFPPI can mitigate premature policy convergence and facilitate exploration following the principle of optimism. We provide a theoretical guarantee of monotonic improvement in expected cumulative return when using the loss function augmented with GFPPI, experimentally analyze the role of GFPPI in policy optimization, and combine it with deterministic policy gradient information for policy updates. Experiments on OpenAI Gym demonstrate that GFPPI improves sample efficiency and enables the algorithm to achieve higher performance.
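As a rough illustration of the procedure the abstract describes, the following PyTorch-style sketch perturbs the actor parameters with Gaussian noise, weights the perturbed parameter vectors by the critic's value estimates, and uses the weighted average as a regularization target in the actor loss. All names here (`gfppi_target`, `policy_loss_with_gfppi`, `noise_std`, `temperature`, `reg_coef`) and the assumed `actor(states)` / `critic(states, actions)` signatures are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch

def gfppi_target(actor, critic, states, num_samples=8, noise_std=0.05, temperature=1.0):
    """Critic-weighted average of noise-perturbed actor parameters (gradient-free)."""
    flat = torch.nn.utils.parameters_to_vector(actor.parameters()).detach()
    probe = copy.deepcopy(actor)  # scratch copy used to evaluate perturbed parameters
    candidates, scores = [], []
    with torch.no_grad():
        for _ in range(num_samples):
            # Inject Gaussian noise into the flattened policy parameters.
            candidate = flat + noise_std * torch.randn_like(flat)
            torch.nn.utils.vector_to_parameters(candidate, probe.parameters())
            # Score the perturbed policy with the critic's value estimate.
            scores.append(critic(states, probe(states)).mean())
            candidates.append(candidate)
        # Softmax-weight the perturbed parameter vectors by their critic scores.
        weights = torch.softmax(torch.stack(scores) / temperature, dim=0)
        target = (weights.unsqueeze(1) * torch.stack(candidates)).sum(dim=0)
    return target

def policy_loss_with_gfppi(actor, critic, states, reg_coef=1e-3):
    """Standard deterministic policy-gradient loss plus a GFPPI regularization term."""
    dpg_loss = -critic(states, actor(states)).mean()   # DDPG/TD3-style actor loss
    target = gfppi_target(actor, critic, states)       # gradient-free target parameters
    flat = torch.nn.utils.parameters_to_vector(actor.parameters())
    reg = torch.sum((flat - target) ** 2)               # pull parameters toward GFPPI
    return dpg_loss + reg_coef * reg
```

In this sketch the GFPPI term only biases the update toward parameter regions the critic scores highly; the gradient-based deterministic policy gradient remains the primary learning signal, matching the regularization role described in the abstract.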
Keywords
Deterministic policy gradient, Premature convergence, Local optimum, Policy optimization, Exploration, Sample efficiency