Provably Efficient Neural Offline Reinforcement Learning via Perturbed Rewards

ICLR 2023

Abstract
We propose a novel offline reinforcement learning (RL) algorithm, namely PEturbed-Reward Value Iteration (PERVI), which amalgamates the randomized value function idea with the pessimism principle. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, PERVI obtains pessimism implicitly by perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action values, and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., a neural network) to the perturbed datasets using gradient descent. As a result, PERVI only needs $\mathcal{O}(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data splitting technique that helps remove the potentially large log covering number in the learning bound. We prove that PERVI yields a provable uncertainty quantifier with overparameterized neural networks and achieves an $\tilde{\mathcal{O}}\left( \frac{ \kappa H^{5/2} \tilde{d} }{\sqrt{K}} \right)$ sub-optimality, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length, and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of PERVI with an empirical evaluation on a wide range of synthetic and real-world datasets. To the best of our knowledge, PERVI is the first offline RL algorithm that is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
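The core mechanism described above can be sketched in a few lines. This is a hedged illustration only, not the paper's implementation: ridge regression on random features stands in for the neural network trained by gradient descent, and all names (`phi`, `rewards`, `sigma`, `M`, `lam`) and hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset: features phi(s, a) and observed rewards.
# (Synthetic stand-in data; the paper uses trajectories from an MDP.)
K, d = 200, 5
phi = rng.normal(size=(K, d))
theta_true = rng.normal(size=d)
rewards = phi @ theta_true + 0.1 * rng.normal(size=K)

M = 10       # ensemble size (illustrative choice)
sigma = 0.5  # reward-perturbation scale (illustrative choice)
lam = 1.0    # ridge regularization (stands in for network training details)

# Fit one value model per perturbed copy of the data: each copy adds
# fresh i.i.d. Gaussian noise to the rewards, yielding an ensemble.
thetas = []
for _ in range(M):
    noisy_r = rewards + sigma * rng.normal(size=K)
    A = phi.T @ phi + lam * np.eye(d)
    thetas.append(np.linalg.solve(A, phi.T @ noisy_r))
thetas = np.stack(thetas)

def pessimistic_value(phi_sa):
    """Implicit pessimism: take the minimum over the ensemble's
    value estimates; action selection is O(1) in K at decision time."""
    return np.min(thetas @ phi_sa)
```

Taking the ensemble minimum plays the role of an explicit LCB: states and actions the data covers poorly receive more disagreement across the perturbed fits, so their minimum value is pushed down.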
Keywords
Offline Reinforcement Learning, Neural Networks