Learning Policies for Markov Decision Processes from Data.

IEEE Transactions on Automatic Control (2019)

Cited by 8 | Viewed 41
Abstract
We consider the problem of learning a policy for a Markov decision process consistent with data captured on the state-action pairs followed by the policy. We parameterize the policy using features associated with the state-action pairs. The features can be handcrafted or defined using kernel functions in a reproducing kernel Hilbert space. In either case, the set of features can be large, and only a small, unknown subset may be needed to fit a specific policy to the data. The parameters of such a policy are recovered using $\ell_1$-regularized logistic regression. We establish bounds on the difference between the average rewards of the estimated and the unknown original policies (regret) in terms of the generalization error and the ergodic coefficient of the underlying Markov chain. To that end, we combine sample complexity theory and sensitivity analysis of the stationary distribution of Markov chains. Our analysis suggests that to achieve regret of order $O(\sqrt{\epsilon})$, it suffices to use a training sample size of order $\Omega(\log n \cdot \text{poly}(1/\epsilon))$, where $n$ is the number of features. We demonstrate the effectiveness of our method on a synthetic robot navigation example.
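The abstract describes recovering the parameters of a feature-based (softmax) policy from logged state-action data via $\ell_1$-regularized logistic regression. The following is a minimal sketch of that idea, not the authors' code: it assumes features depend only on the state and treats each action as a class label (the paper's general setting uses features on state-action pairs), and all names and constants are illustrative.

```python
# Minimal sketch: fitting a sparse softmax policy from logged state-action data
# with L1-regularized logistic regression. Simplifying assumption: features are
# state-only and actions act as class labels (the paper allows state-action features).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features, n_actions, n_samples = 200, 4, 1000

# Ground-truth policy depends on only a small subset of features (sparsity).
theta_true = np.zeros((n_features, n_actions))
theta_true[:10] = rng.normal(size=(10, n_actions))

# Simulated logged data: state feature vectors and actions sampled from the policy.
Phi = rng.normal(size=(n_samples, n_features))
logits = Phi @ theta_true
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
actions = np.array([rng.choice(n_actions, p=p) for p in probs])

# L1-regularized multinomial logistic regression; the 'saga' solver supports l1.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(Phi, actions)

print("nonzero coefficients:", np.count_nonzero(clf.coef_), "of", clf.coef_.size)
```

In this sketch the $\ell_1$ penalty (controlled by `C`) drives most coefficients to zero, illustrating how only the small relevant subset of a large feature set is selected when fitting the policy.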
Keywords
Markov processes, Kernel, Logistics, Learning (artificial intelligence), Supervised learning, Complexity theory, Process control