Truncated Variance Reduced Value Iteration
CoRR (2024)
Abstract
We provide faster randomized algorithms for computing an ϵ-optimal
policy in a discounted Markov decision process with
A_tot state-action pairs, bounded rewards, and discount factor
γ. We provide an Õ(A_tot [(1 - γ)^-3 ϵ^-2 + (1 - γ)^-2])-time algorithm in the sampling
setting, where the probability transition matrix is unknown but accessible
through a generative model that can be queried in Õ(1) time, and an
Õ(s + (1 - γ)^-2)-time algorithm in the offline setting, where
the probability transition matrix is known and s-sparse. These results
improve upon the prior state of the art, which either ran in
Õ(A_tot [(1 - γ)^-3 ϵ^-2 + (1 - γ)^-3])
time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, ran in Õ(s +
A_tot (1 - γ)^-3) time [Sidford, Wang, Wu, Yang, Ye 2018] in the
offline setting, or required time at least quadratic in the number of states when using
interior point methods for linear programming. We achieve our results by
building upon prior stochastic variance-reduced value iteration methods
[Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully
truncates the progress of its iterates to improve the variance of new
variance-reduced sampling procedures that we introduce to implement the steps.
Our method is essentially model-free and can be implemented in
Õ(A_tot)-space when given generative model access.
Consequently, our results take a step in closing the sample-complexity gap
between model-free and model-based methods.
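
To make the high-level description above concrete, the following is a minimal Python sketch of the variance-reduction-plus-truncation idea in the sampling setting. The function name, parameter values, and the exact clipping rule used for the truncation step are illustrative assumptions, not the paper's tuned procedure, which sets these quantities carefully to obtain the stated runtime.

```python
import numpy as np

def truncated_vr_value_iteration(sample, r, gamma, n_epochs=10, n_inner=50,
                                 m_ref=2000, m_step=10, step_cap=0.1):
    """Sketch of variance-reduced value iteration with truncated iterate
    progress in the generative-model (sampling) setting.

    sample(s, a, k) must return k i.i.d. next states drawn from P(.|s, a);
    it stands in for the paper's Õ(1)-time generative model. All parameter
    values and the truncation rule here are illustrative choices.
    """
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(n_epochs):
        v0 = v.copy()
        # Reference pass: estimate (P v0)(s, a) once per epoch with many
        # samples, amortizing the expensive high-accuracy estimates.
        p_v0 = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                p_v0[s, a] = v0[sample(s, a, m_ref)].mean()
        for _ in range(n_inner):
            q = np.empty((n_states, n_actions))
            for s in range(n_states):
                for a in range(n_actions):
                    # Cheap correction: v - v0 has small range, so a few
                    # samples estimate P(v - v0) with low variance.
                    nxt = sample(s, a, m_step)
                    corr = (v[nxt] - v0[nxt]).mean()
                    q[s, a] = r[s, a] + gamma * (p_v0[s, a] + corr)
            # Truncate the per-iteration progress so the iterates stay close
            # to v0, which keeps the variance of later corrections small.
            v = v + np.clip(q.max(axis=1) - v, -step_cap, step_cap)
    return v
```

For testing against an explicit transition matrix P, a generative model can be simulated with, e.g., `sample = lambda s, a, k: rng.choice(n_states, size=k, p=P[s, a])`; the point of the sampling setting, however, is that the algorithm itself never needs P, only draws from it.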