Q-learning and policy iteration algorithms for stochastic shortest path problems

Huizhen Yu; Dimitri P. Bertsekas

doi:10.1007/s10479-012-1128-z

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
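The policy evaluation idea in the abstract can be illustrated with a minimal synchronous sketch. This is not the authors' exact algorithm, only an illustration under stated assumptions: the transition array `P`, the cost table `g`, the function names, and the sweep counts are all hypothetical, and every policy in this toy problem is proper (each state terminates with probability at least 0.2 under every control). The inner sweeps treat the next state as an optimal stopping problem (stop and pay `J[j]`, or continue with the current policy's Q-factor), so they need only a two-way minimum; the minimization over all controls happens once per outer iteration.

```python
import numpy as np

# Hypothetical SSP with 3 non-terminal states and 2 controls.
# P[u, i, j] is the probability of moving from state i to state j under
# control u. Rows sum to less than 1; the remainder is the probability of
# reaching the cost-free termination state.
P = np.array([
    [[0.5, 0.2, 0.0],
     [0.1, 0.4, 0.3],
     [0.0, 0.3, 0.4]],
    [[0.1, 0.6, 0.1],
     [0.3, 0.0, 0.3],
     [0.2, 0.2, 0.2]],
])
# g[i, u] is the expected one-stage cost of control u at state i.
g = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [4.0, 0.5]])


def q_value_iteration(P, g, sweeps=500):
    """Classical value iteration on Q-factors: a full min over controls per sweep."""
    Q = np.zeros_like(g)
    for _ in range(sweeps):
        J = Q.min(axis=1)                        # minimization over all controls
        Q = g + np.einsum("uij,j->iu", P, J)     # undiscounted total-cost backup
    return Q


def stopping_policy_iteration(P, g, outer=300, inner=3):
    """Policy-iteration-like scheme with inexact, stopping-based evaluation."""
    n, _ = g.shape
    Q = np.zeros_like(g)
    J = np.zeros(n)                  # stopping costs
    mu = np.zeros(n, dtype=int)      # current policy
    states = np.arange(n)
    for _ in range(outer):
        # Inexact policy evaluation: a few value iterations of the stopping
        # problem; only a two-way min at each next state, no min over controls.
        for _ in range(inner):
            cont = np.minimum(J, Q[states, mu])
            Q = g + np.einsum("uij,j->iu", P, cont)
        # Policy improvement: the only place a min over all controls is taken.
        mu = Q.argmin(axis=1)
        J = Q.min(axis=1)
    return Q, J, mu


Q_star = q_value_iteration(P, g)
Q, J, mu = stopping_policy_iteration(P, g)
print(np.max(np.abs(Q - Q_star)))    # gap between the two schemes' Q-factors
```

On this toy problem both routines should reach the same Q-factors up to numerical tolerance, since each backup shrinks the sup-norm error by at least the largest row sum of `P` (0.8 here). The paper's results cover asynchronous and stochastic implementations, which this sketch does not attempt.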

Original language	English (US)
Pages (from-to)	95-132
Number of pages	38
Journal	Annals of Operations Research
Volume	208
Issue number	1
DOIs	https://doi.org/10.1007/s10479-012-1128-z
State	Published - Sep 2013
Externally published	Yes

Keywords

Approximate dynamic programming
Markov decision processes
Policy iteration
Q-learning
Stochastic approximation
Stochastic shortest paths
Value iteration

ASJC Scopus subject areas

General Decision Sciences
Management Science and Operations Research

Cite this

@article{f5230312dd69495caefdc7baed61ec86,

title = "Q-learning and policy iteration algorithms for stochastic shortest path problems",

abstract = "We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.",

keywords = "Approximate dynamic programming, Markov decision processes, Policy iteration, Q-learning, Stochastic approximation, Stochastic shortest paths, Value iteration",

author = "Yu, Huizhen and Bertsekas, {Dimitri P.}",

note = "Funding Information: Work supported by the Air Force Grant FA9550-10-1-0412 and by NSF Grant ECCS-0801549.",

year = "2013",

month = sep,

doi = "10.1007/s10479-012-1128-z",

language = "English (US)",

volume = "208",

pages = "95--132",

journal = "Annals of Operations Research",

issn = "0254-5330",

publisher = "Springer Netherlands",

number = "1",

}

TY - JOUR

T1 - Q-learning and policy iteration algorithms for stochastic shortest path problems

AU - Yu, Huizhen

AU - Bertsekas, Dimitri P.

N1 - Funding Information: Work supported by the Air Force Grant FA9550-10-1-0412 and by NSF Grant ECCS-0801549.

PY - 2013/9

Y1 - 2013/9

AB - We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.

KW - Approximate dynamic programming

KW - Markov decision processes

KW - Policy iteration

KW - Q-learning

KW - Stochastic approximation

KW - Stochastic shortest paths

KW - Value iteration

UR - http://www.scopus.com/inward/record.url?scp=84883051777&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883051777&partnerID=8YFLogxK

U2 - 10.1007/s10479-012-1128-z

DO - 10.1007/s10479-012-1128-z

M3 - Article

AN - SCOPUS:84883051777

SN - 0254-5330

VL - 208

SP - 95

EP - 132

JO - Annals of Operations Research

JF - Annals of Operations Research

IS - 1

ER -
