Q-learning and enhanced policy iteration in discounted dynamic programming

Research output: Contribution to journal › Article › peer-review


Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference from standard policy iteration is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
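
To make the evaluation step concrete, the following is a minimal sketch, assuming a small synthetic finite MDP with expected one-stage costs and discount factor alpha; the names (P, g, mu, Q), the synchronous sweeps, and the iteration counts are illustrative choices, not the paper's exact algorithm or notation. The inner loop replaces the exact linear-system solve of policy evaluation with a few value-iteration-style passes of a stopping-type update, taking min(J, Q(., mu(.))) as the value at the next state.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, alpha = 5, 3, 0.9

    # Illustrative random data: transition probabilities P[u][i, j] and
    # expected one-stage costs g[i, u] for a cost-minimization MDP.
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    g = rng.random((n_states, n_actions))

    Q = np.zeros((n_states, n_actions))
    for _ in range(100):                        # outer, policy iteration-like loop
        J = Q.min(axis=1)                       # current state-cost estimates
        mu = Q.argmin(axis=1)                   # greedy policy (policy improvement)
        for _ in range(5):                      # inexact evaluation: a few sweeps
            cont = Q[np.arange(n_states), mu]   # cost of continuing with policy mu
            w = np.minimum(J, cont)             # either "stop" at J or continue with mu
            for u in range(n_actions):
                Q[:, u] = g[:, u] + alpha * P[u] @ w

    print("approximate optimal state costs:", Q.min(axis=1))

Because the stopping-style update is still a contraction for this discounted problem, the sketch converges to the optimal Q-factors even though each evaluation phase uses only a handful of sweeps; the paper develops asynchronous and stochastic (Q-learning-like) counterparts of this idea.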

Original language: English (US)
Pages (from-to): 66-94
Number of pages: 29
Journal: Mathematics of Operations Research
Volume: 37
Issue number: 1
DOIs
State: Published - Feb 2012
Externally published: Yes

Keywords

  • Dynamic programming
  • Markov decision processes
  • Q-learning
  • Policy iteration
  • Reinforcement learning
  • Stochastic approximation
  • Value iteration

ASJC Scopus subject areas

  • General Mathematics
  • Computer Science Applications
  • Management Science and Operations Research
