Q-learning and enhanced policy iteration in discounted dynamic programming

Research output: Contribution to journal › Article › peer-review


Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference from standard policy iteration is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
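
To make the evaluation step concrete, the following is a minimal sketch, assuming a small synthetic finite MDP with expected one-stage costs and discount factor alpha; the names (P, g, mu, Q), the synchronous sweeps, and the iteration counts are illustrative choices, not the paper's exact algorithm or notation. The inner loop replaces the exact linear-system solve of policy evaluation with a few value-iteration-style passes of a stopping-type update, taking min(J, Q(., mu(.))) as the value at the next state.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, alpha = 5, 3, 0.9

    # Illustrative random data: transition probabilities P[u][i, j] and
    # expected one-stage costs g[i, u] for a cost-minimization MDP.
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    g = rng.random((n_states, n_actions))

    Q = np.zeros((n_states, n_actions))
    for _ in range(100):                        # outer, policy iteration-like loop
        J = Q.min(axis=1)                       # current state-cost estimates
        mu = Q.argmin(axis=1)                   # greedy policy (policy improvement)
        for _ in range(5):                      # inexact evaluation: a few sweeps
            cont = Q[np.arange(n_states), mu]   # cost of continuing with policy mu
            w = np.minimum(J, cont)             # either "stop" at J or continue with mu
            for u in range(n_actions):
                Q[:, u] = g[:, u] + alpha * P[u] @ w

    print("approximate optimal state costs:", Q.min(axis=1))

Because the stopping-style update is still a contraction for this discounted problem, the sketch converges to the optimal Q-factors even though each evaluation phase uses only a handful of sweeps; the paper develops asynchronous and stochastic (Q-learning-like) counterparts of this idea.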

Original language: English (US)
Pages (from-to): 66-94
Number of pages: 29
Journal: Mathematics of Operations Research
Volume: 37
Issue number: 1
DOIs
State: Published - Feb 2012
Externally published: Yes

Keywords

  • Dynamic programming
  • Markov decision processes
  • Q-learning
  • Policy iteration
  • Reinforcement learning
  • Stochastic approximation
  • Value iteration

ASJC Scopus subject areas

  • General Mathematics
  • Computer Science Applications
  • Management Science and Operations Research
