Q-learning and enhanced policy iteration in discounted dynamic programming

Research output: Chapter in Book/Report/Conference proceedingConference contribution

7 Scopus citations

Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations, in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99], in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead advantages over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm resolves effectively the inherent difficulties of existing schemes due to inadequate exploration.

Original languageEnglish (US)
Title of host publication2010 49th IEEE Conference on Decision and Control, CDC 2010
Pages1409-1416
Number of pages8
DOIs
StatePublished - 2010
Externally publishedYes
Event2010 49th IEEE Conference on Decision and Control, CDC 2010 - Atlanta, GA, United States
Duration: Dec 15 2010Dec 17 2010

Publication series

NameProceedings of the IEEE Conference on Decision and Control
ISSN (Print)0191-2216

Other

Other2010 49th IEEE Conference on Decision and Control, CDC 2010
CountryUnited States
CityAtlanta, GA
Period12/15/1012/17/10

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Modeling and Simulation
  • Control and Optimization

Fingerprint Dive into the research topics of 'Q-learning and enhanced policy iteration in discounted dynamic programming'. Together they form a unique fingerprint.

Cite this