Pathologies of temporal difference methods in approximate dynamic programming

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early 1990s, but their convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly analogous to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved. Using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.

Original language: English (US)
Title of host publication: 2010 49th IEEE Conference on Decision and Control, CDC 2010
Pages: 3034-3039
Number of pages: 6
State: Published - 2010
Externally published: Yes
Event: 2010 49th IEEE Conference on Decision and Control, CDC 2010 - Atlanta, GA, United States
Duration: Dec 15 2010 to Dec 17 2010

Publication series

Name: Proceedings of the IEEE Conference on Decision and Control
ISSN (Print): 0191-2216

Other

Other: 2010 49th IEEE Conference on Decision and Control, CDC 2010
Country: United States
City: Atlanta, GA
Period: 12/15/10 to 12/17/10

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Modeling and Simulation
  • Control and Optimization
