TY - GEN
T1 - Pathologies of temporal difference methods in approximate dynamic programming
AU - Bertsekas, Dimitri P.
PY - 2010
Y1 - 2010
N2 - Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand points was achieved, whereas using the same features and a random search method, an overwhelmingly better average score of 600,000-900,000 points was achieved. The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
AB - Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand points was achieved, whereas using the same features and a random search method, an overwhelmingly better average score of 600,000-900,000 points was achieved. The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
UR - http://www.scopus.com/inward/record.url?scp=79953145727&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953145727&partnerID=8YFLogxK
U2 - 10.1109/CDC.2010.5717644
DO - 10.1109/CDC.2010.5717644
M3 - Conference contribution
AN - SCOPUS:79953145727
SN - 9781424477456
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 3034
EP - 3039
BT - 2010 49th IEEE Conference on Decision and Control, CDC 2010
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE Conference on Decision and Control, CDC 2010
Y2 - 15 December 2010 through 17 December 2010
ER -