TY - GEN
T1 - Pathologies of temporal difference methods in approximate dynamic programming
AU - Bertsekas, Dimitri P.
PY - 2010
Y1 - 2010
N2 - Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand points was achieved, whereas using the same features and a random search method, an overwhelmingly better average score of 600,000-900,000 points was achieved. The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
AB - Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand points was achieved, whereas using the same features and a random search method, an overwhelmingly better average score of 600,000-900,000 points was achieved. The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
UR - http://www.scopus.com/inward/record.url?scp=79953145727&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953145727&partnerID=8YFLogxK
U2 - 10.1109/CDC.2010.5717644
DO - 10.1109/CDC.2010.5717644
M3 - Conference contribution
AN - SCOPUS:79953145727
SN - 9781424477456
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 3034
EP - 3039
BT - 2010 49th IEEE Conference on Decision and Control, CDC 2010
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE Conference on Decision and Control, CDC 2010
Y2 - 15 December 2010 through 17 December 2010
ER -