TY - JOUR
T1 - Approximate policy iteration
T2 - A survey and some new methods
AU - Bertsekas, Dimitri P.
N1 - Funding Information:
Received 7 January 2011. This work was supported by the National Science Foundation (No. ECCS-0801549), the LANL Information Science and Technology Institute, and the Air Force (No. FA9550-10-1-0412).
PY - 2011/8
Y1 - 2011/8
AB - We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
KW - Aggregation
KW - Chattering
KW - Dynamic programming
KW - Policy iteration
KW - Projected equation
KW - Regularization
UR - http://www.scopus.com/inward/record.url?scp=79960439729&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960439729&partnerID=8YFLogxK
U2 - 10.1007/s11768-011-1005-3
DO - 10.1007/s11768-011-1005-3
M3 - Article
AN - SCOPUS:79960439729
SN - 1672-6340
VL - 9
SP - 310
EP - 335
JO - Journal of Control Theory and Applications
JF - Journal of Control Theory and Applications
IS - 3
ER -