Lambda-Policy Iteration: A Review and a New Implementation

Dimitri P. Bertsekas

doi:10.1002/9781118453988.ch17

Research output: Chapter in Book/Report/Conference proceeding › Chapter

3 Scopus citations

Abstract

In this chapter, we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, in which each policy evaluation is done approximately, using a finite number of value iterations. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.
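To make the "intermediate between VI and PI" statement concrete, here is a minimal sketch of exact (non-simulation) λ-policy iteration. It is not taken from the chapter. It assumes a finite-state discounted MDP with cost minimization, and the function names, array layout, and the two-state example data are all made up for illustration. It uses the standard multistep mapping T_μ^(λ) J = (1 − λ) Σ_{ℓ≥0} λ^ℓ T_μ^{ℓ+1} J, which for T_μ J = g_μ + α P_μ J reduces to one linear solve.

```python
import numpy as np


def lambda_pi_step(P, g, alpha, lam, J):
    """One exact lambda-PI step (illustrative sketch, not the chapter's code).

    P: (A, S, S) transition probabilities, g: (A, S) expected stage costs,
    alpha: discount factor in (0, 1), lam: lambda in [0, 1].
    """
    num_states = J.shape[0]
    states = np.arange(num_states)
    # Policy improvement: mu is greedy with respect to the current J.
    Q = g + alpha * (P @ J)            # shape (A, S)
    mu = np.argmin(Q, axis=0)
    P_mu = P[mu, states]               # shape (S, S), row s is P[mu[s], s, :]
    g_mu = g[mu, states]               # shape (S,)
    # Policy evaluation with T_mu^(lam), in closed form:
    #   J_new = (I - lam*alpha*P_mu)^{-1} (g_mu + (1 - lam)*alpha*P_mu J)
    lhs = np.eye(num_states) - lam * alpha * P_mu
    rhs = g_mu + (1.0 - lam) * alpha * (P_mu @ J)
    return np.linalg.solve(lhs, rhs), mu


def lambda_policy_iteration(P, g, alpha, lam, num_iters=300):
    """Repeat lambda-PI steps from J = 0 for a fixed number of iterations."""
    J = np.zeros(P.shape[1])
    mu = None
    for _ in range(num_iters):
        J, mu = lambda_pi_step(P, g, alpha, lam, J)
    return J, mu


# Hypothetical two-state, two-action problem (numbers are arbitrary).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])
J, mu = lambda_policy_iteration(P, g, alpha=0.9, lam=0.5)
```

With `lam=0.0` a step is exactly one value iteration, and with `lam=1.0` it is an exact policy evaluation, i.e. a classical PI step. The simulation-based variants mentioned in the abstract, including geometric sampling, are not reproduced here.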

Original language	English (US)
Title of host publication	Reinforcement Learning and Approximate Dynamic Programming for Feedback Control
Publisher	John Wiley and Sons
Pages	379-409
Number of pages	31
ISBN (Print)	9781118104200
DOIs	https://doi.org/10.1002/9781118453988.ch17
State	Published - Feb 7 2013
Externally published	Yes

Keywords

DP for complex problems, λ-PI
LSTD(λ) batch, simple matrix inversion
MDP and RL, λ-policy iteration in DP
λ-PI without cost function, using geometric
λ-policy, a new implementation

ASJC Scopus subject areas

General Engineering

Cite this

@inbook{fe5a52e84ae143a2b3a429309631f66f,
  title = "Lambda-Policy Iteration: A Review and a New Implementation",
  abstract = "In this chapter, we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and the policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.",
  keywords = "DP for complex problems, λ-PI, LSTD(λ) batch, simple matrix inversion, MDP and RL, λ-policy iteration in DP, λ-PI without cost function, using geometric, λ-policy, a new implementation",
  author = "Bertsekas, {Dimitri P.}",
  year = "2013",
  month = feb,
  day = "7",
  doi = "10.1002/9781118453988.ch17",
  language = "English (US)",
  isbn = "9781118104200",
  pages = "379--409",
  booktitle = "Reinforcement Learning and Approximate Dynamic Programming for Feedback Control",
  publisher = "John Wiley and Sons",
}

TY - CHAP
T1 - Lambda-Policy Iteration
T2 - A Review and a New Implementation
AU - Bertsekas, Dimitri P.
PY - 2013/2/7
Y1 - 2013/2/7
N2 - In this chapter, we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and the policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.
AB - In this chapter, we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and the policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.
KW - DP for complex problems, λ-PI
KW - LSTD(λ) batch, simple matrix inversion
KW - MDP and RL, λ-policy iteration in DP
KW - λ-PI without cost function, using geometric
KW - λ-policy, a new implementation
UR - http://www.scopus.com/inward/record.url?scp=84886325771&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84886325771&partnerID=8YFLogxK
U2 - 10.1002/9781118453988.ch17
DO - 10.1002/9781118453988.ch17
M3 - Chapter
AN - SCOPUS:84886325771
SN - 9781118104200
SP - 379
EP - 409
BT - Reinforcement Learning and Approximate Dynamic Programming for Feedback Control
PB - John Wiley and Sons
ER -
