TY - JOUR
T1 - A performance gradient perspective on gradient-based policy iteration and a modified value iteration
AU - Yang, Lei
AU - Dankert, James
AU - Si, Jennie
PY - 2008/10/17
Y1 - 2008/10/17
N2 - Purpose – The purpose of this paper is to develop a mathematical framework that addresses algorithmic features of approximate dynamic programming (ADP) using an average cost formulation based on the concepts of differential costs and performance gradients. Under this framework, a modified value iteration algorithm is developed that is easy to implement while also addressing a class of partially observable Markov decision processes (POMDPs). Design/methodology/approach – Gradient-based policy iteration (GBPI) is a top-down, system-theoretic approach to dynamic optimization with performance guarantees. In this paper, a bottom-up, algorithmic view is provided to complement the original high-level development of GBPI. A modified value iteration is introduced that can solve the same type of POMDP problems handled by GBPI. Numerical simulations, including a queuing problem and a maze problem, illustrate and verify features of the proposed algorithms in comparison with GBPI. Findings – The direct connection between GBPI and policy iteration is shown under a Markov decision process formulation, yielding additional analytical insight into GBPI. Furthermore, motivated by this analytical framework, the authors propose a modified value iteration as an alternative for addressing the same POMDP problems handled by GBPI. Originality/value – Several important insights gained from the analytical framework motivate the development of both algorithms. Built on this paradigm, new ADP learning algorithms, such as the modified value iteration, can be developed to address a broader class of problems, namely POMDPs. In addition, it is now possible to give ADP algorithms a gradient perspective. Inspired by the fundamental understanding of learning and optimization problems under the gradient-based framework, additional new insight may be developed for bottom-up algorithms with performance guarantees.
AB - Purpose – The purpose of this paper is to develop a mathematical framework that addresses algorithmic features of approximate dynamic programming (ADP) using an average cost formulation based on the concepts of differential costs and performance gradients. Under this framework, a modified value iteration algorithm is developed that is easy to implement while also addressing a class of partially observable Markov decision processes (POMDPs). Design/methodology/approach – Gradient-based policy iteration (GBPI) is a top-down, system-theoretic approach to dynamic optimization with performance guarantees. In this paper, a bottom-up, algorithmic view is provided to complement the original high-level development of GBPI. A modified value iteration is introduced that can solve the same type of POMDP problems handled by GBPI. Numerical simulations, including a queuing problem and a maze problem, illustrate and verify features of the proposed algorithms in comparison with GBPI. Findings – The direct connection between GBPI and policy iteration is shown under a Markov decision process formulation, yielding additional analytical insight into GBPI. Furthermore, motivated by this analytical framework, the authors propose a modified value iteration as an alternative for addressing the same POMDP problems handled by GBPI. Originality/value – Several important insights gained from the analytical framework motivate the development of both algorithms. Built on this paradigm, new ADP learning algorithms, such as the modified value iteration, can be developed to address a broader class of problems, namely POMDPs. In addition, it is now possible to give ADP algorithms a gradient perspective. Inspired by the fundamental understanding of learning and optimization problems under the gradient-based framework, additional new insight may be developed for bottom-up algorithms with performance guarantees.
KW - Gradient methods
KW - Iterative methods
KW - Markov processes
KW - Programming and algorithm theory
UR - http://www.scopus.com/inward/record.url?scp=84986097621&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84986097621&partnerID=8YFLogxK
U2 - 10.1108/17563780810919096
DO - 10.1108/17563780810919096
M3 - Article
AN - SCOPUS:84986097621
SN - 1756-378X
VL - 1
SP - 509
EP - 520
JO - International Journal of Intelligent Computing and Cybernetics
JF - International Journal of Intelligent Computing and Cybernetics
IS - 4
ER -