TY - JOUR
T1 - A performance gradient perspective on gradient-based policy iteration and a modified value iteration
AU - Yang, Lei
AU - Dankert, James
AU - Si, Jennie
PY - 2008/10/17
Y1 - 2008/10/17
N2 - Purpose – The purpose of this paper is to develop a mathematical framework that addresses algorithmic features of approximate dynamic programming (ADP) using an average cost formulation based on the concepts of differential costs and performance gradients. Under this framework, a modified value iteration algorithm is developed that is easy to implement while also addressing a class of partially observable Markov decision processes (POMDPs). Design/methodology/approach – Gradient-based policy iteration (GBPI) is a top-down, system-theoretic approach to dynamic optimization with performance guarantees. In this paper, a bottom-up, algorithmic view is provided to complement the original high-level development of GBPI. A modified value iteration is introduced that can solve the same type of POMDP problems handled by GBPI. Numerical simulations, including a queuing problem and a maze problem, illustrate and verify features of the proposed algorithms in comparison with GBPI. Findings – The direct connection between GBPI and policy iteration is shown under a Markov decision process formulation, yielding additional analytical insight into GBPI. Furthermore, motivated by this analytical framework, the authors propose a modified value iteration as an alternative for addressing the same POMDP problems handled by GBPI. Originality/value – Several important insights gained from the analytical framework motivate the development of both algorithms. Built on this paradigm, new ADP learning algorithms, such as the modified value iteration, can be developed to address a broader class of problems, namely POMDPs. In addition, it is now possible to give ADP algorithms a gradient perspective. Inspired by the fundamental understanding of learning and optimization problems under the gradient-based framework, additional new insight may be developed for bottom-up algorithms with performance guarantees.
AB - Purpose – The purpose of this paper is to develop a mathematical framework that addresses algorithmic features of approximate dynamic programming (ADP) using an average cost formulation based on the concepts of differential costs and performance gradients. Under this framework, a modified value iteration algorithm is developed that is easy to implement while also addressing a class of partially observable Markov decision processes (POMDPs). Design/methodology/approach – Gradient-based policy iteration (GBPI) is a top-down, system-theoretic approach to dynamic optimization with performance guarantees. In this paper, a bottom-up, algorithmic view is provided to complement the original high-level development of GBPI. A modified value iteration is introduced that can solve the same type of POMDP problems handled by GBPI. Numerical simulations, including a queuing problem and a maze problem, illustrate and verify features of the proposed algorithms in comparison with GBPI. Findings – The direct connection between GBPI and policy iteration is shown under a Markov decision process formulation, yielding additional analytical insight into GBPI. Furthermore, motivated by this analytical framework, the authors propose a modified value iteration as an alternative for addressing the same POMDP problems handled by GBPI. Originality/value – Several important insights gained from the analytical framework motivate the development of both algorithms. Built on this paradigm, new ADP learning algorithms, such as the modified value iteration, can be developed to address a broader class of problems, namely POMDPs. In addition, it is now possible to give ADP algorithms a gradient perspective. Inspired by the fundamental understanding of learning and optimization problems under the gradient-based framework, additional new insight may be developed for bottom-up algorithms with performance guarantees.
KW - Gradient methods
KW - Iterative methods
KW - Markov processes
KW - Programming and algorithm theory
UR - http://www.scopus.com/inward/record.url?scp=84986097621&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84986097621&partnerID=8YFLogxK
U2 - 10.1108/17563780810919096
DO - 10.1108/17563780810919096
M3 - Article
AN - SCOPUS:84986097621
SN - 1756-378X
VL - 1
SP - 509
EP - 520
JO - International Journal of Intelligent Computing and Cybernetics
JF - International Journal of Intelligent Computing and Cybernetics
IS - 4
ER -