An analysis of gradient-based policy iteration

James Dankert, Lei Yang, Jennie Si

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recently, a system-theoretic framework for learning and optimization has been developed that shows how several approximate dynamic programming paradigms, such as perturbation analysis, Markov decision processes, and reinforcement learning, are closely related. Using this framework, a new optimization technique called gradient-based policy iteration (GBPI) has been developed. In this paper we show how GBPI can be extended to partially observable Markov decision processes (POMDPs). We also develop the value iteration analogue of GBPI and show that this new version of value iteration, extended to POMDPs, behaves like value iteration both theoretically and numerically.
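As background for the value-iteration analogue discussed in the abstract, the following is a minimal sketch of standard value iteration on a fully observable MDP. It is illustrative only and not the paper's GBPI algorithm: the transition matrix `P`, reward matrix `R`, and discount factor are invented toy values.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are invented for illustration).
# P[a][s, t] = probability of moving from state s to state t under action a.
# R[s, a]    = expected immediate reward for taking action a in state s.
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality backup:
    # Q[s, a] = R[s, a] + gamma * sum_t P[a][s, t] * V[t]
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```

Extending this backup to POMDPs, as the paper does for GBPI, requires operating on belief states (distributions over the hidden state) rather than on the states directly.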

Original language: English (US)
Title of host publication: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2005
Pages: 2977-2982
Number of pages: 6
DOI: 10.1109/IJCNN.2005.1556399
State: Published - Dec 1 2005
Event: International Joint Conference on Neural Networks, IJCNN 2005 - Montreal, QC, Canada
Duration: Jul 31 2005 - Aug 4 2005

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks
Volume: 5


ASJC Scopus subject areas

  • Software
  • Artificial Intelligence


Cite this

    Dankert, J., Yang, L., & Si, J. (2005). An analysis of gradient-based policy iteration. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2005 (pp. 2977-2982). [1556399] (Proceedings of the International Joint Conference on Neural Networks; Vol. 5). https://doi.org/10.1109/IJCNN.2005.1556399