Learning algorithms for Markov decision processes with average cost

Jinane Abounadi; Dimitri Bertsekas; V. S. Borkar

doi:10.1137/S0363012999361974

Research output: Contribution to journal, Article, peer-reviewed

124 Scopus citations

Abstract

This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains. We discuss two algorithms which may be viewed as stochastic approximation counterparts of two existing algorithms for recursively computing the value function of the average cost problem: the traditional relative value iteration (RVI) algorithm and a recent algorithm of Bertsekas based on the stochastic shortest path (SSP) formulation of the problem. Both synchronous and asynchronous implementations are considered and analyzed using the ODE method. This involves establishing asymptotic stability of the associated ODE limits. The SSP algorithm also uses ideas from two-time-scale stochastic approximation.
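The RVI algorithm named in the abstract has a simple Q-factor form, and its stochastic approximation counterpart replaces the expectation over the next state with a single simulated transition. The Python sketch below illustrates that contrast under assumptions that are ours, not the paper's: a synchronous implementation (every state-action pair is updated at each iteration), one-stage costs g(i, u) that do not depend on the next state, the offset f(Q) = Q(i0, u0) at a fixed reference pair, and the step size 1/n^0.7. The functions `exact_rvi` and `rvi_q_learning` and the two-state example model `G`, `P` are hypothetical; the example has strictly positive transition probabilities, so every stationary policy induces an irreducible, aperiodic chain. It is an illustration, not the authors' code.

```python
import random

# Hypothetical 2-state, 2-action example model (not from the paper).
# G[i][u] is the one-stage cost; P[i][u][j] is the probability of moving i -> j under u.
G = [[1.0, 2.0],
     [4.0, 3.0]]
P = [[[0.8, 0.2], [0.2, 0.8]],
     [[0.7, 0.3], [0.5, 0.5]]]


def exact_rvi(g, p, n_iter=2000, ref=(0, 0)):
    """Deterministic relative value iteration on Q-factors (uses the model p)."""
    S, A = len(g), len(g[0])
    Q = [[0.0] * A for _ in range(S)]
    offset = 0.0
    for _ in range(n_iter):
        # TQ(i,u) = g(i,u) + sum_j p(j|i,u) * min_v Q(j,v)
        TQ = [[g[i][u] + sum(p[i][u][j] * min(Q[j]) for j in range(S))
               for u in range(A)] for i in range(S)]
        offset = TQ[ref[0]][ref[1]]  # converges to the optimal average cost
        Q = [[TQ[i][u] - offset for u in range(A)] for i in range(S)]
    return Q, offset


def rvi_q_learning(g, p, n_iter=20000, ref=(0, 0), seed=0):
    """Synchronous RVI Q-learning sketch: p is used only to simulate transitions."""
    rng = random.Random(seed)
    S, A = len(g), len(g[0])
    states = list(range(S))
    Q = [[0.0] * A for _ in range(S)]
    for n in range(1, n_iter + 1):
        step = 1.0 / n ** 0.7     # assumed schedule: sum(step) = inf, sum(step**2) < inf
        f_Q = Q[ref[0]][ref[1]]   # assumed offset f(Q) = Q(i0, u0)
        new_Q = [row[:] for row in Q]
        for i in range(S):
            for u in range(A):
                # One simulated next state replaces the expectation over j.
                j = rng.choices(states, weights=p[i][u])[0]
                target = g[i][u] + min(Q[j]) - f_Q
                new_Q[i][u] = Q[i][u] + step * (target - Q[i][u])
        Q = new_Q  # all pairs updated from the same old iterate (synchronous)
    return Q


if __name__ == "__main__":
    _, beta = exact_rvi(G, P)
    Q = rvi_q_learning(G, P)
    policy = [min(range(len(row)), key=row.__getitem__) for row in Q]
    print(round(beta, 4))     # exact RVI value for this example (11/7)
    print(round(Q[0][0], 2))  # RVI Q-learning estimate of the same quantity
    print(policy)             # greedy policy from the learned Q-factors
```

For this example, enumerating the four stationary policies shows that taking action 0 in state 0 and action 1 in state 1 is optimal, with average cost 11/7 ≈ 1.5714, which is the value `exact_rvi` returns. The learned `Q[0][0]` approximates the same quantity, with an error that depends on `seed`, `n_iter`, and the step-size schedule.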

Original language	English (US)
Pages (from-to)	681-698
Number of pages	18
Journal	SIAM Journal on Control and Optimization
Volume	40
Issue number	3
DOIs	https://doi.org/10.1137/S0363012999361974
State	Published - 2002
Externally published	Yes

Keywords

Average cost control
Controlled Markov chains
Dynamic programming
Q-learning
Simulation-based algorithms
Stochastic approximation

ASJC Scopus subject areas

Control and Optimization
Applied Mathematics

Access to Document

10.1137/S0363012999361974

Cite this

@article{22a7cee824854ab48267bbcc24ceb1bd,
  title     = "Learning algorithms for {M}arkov decision processes with average cost",
  abstract  = "This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains. We discuss two algorithms which may be viewed as stochastic approximation counterparts of two existing algorithms for recursively computing the value function of the average cost problem: the traditional relative value iteration (RVI) algorithm and a recent algorithm of Bertsekas based on the stochastic shortest path (SSP) formulation of the problem. Both synchronous and asynchronous implementations are considered and analyzed using the ODE method. This involves establishing asymptotic stability of the associated ODE limits. The SSP algorithm also uses ideas from two-time-scale stochastic approximation.",
  keywords  = "Average cost control, Controlled Markov chains, Dynamic programming, Q-learning, Simulation-based algorithms, Stochastic approximation",
  author    = "Jinane Abounadi and Dimitri Bertsekas and Borkar, {V. S.}",
  year      = "2002",
  doi       = "10.1137/S0363012999361974",
  language  = "English (US)",
  volume    = "40",
  pages     = "681--698",
  journal   = "SIAM Journal on Control and Optimization",
  issn      = "0363-0129",
  publisher = "Society for Industrial and Applied Mathematics Publications",
  number    = "3",
}

TY  - JOUR
T1  - Learning algorithms for Markov decision processes with average cost
AU  - Abounadi, Jinane
AU  - Bertsekas, Dimitri
AU  - Borkar, V. S.
PY  - 2002
Y1  - 2002
N2  - This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains. We discuss two algorithms which may be viewed as stochastic approximation counterparts of two existing algorithms for recursively computing the value function of the average cost problem: the traditional relative value iteration (RVI) algorithm and a recent algorithm of Bertsekas based on the stochastic shortest path (SSP) formulation of the problem. Both synchronous and asynchronous implementations are considered and analyzed using the ODE method. This involves establishing asymptotic stability of the associated ODE limits. The SSP algorithm also uses ideas from two-time-scale stochastic approximation.
AB  - This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains. We discuss two algorithms which may be viewed as stochastic approximation counterparts of two existing algorithms for recursively computing the value function of the average cost problem: the traditional relative value iteration (RVI) algorithm and a recent algorithm of Bertsekas based on the stochastic shortest path (SSP) formulation of the problem. Both synchronous and asynchronous implementations are considered and analyzed using the ODE method. This involves establishing asymptotic stability of the associated ODE limits. The SSP algorithm also uses ideas from two-time-scale stochastic approximation.
KW  - Average cost control
KW  - Controlled Markov chains
KW  - Dynamic programming
KW  - Q-learning
KW  - Simulation-based algorithms
KW  - Stochastic approximation
UR  - http://www.scopus.com/inward/record.url?scp=0036287773&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0036287773&partnerID=8YFLogxK
U2  - 10.1137/S0363012999361974
DO  - 10.1137/S0363012999361974
M3  - Article
AN  - SCOPUS:0036287773
SN  - 0363-0129
VL  - 40
SP  - 681
EP  - 698
JO  - SIAM Journal on Control and Optimization
JF  - SIAM Journal on Control and Optimization
IS  - 3
ER  - 
