TY - JOUR
T1 - Joint inference of reward machines and policies for reinforcement learning
AU - Xu, Zhe
AU - Gavran, Ivan
AU - Ahmad, Yousef
AU - Majumdar, Rupak
AU - Neider, Daniel
AU - Topcu, Ufuk
AU - Wu, Bo
N1 - Funding Information:
This work is supported in part by grants DFG 389792660-TRR 248, DFG 434592664, ERC 610150, DARPA D19AP00004, ONR N000141712623, and NASA 80NSSC19K0209.
Publisher Copyright:
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2020/5/29
Y1 - 2020/5/29
AB - Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem in which the high-level knowledge takes the form of reward machines, a type of Mealy machine that encodes non-Markovian reward functions. We focus on a setting in which this knowledge is not available to the learning agent a priori. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It keeps a separate q-function for each state of the current hypothesis reward machine, uses these q-functions to determine the policy, and performs RL to update them. While performing RL, the algorithm updates the sample by adding episodes along which the obtained rewards are inconsistent with the rewards predicted by the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation between states of reward machines, we transfer the q-functions between the hypothesis reward machines of consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit. The experiments show that learning high-level knowledge in the form of reward machines leads to fast convergence to optimal policies in RL, whereas the baseline RL methods fail to converge to optimal policies even after a substantial number of training steps.
UR - http://www.scopus.com/inward/record.url?scp=85088515869&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088515869&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85088515869
SN - 2334-0835
VL - 30
SP - 590
EP - 598
JO - Proceedings of the International Conference on Automated Planning and Scheduling, ICAPS
JF - Proceedings of the International Conference on Automated Planning and Scheduling, ICAPS
T2 - 30th International Conference on Automated Planning and Scheduling, ICAPS 2020
Y2 - 26 October 2020 through 30 October 2020
ER -
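
The abstract above describes an iterative loop: run q-learning under a hypothesis reward machine, collect episodes whose rewards contradict the hypothesis, re-infer the reward machine, and transfer q-functions between equivalent states. The following Python sketch illustrates that loop under stated assumptions; it is not the authors' published implementation. The objects env (with reset()/step() returning (state, label, reward, done) and an actions list), rm (a Mealy-style reward machine with initial_state and transition(u, label) -> (next_state, reward)), infer_reward_machine, and equivalent_states are all hypothetical stand-ins for the environment, the automaton-inference step, and the state-equivalence check.

import random
from collections import defaultdict

def q_learning_with_hypothesis(env, rm, q, episodes, alpha=0.1, gamma=0.9, eps=0.1):
    """Run q-learning with one q-function per reward-machine state; collect
    traces whose observed rewards disagree with the hypothesis (counterexamples)."""
    counterexamples = []
    for _ in range(episodes):
        s, u = env.reset(), rm.initial_state
        trace, consistent, done = [], True, False
        while not done:
            # epsilon-greedy action selection over the q-function for RM state u
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: q[u][(s, b)])
            s2, label, r, done = env.step(a)
            u2, r_hat = rm.transition(u, label)  # Mealy transition: next state, predicted reward
            consistent = consistent and (r_hat == r)
            trace.append((label, r))
            # bootstrap from the q-function attached to the *next* RM state
            boot = 0.0 if done else max(q[u2][(s2, b)] for b in env.actions)
            q[u][(s, a)] += alpha * (r + gamma * boot - q[u][(s, a)])
            s, u = s2, u2
        if not consistent:
            counterexamples.append(trace)  # episode contradicts the hypothesis
    return counterexamples

def transfer_q(q, equivalent_pairs):
    """Carry q-functions over to the new hypothesis for states judged equivalent."""
    new_q = defaultdict(lambda: defaultdict(float))
    for u_old, u_new in equivalent_pairs:
        new_q[u_new] = q[u_old].copy()
    return new_q

def jirp(env, infer_reward_machine, equivalent_states, iterations=10, episodes=500):
    """Iterate: run RL under the current hypothesis reward machine, gather
    inconsistent episodes, re-infer the reward machine, transfer q-functions."""
    sample = []                                  # episodes inconsistent so far
    rm = infer_reward_machine(sample)            # initial (trivial) hypothesis
    q = defaultdict(lambda: defaultdict(float))  # q[u][(s, a)]
    for _ in range(iterations):
        new_cex = q_learning_with_hypothesis(env, rm, q, episodes)
        if not new_cex:
            break                                # hypothesis explains all rewards seen
        sample.extend(new_cex)
        new_rm = infer_reward_machine(sample)    # infer a new hypothesis from the sample
        q = transfer_q(q, equivalent_states(rm, new_rm))
        rm = new_rm
    return rm, q

The key design choice mirrored here is keeping a separate q-function per reward-machine state: the product of environment state and RM state is Markovian even though the reward function alone is not, which is what makes ordinary q-learning applicable.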