TY - GEN
T1 - Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples
AU - Xu, Zhe
AU - Wu, Bo
AU - Ojha, Aditya
AU - Neider, Daniel
AU - Topcu, Ufuk
N1 - Funding Information:
Acknowledgment. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0032, ARL W911NF2020132, ARL ACC-APG-RTP W911NF, NSF 1646522, and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant no. 434592664. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
Publisher Copyright:
© 2021, IFIP International Federation for Information Processing.
PY - 2021
Y1 - 2021
AB - Although deep reinforcement learning (RL) has surpassed human-level performance in various tasks, it still faces several fundamental challenges. First, most RL methods require extensive data from exploration of the environment to achieve satisfactory performance. Second, the use of neural networks in RL makes it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward function with finitely many reward values and can approximate any non-Markovian bounded reward function (with infinitely many reward values) to arbitrary precision. We also provide a lower bound on the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We test this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward Machines and Policies for RL (JIRP), Learning Reward Machine (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than the baseline methods.
UR - http://www.scopus.com/inward/record.url?scp=85115186677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115186677&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-84060-0_8
DO - 10.1007/978-3-030-84060-0_8
M3 - Conference contribution
AN - SCOPUS:85115186677
SN - 9783030840594
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 115
EP - 135
BT - Machine Learning and Knowledge Extraction - 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2021, Proceedings
A2 - Holzinger, Andreas
A2 - Kieseberg, Peter
A2 - Tjoa, A Min
A2 - Weippl, Edgar
PB - Springer Science and Business Media Deutschland GmbH
T2 - 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference on Machine Learning and Knowledge Extraction, CD-MAKE 2021
Y2 - 17 August 2021 through 20 August 2021
ER -