MODEL-BASED OFFLINE META-REINFORCEMENT LEARNING WITH REGULARIZATION

Sen Lin; Jialin Wan; Tengyu Xu; Yingbin Liang; Junshan Zhang

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Contribution to conference › Paper › peer-review

4 Scopus citations

Abstract

Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we propose model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.

Original language	English (US)
State	Published - 2022
Event	10th International Conference on Learning Representations, ICLR 2022 - Virtual, Online; Duration: Apr 25 2022 → Apr 29 2022

Conference

Conference	10th International Conference on Learning Representations, ICLR 2022
City	Virtual, Online
Period	4/25/22 → 4/29/22

ASJC Scopus subject areas

Language and Linguistics
Computer Science Applications
Education
Linguistics and Language

Cite this

@conference{de6efa41107e40858f223d1bba84ac4d,

title = "MODEL-BASED OFFLINE META-REINFORCEMENT LEARNING WITH REGULARIZATION",

abstract = "Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we propose model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.",

author = "Sen Lin and Jialin Wan and Tengyu Xu and Yingbin Liang and Junshan Zhang",

note = "Funding Information: This work is supported in part by NSF Grants CNS-2003081, CNS-2203239, CPS-1739344, and CCSS-2121222. Publisher Copyright: {\textcopyright} 2022 ICLR 2022 - 10th International Conference on Learning Representations. All rights reserved.; 10th International Conference on Learning Representations, ICLR 2022 ; Conference date: 25-04-2022 Through 29-04-2022",

year = "2022",

language = "English (US)",

}

TY - CONF

T1 - MODEL-BASED OFFLINE META-REINFORCEMENT LEARNING WITH REGULARIZATION

AU - Lin, Sen

AU - Wan, Jialin

AU - Xu, Tengyu

AU - Liang, Yingbin

AU - Zhang, Junshan

N1 - Funding Information: This work is supported in part by NSF Grants CNS-2003081, CNS-2203239, CPS-1739344, and CCSS-2121222. Publisher Copyright: © 2022 ICLR 2022 - 10th International Conference on Learning Representations. All rights reserved.

PY - 2022

Y1 - 2022

AB - Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we propose model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.

UR - http://www.scopus.com/inward/record.url?scp=85146900323&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85146900323&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85146900323

T2 - 10th International Conference on Learning Representations, ICLR 2022

Y2 - 25 April 2022 through 29 April 2022

ER -
