Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima*   Hiroki Furuta*
The University of Tokyo
{matsushima, furuta}@weblab.t.u-tokyo.ac.jp

Yutaka Matsuo
The University of Tokyo
matsuo@weblab.t.u-tokyo.ac.jp

Ofir Nachum
Google Research
ofirnachum@google.com

Shixiang Shane Gu
Google Research
shanegu@google.com

*Equal contribution.
Abstract

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that can effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.

1 Introduction

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks [3, 24, 42]. Virtually all of these demonstrations have relied on highly-frequent online access to the environment, with the RL algorithms often interleaving each update to the policy with additional experience collection of that policy acting in the environment. However, in many real-world applications of RL, such as health [40], education [37], dialog agents [26], and robotics [19, 27], the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data collection policies, we can substantially reduce these costs and risks.

Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency [46, 10, 20, 22, 34, 41], which measures the number of environment interactions incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when data efficiency is high, deployment efficiency can be low, since many on-policy and off-policy algorithms alternate data collection with each policy update [52, 34, 18, 22]. Such dependence on high-frequency policy deployments is best illustrated by recent work in offline RL [16, 26, 30, 33, 61], where baseline off-policy algorithms exhibit poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough for achieving high performance with data collected from a sub-optimal behavior policy, but often not from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch with minimal amounts of data and deployments.

Many existing model-free offline RL algorithms [33] are tuned and evaluated on large datasets (e.g., one million transitions). In order to develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller dataset sizes. We believe model-based RL is better suited to this setting due to its higher demonstrated sample efficiency than model-free RL [31, 43]. Although the combination of model-based RL and offline or limited-deployment settings seems straight-forward, we find this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors [16] similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions which lead it to regions of the state space where the dynamics model cannot predict properly, due to poor coverage of the dataset. This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by regularizing policy updates explicitly with a divergence from the observed data distribution [26, 30, 61], which, however, can overly limit policies’ expressivity [57].

In order to better approach these problems arising in limited deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on high-dimensional continuous control benchmarks and find that it achieves impressive deployment efficiency. BREMEN is able to learn successful policies with only 5-10 deployments, significantly outperforming existing off-policy and offline RL algorithms in this deployment-constrained setting. We further evaluate BREMEN on standard offline RL benchmarks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain.

Figure 1: Deployment efficiency is defined as the number of changes in the data-collection policy ($I$), which is vital for managing the costs and risks of new policy deployment. Online RL algorithms typically require many iterations of policy deployment and data collection, which leads to extremely low deployment efficiency. In contrast, most pure offline algorithms consider updating a policy from a fixed dataset without additional deployment and often fail to learn from a randomly initialized data-collection policy. Interestingly, most state-of-the-art off-policy algorithms are still evaluated in heavily online settings. For example, SAC [22] collects one sample per policy update, amounting to 100,000 to 1 million deployments for learning standard benchmark domains.

2 Preliminaries

We consider a Markov Decision Process (MDP) setting, characterized by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},p,r,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s'|s,a)$ is the transition probability distribution or dynamics, $r(s)$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. A policy $\pi$ is a function that determines the agent's behavior, mapping from states to probability distributions over actions. The goal is to obtain the optimal policy $\pi^{\ast}$ as

\[\pi^{\ast}=\operatorname*{arg\,max}_{\pi}\eta[\pi]=\operatorname*{arg\,max}_{\pi}\,\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right],\]

where $\eta[\pi]$ is the expectation of the discounted sum of rewards under the policy $\pi$. The transition probability $p(s'|s,a)$ is usually unknown, and in model-based RL it is estimated with a parameterized dynamics model $f_{\phi}$ (e.g., a neural network). For simplicity, we assume that the reward function $r(s)$ is known and that the reward can be computed for any arbitrary state, but we can easily extend to the unknown setting by predicting the reward with a parameterized function.

On-policy vs Off-policy, Online vs Offline   At a high level, most RL algorithms iterate many times between collecting a batch of transitions (deployments) and optimizing the policy (learning). If an algorithm discards data after each policy update, it is on-policy [52, 53]; if it accumulates data in a buffer $\mathcal{D}$, i.e., experience replay [35], it is off-policy [39, 34, 18, 20, 22, 16], because not all the data in the buffer come from the current policy. However, we consider all these algorithms to be online RL algorithms, since they involve many deployments during learning, ranging from hundreds to millions. On the other hand, in pure offline RL, one does not assume direct interaction and learns a policy from only a fixed dataset, which effectively corresponds to a single deployment allowed for learning. Classically, interpolating these two extremes were semi-batch RL algorithms [32, 56], which improve the policy through repetitions of collecting a large batch of transitions $\mathcal{D}=\{(s,a,s',r)\}$ and performing many or full policy updates. While such semi-batch RL algorithms also achieve good deployment efficiency, they have not been extensively studied with neural network function approximators or in off-policy settings with experience replay for scalable sample-efficient learning. In our work, we aim for both high deployment efficiency and sample efficiency by developing an algorithm that can solve tasks with minimal policy deployments as well as minimal transition samples.

3 Deployment Efficiency

Deploying a new policy for data collection can be associated with a number of costs and risks for many real-world applications like medicine or robotic control [40, 37, 19, 27, 42]. While there is an abundance of works on safety for RL [5, 15, 6, 49, 7], these methods often do not provide guarantees in practice when combined with neural networks and stochastic optimization. It is therefore necessary to validate each policy before deployment. Due to the cost associated with each deployment, it is desirable to minimize the number of distinct deployments needed during the learning process.

In order to focus research on these practical bottlenecks, we propose a novel measure of RL algorithms, namely, deployment efficiency, which counts how many times the data-collection policy is changed while improving from a random policy to one that solves the task. For example, if an RL algorithm operates by using its learned policy to collect transitions from the environment $I$ times, each time collecting a batch of $B$ new transitions, then the number of deployments is $I$, while the total number of samples collected is $I\times B$. The lower $I$ is, the more deployment-efficient the algorithm is; in contrast, sample efficiency looks at $I\times B$. Online RL algorithms, whether they are on-policy or off-policy, typically update the policy and acquire new transitions by deploying the newly updated policy at every iteration. This corresponds to performing hundreds to millions of deployments during learning on standard benchmarks [22], which is severely deployment-inefficient. On the other hand, the offline RL literature only studies the case of a single deployment. A deployment-efficient algorithm would stand in the middle of these two extremes and ideally learn a successful policy from scratch while deploying only a few distinct policies, as illustrated in Figure 1.
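
To make the two quantities concrete, the following minimal Python loop is an illustrative sketch only (not BREMEN itself); `collect_batch` and `improve_offline` are hypothetical stand-ins for real-environment data collection with a fixed policy and for offline policy improvement.

```python
# Generic deployment-constrained loop: I counts deployments, I * B counts environment samples.
# `collect_batch` and `improve_offline` are hypothetical stand-ins for illustration only.
import random


def collect_batch(policy, B):
    # Stand-in for running one *fixed* policy in the real environment for B steps.
    return [(random.random(), policy(random.random())) for _ in range(B)]


def improve_offline(policy, dataset):
    # Stand-in for arbitrarily many offline policy updates on the data collected so far.
    return policy


def run(I=5, B=1_000):  # the paper's deployment-efficient settings use B = 100k or 200k
    policy = lambda s: 0.0              # randomly initialized data-collection policy
    dataset = []
    for _ in range(I):                  # deployment efficiency counts these iterations
        dataset += collect_batch(policy, B)
        policy = improve_offline(policy, dataset)
    print(f"deployments I = {I}, environment samples I * B = {I * B}")
    return policy


run()
```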

Recent deep RL literature seldom emphasizes deployment efficiency, with few exceptions in specific applications [27] where such a learning procedure is necessary. Although current state-of-the-art algorithms on continuous control have substantially improved sample or data efficiency, they have not been optimized for deployment efficiency. For example, SAC [22], an efficient model-free off-policy algorithm, performs half a million to one million policy deployments during learning on MuJoCo [59] benchmarks. ME-TRPO [31], a model-based algorithm, requires far fewer, roughly 100-300, policy deployments, although this is still relatively high for practical settings (we determined the number of deployments from the original implementations; the frequency of data collection is a tunable hyper-parameter). In our work, we demonstrate successful learning on standard benchmark environments with only 5-10 deployments.

4 Behavior-Regularized Model-Ensemble

To achieve high deployment efficiency, we propose Behavior-Regularized Model-ENsemble (BREMEN). BREMEN incorporates Dyna-style [58] model-based RL, learning an ensemble of dynamics models in conjunction with a policy using imaginary rollouts from the ensemble and behavior regularization via conservative trust-region updates.

4.1 Imaginary Rollout from Model Ensemble

As in recent Dyna-style model-based RL methods [31, 60], BREMEN uses an ensemble of $K$ deterministic dynamics models $\hat{f}_{\phi}=\{\hat{f}_{\phi_{1}},\dots,\hat{f}_{\phi_{K}}\}$ to alleviate the problem of model bias. Each model $\hat{f}_{\phi_{i}}$ is parameterized by $\phi_{i}$ and trained by the following objective, which minimizes the mean squared error between the predicted next state $\hat{f}_{\phi_{i}}(s_{t},a_{t})$ and the true next state $s_{t+1}$ over a dataset $\mathcal{D}$:

\[\min_{\phi_{i}}\;\frac{1}{|\mathcal{D}|}\sum_{(s_{t},a_{t},s_{t+1})\in\mathcal{D}}\frac{1}{2}\left\|s_{t+1}-\hat{f}_{\phi_{i}}(s_{t},a_{t})\right\|_{2}^{2}. \quad (1)\]
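
For concreteness, the following is a minimal PyTorch-style sketch (not the authors' released implementation) of training the ensemble with the objective of Eq. 1; the hidden width of 1,024 and the learning rate follow Appendix B.1, while the data shapes and training schedule are assumptions.

```python
import torch
import torch.nn as nn


def make_dynamics_model(state_dim, action_dim, hidden=1024):
    # Deterministic model f_phi_i(s, a) -> s' with two hidden layers (Appendix B.1).
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )


def train_ensemble(models, dataset, epochs=5, lr=1e-3, batch_size=256):
    """Minimize the MSE of Eq. 1 independently for each ensemble member."""
    s, a, s_next = dataset  # tensors of shape (N, state_dim), (N, action_dim), (N, state_dim)
    for f in models:
        opt = torch.optim.Adam(f.parameters(), lr=lr)
        for _ in range(epochs):
            for idx in torch.randperm(s.shape[0]).split(batch_size):  # fresh shuffle per member/epoch
                pred = f(torch.cat([s[idx], a[idx]], dim=-1))
                loss = 0.5 * ((s_next[idx] - pred) ** 2).sum(dim=-1).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
    return models


# Example with K = 5 models on random placeholder data.
K, state_dim, action_dim = 5, 17, 6
models = [make_dynamics_model(state_dim, action_dim) for _ in range(K)]
data = (torch.randn(1024, state_dim), torch.randn(1024, action_dim), torch.randn(1024, state_dim))
train_ensemble(models, data, epochs=1)
```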

During training of a policy $\pi_{\theta}$, imagined trajectories of states and actions are generated sequentially, using a dynamics model $\hat{f}_{\phi_{i}}$ that is randomly selected at each time step:

\[a_{t}\sim\pi_{\theta}(\cdot|\hat{s}_{t}),\qquad \hat{s}_{t+1}=\hat{f}_{\phi_{i}}(\hat{s}_{t},a_{t})\quad\text{where}\quad i\sim\{1,\dots,K\}. \quad (2)\]
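
The rollout procedure of Eq. 2 can then be sketched as follows (again schematic Python, not the released code); the placeholder models and the noisy tanh-mean policy at the end are assumptions, with the noise scale taken from Appendix B.2.1.

```python
import random
import torch
import torch.nn as nn


@torch.no_grad()
def imaginary_rollout(models, policy, init_states, horizon):
    """Eq. 2: sample a_t from pi_theta, then step a model chosen uniformly at random each step."""
    states, actions = [init_states], []
    s = init_states
    for _ in range(horizon):
        a = policy(s)                      # a_t ~ pi_theta(.|s_t)
        f = random.choice(models)          # i ~ {1, ..., K}, re-drawn at every time step
        s = f(torch.cat([s, a], dim=-1))   # s_{t+1} = f_phi_i(s_t, a_t)
        states.append(s)
        actions.append(a)
    return torch.stack(states), torch.stack(actions)


# Usage with placeholder linear models and a noisy tanh-mean policy (sigma = 0.1).
state_dim, action_dim, K = 17, 6, 5
models = [nn.Linear(state_dim + action_dim, state_dim) for _ in range(K)]
mean_net = nn.Linear(state_dim, action_dim)
policy = lambda s: torch.tanh(mean_net(s)) + 0.1 * torch.randn(s.shape[0], action_dim)
states, actions = imaginary_rollout(models, policy, init_states=torch.randn(8, state_dim), horizon=250)
```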

4.2 Policy Update with Behavior Regularization

In order to manage the discrepancy between the true dynamics and the learned model caused by the distribution shift in batch settings, we propose to use iterative policy updates via a trust-region constraint, re-initialized with a behavior-cloned policy after every deployment. Specifically, after each deployment, we are given an updated dataset of experience transitions $\mathcal{D}$. With this dataset, we approximate the true behavior policy $\pi_{b}$ through behavior cloning (BC), utilizing a neural network $\hat{\pi}_{\beta}$ parameterized by $\beta$, where we implicitly assume a fixed variance, a common practice in BC [47]:

\[\min_{\beta}\;\frac{1}{|\mathcal{D}|}\sum_{(s_{t},a_{t})\in\mathcal{D}}\frac{1}{2}\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}. \quad (3)\]
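
A minimal sketch of this BC step and the subsequent Gaussian initialization is shown below; the tanh activations and full-batch updates are assumptions, while the hidden width (200) and learning rate (0.0005) follow Appendix B.1.

```python
import torch
import torch.nn as nn


def behavior_clone(states, actions, hidden=200, epochs=100, lr=5e-4):
    """Fit the BC mean network of Eq. 3; a fixed unit variance is assumed, as in the text."""
    pi_beta = nn.Sequential(
        nn.Linear(states.shape[-1], hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, actions.shape[-1]),
    )
    opt = torch.optim.Adam(pi_beta.parameters(), lr=lr)
    for _ in range(epochs):  # full-batch gradient steps, for brevity
        loss = 0.5 * ((actions - pi_beta(states)) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pi_beta


# pi_theta_0 = Normal(pi_beta, 1): copy the BC network as the mean and set the log-std to log(1) = 0.
pi_beta = behavior_clone(torch.randn(256, 17), torch.randn(256, 6))
```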

After obtaining the estimated behavior policy, we initialize the target policy $\pi_{\theta}$ as a Gaussian policy with mean from $\hat{\pi}_{\beta}$ and standard deviation of $1$. This BC initialization in conjunction with gradient-descent-based optimization may be seen as implicitly biasing the optimized $\pi_{\theta}$ to be close to the data-collection policy [44], and thus works as a remedy for the distribution shift problem [50]. To further bias the learned policy to be close to the data-collection policy, we opt to use a KL-based trust-region optimization [52]. Therefore, the optimization of BREMEN becomes

\[\theta_{k+1} = \operatorname*{arg\,max}_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{k}},\hat{f}_{\phi_{i}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{k}}(a|s)}A^{\pi_{\theta_{k}}}(s,a)\right] \quad (4)\]
\[\text{s.t.}\quad \mathbb{E}_{s\sim\pi_{\theta_{k}},\hat{f}_{\phi_{i}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot|s)\,\|\,\pi_{\theta_{k}}(\cdot|s)\right)\right]\leq\delta,\qquad \pi_{\theta_{0}}=\mathrm{Normal}(\hat{\pi}_{\beta},1),\]

where $A^{\pi_{\theta_{k}}}(s,a)$ is the advantage of $\pi_{\theta_{k}}$ computed using model-based rollouts in the learned dynamics model and $\delta$ is the maximum step size.
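
As a schematic illustration of Eq. 4, the sketch below enforces the KL constraint with a plain gradient step followed by a backtracking line search; the actual implementation uses TRPO's natural-gradient machinery [52], which is omitted here, and the Gaussian policy class is an assumed stand-in (output squashing is omitted for simplicity).

```python
import copy
import torch
import torch.nn as nn
import torch.distributions as D


class GaussianPolicy(nn.Module):
    """Gaussian policy with an MLP mean and a global log-std (initial std = 1, matching pi_theta_0)."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # std = 1 at initialization

    def dist(self, s):
        return D.Normal(self.mean_net(s), self.log_std.exp())


def trust_region_step(policy, old_policy, states, actions, advantages, delta=0.05, init_step=1.0):
    """One KL-constrained update of Eq. 4, enforced by backtracking instead of a natural gradient."""
    with torch.no_grad():
        old_logp = old_policy.dist(states).log_prob(actions).sum(-1)
    ratio = (policy.dist(states).log_prob(actions).sum(-1) - old_logp).exp()
    surrogate = (ratio * advantages).mean()
    grads = torch.autograd.grad(surrogate, list(policy.parameters()))

    backup = [p.detach().clone() for p in policy.parameters()]
    step = init_step
    for _ in range(10):  # halve the step until the empirical KL constraint is satisfied
        with torch.no_grad():
            for p, g, b in zip(policy.parameters(), grads, backup):
                p.copy_(b + step * g)  # ascend the surrogate objective
            kl = D.kl_divergence(policy.dist(states), old_policy.dist(states)).sum(-1).mean()
        if kl <= delta:
            return
        step *= 0.5
    with torch.no_grad():  # no feasible step found: revert to the previous iterate
        for p, b in zip(policy.parameters(), backup):
            p.copy_(b)


# Usage with placeholder rollout data.
policy = GaussianPolicy(17, 6)
old_policy = copy.deepcopy(policy)
s, a, adv = torch.randn(512, 17), torch.randn(512, 6), torch.randn(512)
trust_region_step(policy, old_policy, s, a, adv)
```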

The combination of BC for initialization and finite iterative trust-region updates serves as an implicit KL regularization, as discussed in Section 4.3. This is in contrast to many previous offline RL algorithms that augment the value function with a penalty of explicit KL divergence [54, 61] or maximum mean discrepancy [30]. Empirically, we found that our regularization technique outperforms the explicit KL penalty (see Section 5.3).

By recursively applying this offline procedure, BREMEN can be used for deployment-efficient learning as shown in Algorithm 1: starting from a randomly initialized policy, it alternates between collecting experience data and performing offline policy updates.

Algorithm 1 BREMEN for Deployment-Efficient RL
0:  Empty datasets $\mathcal{D}_{all}$ and $\mathcal{D}$; initial parameters $\phi=\{\phi_{1},\dots,\phi_{K}\}$, $\beta$; number of policy optimization steps $T$; number of deployments $I$.
1:  Randomly initialize the target policy $\pi_{\theta}$.
2:  for deployment $i=1,\dots,I$ do
3:     Collect $B$ transitions in the true environment using $\pi_{\theta}$; set $\mathcal{D}\leftarrow\{(s_{t},a_{t},r_{t},s_{t+1})\}$ and add them to the dataset, $\mathcal{D}_{all}\leftarrow\mathcal{D}_{all}\cup\mathcal{D}$.
4:     Train $K$ dynamics models $\hat{f}_{\phi}$ on $\mathcal{D}_{all}$ via Eq. 1.
5:     Train the estimated behavior policy $\hat{\pi}_{\beta}$ on $\mathcal{D}$ by behavior cloning via Eq. 3.
6:     Re-initialize the target policy $\pi_{\theta_{0}}=\mathrm{Normal}(\hat{\pi}_{\beta},1)$.
7:     for policy optimization $k=1,\dots,T$ do
8:        Generate imaginary rollouts via Eq. 2.
9:        Optimize the target policy $\pi_{\theta}$ under Eq. 4 using the rollouts.
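
Putting the pieces together, the following schematic loop mirrors Algorithm 1 and reuses the sketches from Sections 4.1 and 4.2 (assumed to be in scope); `true_env_collect` is a hypothetical stand-in for real-environment data collection, and the advantage estimate is a placeholder where the actual implementation uses GAE with a linear-feature value function (Appendix B.1).

```python
import copy
import torch


def bremen(true_env_collect, state_dim, action_dim, I=5, B=200_000, T=2_000, K=5,
           delta=0.05, horizon=250, sigma=0.1):
    """Schematic BREMEN outer loop (Algorithm 1), composed from the sketches above."""
    policy = GaussianPolicy(state_dim, action_dim)          # line 1: random initial policy
    D_all = None
    for _ in range(I):                                      # line 2: one iteration per deployment
        batch = true_env_collect(policy, B)                 # line 3: (s, a, s') tensors from the env
        D_all = batch if D_all is None else tuple(torch.cat(x) for x in zip(D_all, batch))
        models = train_ensemble(                            # line 4: fit the ensemble on D_all
            [make_dynamics_model(state_dim, action_dim) for _ in range(K)], D_all)
        pi_beta = behavior_clone(batch[0], batch[1])        # line 5: BC on the latest batch D
        policy = GaussianPolicy(state_dim, action_dim)      # line 6: pi_theta_0 = Normal(pi_beta, 1)
        policy.mean_net.load_state_dict(pi_beta.state_dict())
        for _ in range(T):                                  # lines 7-9: offline policy optimization
            old = copy.deepcopy(policy)
            noisy = lambda s: policy.dist(s).mean + sigma * torch.randn(s.shape[0], action_dim)
            states, actions = imaginary_rollout(models, noisy, batch[0][:50], horizon)
            adv = torch.randn(horizon, states.shape[1])     # placeholder: GAE advantages omitted
            trust_region_step(policy, old,
                              states[:-1].reshape(-1, state_dim),
                              actions.reshape(-1, action_dim),
                              adv.reshape(-1), delta=delta)
    return policy
```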

4.3 Implicit KL Control from a Mathematical Perspective

We can intuitively understand that behavior cloning initialization with trust-region updates works as a regularization against distributional shift, and this can be supported by theory. Following the notation of Janner et al. [25], we denote the generalization error of a dynamics model on the state distribution under the true behavior policy as $\epsilon_{m}=\max_{t}\mathbb{E}_{s\sim d^{\pi_{b}}_{t}}D_{TV}\big(p(s_{t+1}|s_{t},a_{t})\,\|\,p_{\phi}(s_{t+1}|s_{t},a_{t})\big)$, where $D_{TV}$ represents the total variation distance between the true dynamics $p$ and the learned model $p_{\phi}$. We also denote the distribution shift on the target policy as $\max_{s}D_{TV}(\pi_{b}\,\|\,\pi)\leq\epsilon_{\pi}$. A bound relating the true returns $\eta[\pi]$ and the model returns $\hat{\eta}[\pi]$ on the target policy is given in Janner et al. [25] as

\[\eta[\pi]\;\geq\;\hat{\eta}[\pi]-\left[\frac{2\gamma r_{max}(\epsilon_{m}+2\epsilon_{\pi})}{(1-\gamma)^{2}}+\frac{4r_{max}\epsilon_{\pi}}{(1-\gamma)}\right]. \quad (5)\]

This bound guarantees improvement under the true returns as long as the improvement under the model returns increases by more than the slack in the bound due to $\epsilon_{m}$ and $\epsilon_{\pi}$ [25, 33].

We may relate this bound to the specific learning procedure employed by BREMEN, which includes dynamics model learning, behavior cloning policy initialization, and conservative KL-based trust-region policy updates. To do so, we consider an idealized version of BREMEN, where the expectations over states in Equations 1, 3, and 4 are replaced with supremums and the dynamics model is set to have unit variance.

Proposition 1 (Policy and model error bound).

Suppose we apply the idealized BREMEN to a dataset $\mathcal{D}$, and define $\epsilon_{\beta},\epsilon_{\phi}$ in terms of the behavior cloning and dynamics model losses as

\[\epsilon_{\beta} := \sup_{s}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}/2\right]-\mathcal{H}(\pi_{b}(\cdot|s)),\]
\[\epsilon_{\phi} := \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[\|s'-\hat{f}_{\phi}(s,a)\|_{2}^{2}/2\right]-\mathcal{H}(p(\cdot|s,a)),\]

where $\mathcal{H}$ denotes the Shannon entropy. If one then applies $T$ KL-based trust-region steps of step size $\delta$ (Equation 4) using stochastic dynamics models with mean $\hat{f}_{\phi}$ and standard deviation 1, then

\[\epsilon_{\pi}\leq\sqrt{\tfrac{1}{2}\epsilon_{\beta}+\tfrac{1}{4}\log 2\pi}+T\sqrt{\tfrac{1}{2}\delta}\,;\qquad \epsilon_{m}\leq\sqrt{\tfrac{1}{2}\epsilon_{\phi}+\tfrac{1}{4}\log 2\pi}.\]
Proof.

See Appendix A. ∎

5 Experiments

We evaluate BREMEN in both deployment-efficient settings, where the algorithm must learn a policy from scratch via a limited number of deployments, and offline RL, where the algorithm is given only a single static dataset. We use four standard continuous control benchmarks for offline RL [30, 61], namely, Ant, HalfCheetah, Hopper, and Walker2d on the MuJoCo physics simulator [59]. See Appendix B and C for further details and results.

5.1 Evaluating Deployment Efficiency

We compare BREMEN to ME-TRPO, SAC, BCQ, and BRAC applied to limited deployment settings. To adapt the offline methods (BCQ, BRAC) to this setting, we simply apply them in a recursive fashion (recursive BCQ and BRAC also perform behavior-cloning-based policy initialization after each deployment): at each deployment iteration, we collect a batch of data with the most recent policy and then run the offline update on this dataset. As for SAC, we simply change the replay buffer so that it is updated only at the specified deployment intervals. For the sake of comparison, we align the number of deployments and the amount of data collected at each deployment (either 100,000 or 200,000 transitions) across all methods.

Figure 2 shows the results with 200,000 (top) and 100,000 (bottom) batched transitions per deployment. Regardless of the environments and the batch size per update, BREMEN achieves remarkable performance while existing online and offline RL methods struggle to make any progress in the limited deployment settings. As a point of comparison, we also include results for online SAC and ME-TRPO without limits on the number of deployments but using the same number of transitions.

Figure 2: Comparison of BREMEN with existing methods (ME-TRPO, SAC, BCQ, BRAC) under deployment constraints (5-10 deployments with batch sizes of 200k and 100k). The average cumulative rewards and their standard deviations over 5 random seeds are shown. Vertical dotted lines indicate where each policy deployment and data collection happens. BREMEN is able to learn successful policies with only 5-10 deployments, while the state-of-the-art off-policy (SAC), model-based (ME-TRPO), and recursively-applied offline RL algorithms (BCQ, BRAC) often struggle to make any progress. For completeness, we show ME-TRPO (online) and SAC (online), which are their original learning curves without deployment constraints, plotted with respect to samples normalized by the batch size. While SAC (online) substantially outperforms BREMEN in sample efficiency, it uses 1 deployment per sample, leading to 100k-500k deployments required for learning. Interestingly, BREMEN achieves even better performance than the original ME-TRPO (online), suggesting the effectiveness of implicit behavior regularization. For SAC and ME-TRPO under deployment-constrained evaluation, the batch size between policy deployments differs substantially from their standard settings, and we therefore performed an extensive hyper-parameter search on the relevant parameters, such as the number of policy updates between deployments, as discussed in Appendix B.2.1.

5.2 Evaluating Offline Learning

We also evaluate BREMEN on standard offline RL benchmarks following Wu et al. [61]. We first train online SAC up to a certain cumulative reward threshold, 4,000 in HalfCheetah and 1,000 in Ant, Hopper, and Walker2d, and collect offline datasets. We evaluate agents with an offline dataset of one million (1M) transitions, which is standard for BCQ and BRAC [61]. We then evaluate them on much smaller datasets of 50K and 100K transitions, 5-10% of the size used in prior works.

Table 1 shows that BREMEN can achieve performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. Moreover, BREMEN can also learn appropriately with 10-20 times smaller datasets, where BCQ and BRAC are unable to exceed even the BC baseline. As a result, our recursive BREMEN algorithm is not only deployment-efficient but also sample-efficient, and significantly outperforms the baselines.

Table 1: Comparison of BREMEN to existing offline methods on static datasets. Each cell shows the average cumulative reward and its standard deviation; the datasets contain 1M, 100K, and 50K transitions, respectively. The maximum number of steps per episode is 1,000. BRAC applies a primal form of KL value penalty, and BRAC (max Q) denotes its variant that samples multiple actions and takes the maximum according to the learned Q function.
 
1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4126 1128 1376
BC 1321±141 4281±12 1341±161 1421±147
BCQ [16] 2021±31 5783±272 1130±127 2153±753
BRAC [61] 2072±285 7192±115 1422±90 2239±1124
BRAC (max Q) 2369±234 7320±91 1916±343 2409±1210
BREMEN (Ours) 3328±275 8055±103 2058±852 2346±230
ME-TRPO (offline) [31] 1258±550 1804±924 518±91 211±154
 
100,000 (100K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4066 1128 1376
BC 1330±81 4266±21 1322±109 1426±47
BCQ 1363±199 3915±411 1129±238 2187±196
BRAC -157±383 2505±2501 1310±70 2162±1109
BRAC (max Q) -226±387 2332±2422 1422±101 2164±1114
BREMEN (Ours) 1633±127 6095±370 2191±455 2132±301
ME-TRPO (offline) 974±4 2±434 307±170 10±61
 
50,000 (50K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4138 1128 1376
BC 1270±65 4230±49 1249±61 1420±194
BCQ 1329±95 1319±626 1178±235 1841±439
BRAC -878±244 -597±73 1277±102 976±1207
BRAC (max Q) -843±279 -590±56 1276±225 903±1137
BREMEN (Ours) 1347±283 5823±146 1632±796 2280±647
ME-TRPO (offline) 938±32 -73±95 152±13 176±343
 

5.3 Evaluating Effectiveness of Implicit KL Control

In this section, we present an experiment to better understand the effect of BREMEN's implicit regularization. Figure 3 shows the KL divergence of learned policies from the last deployed policy. We compare BREMEN to variants that replace BC initialization with an explicit KL penalty on the value (conservative KL trust-region updates are still used). We find that the variants with an explicit KL penalty but without behavior initialization learn policies that move farther away from the last deployed policy than behavior-initialized policies. This suggests that the implicit behavior regularization employed by BREMEN is more effective as a conservative policy learning protocol.

Figure 3: Average cumulative rewards (top) and the corresponding KL divergence between the last deployed policy and the target policy (bottom) with batch size 200K in limited deployment settings. The behavior-initialized policy remains close to the last deployed policy during improvement without an explicit value penalty $-\alpha D_{\mathrm{KL}}(\pi_{\theta}\|\hat{\pi}_{\beta})$. The explicit penalty is controlled by a coefficient $\alpha$.

6 Related Work

Deployment Efficiency and Offline RL

Although we are not aware of any previous work that explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may consider previously proposed semi-batch RL algorithms [12, 32, 55, 51] or theoretical analyses of switching cost in tabular PAC-MDP settings [2, 21] as approaching this issue. More recently, a related but distinct problem known as offline RL has gained popularity [33, 61, 1]. These offline RL works consider the extreme case of a single deployment, and typically collect the static batch with a partially trained policy rather than a random policy. While offline RL has shown promising results for a variety of real-world applications, such as robotics [38], dialogue systems [26], or medical treatments [17], these algorithms struggle when learning a policy from scratch or when the dataset is small. Nevertheless, common themes of many offline RL algorithms – regularizing the learned policy toward the behavior policy [16, 26, 30, 54, 61] and utilizing ensembles to handle uncertainty [30, 61] – served as inspirations for the proposed BREMEN algorithm. A major difference of BREMEN from prior works is that the target policy is not explicitly forced to stay close to the estimated behavior policy through the policy update. Rather, BREMEN employs a more implicit regularization by initializing the learned policy with a behavior-cloned policy and then applying conservative trust-region updates. Another major difference is the application of model-based approaches to fully offline settings, which has not been extensively studied in prior works [33], except for two concurrent works by Kidambi et al. [28] and Yu et al. [62] that study pessimistic or uncertainty-penalized MDPs with guarantees – closely related to Liu et al. [36]. By contrast, our work shows that a simple technique already enables model-based offline algorithms to significantly outperform prior model-free methods, and is, to the best of our knowledge, the first to define and extensively evaluate deployment efficiency with recursive experiments.

Model-Based RL

There are many types of model-based RL algorithms [58, 11, 23]. A simple algorithmic choice is Dyna-style [58], which uses a parameterized dynamics model to estimate the true MDP transition function, stochastically mapping states and actions to next states. The dynamics model can then serve as a simulator of the environment during policy updates. Dyna-style algorithms often suffer from distributional shift, also known as model bias, which leads RL agents to exploit regions where the data is insufficient and can cause significant performance degradation. A variety of remedies have been proposed to alleviate model bias, such as the use of multiple dynamics models as an ensemble [8, 31, 25], meta-learning [9], an energy-based model regularizer [4], a game-theoretic framework [48], and explicit reward penalties for unknown states [28, 62]. Notably, we employ a subset of these remedies – model ensembles and trust-region updates [31] – in BREMEN. Compared to existing works, our work is notable for using BC initialization in conjunction with trust-region updates to alleviate the distribution shift of the learned policy from the dataset used to train the dynamics model.

7 Conclusion

In this work, we introduced deployment efficiency, a novel measure of RL performance that counts the number of changes in the data-collection policy during learning. To enhance deployment efficiency, we proposed Behavior-Regularized Model-ENsemble (BREMEN), a novel model-based offline algorithm with implicit KL regularization via appropriate policy initialization and trust-region updates. BREMEN shows impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments, as it can improve policies offline even when the batch size is 10-20 times smaller than in prior works. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works [45, 13, 14]. Most critically, we show that under deployment efficiency constraints, most prior algorithms – model-free or model-based, online or offline – fail to achieve successful learning. We hope our work can steer the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC [22] while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning.

Broader Impact

Deployment efficiency is a key concept for real-world applications of RL because excessive policy deployments may be harmful or costly in robotics, health care, dialogue agents, or education. However, in most prior deep RL literature and benchmarks, this metric is seldom mentioned, and its disregard is sometimes exploited for the best sample efficiency while effectively allowing 100k-1M deployments. Our proposed algorithm BREMEN achieves a deployment efficiency of 5-10 deployments for learning the standard OpenAI Gym MuJoCo benchmarks, which no other algorithm can match. While the final performance is sometimes worse than deployment-unconstrained SAC, we hope our definition and benchmark for deployment efficiency can motivate further research by the community. Other impact questions are not applicable to this paper.

On the other hand, BREMEN still requires a few online deployments, which may still involve some risks. Fully safe and efficient RL that can learn from scratch remains an open problem.

Acknowledgments

We thank Yusuke Iwasawa, Emma Brunskill, Lihong Li, Sergey Levine, and George Tucker for insightful comments and discussion.

References

  • Agarwal et al. [2019] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
  • Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. In Advances in Neural Information Processing Systems, 2019.
  • Barth-Maron et al. [2018] Gabriel Barth-Maron, Matthew Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.
  • Boney et al. [2019] Rinu Boney, Juho Kannala, and Alexander Ilin. Regularizing model-based planning with energy-based models. In Conference on Robot Learning, 2019.
  • Chow et al. [2015] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, 2015.
  • Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in neural information processing systems, 2018.
  • Chow et al. [2019] Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.
  • Clavera et al. [2018] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.
  • Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Deisenroth and Rasmussen [2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
  • Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
  • Espeholt et al. [2019] Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. SEED RL: Scalable and efficient deep-rl with accelerated central inference. arXiv preprint arXiv:1910.06591, 2019.
  • Eysenbach et al. [2018] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations, 2018.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.
  • Gottesman et al. [2018] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
  • Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, 2016.
  • Gu et al. [2017a] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, 2017a.
  • Gu et al. [2017b] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017b.
  • Guo and Brunskill [2015] Zhaohan Guo and Emma Brunskill. Concurrent pac rl. In AAAI Conference on Artificial Intelligence, 2015.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
  • Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
  • Jaques et al. [2019] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
  • Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL : Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.
  • Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations, 2018.
  • Lange et al. [2012] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement learning. Springer, 2012.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Lillicrap et al. [2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Lin [1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 1992.
  • Liu et al. [2019] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
  • Mandel et al. [2014] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multiagent Systems, 2014.
  • Mandlekar et al. [2019] Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321, 2019.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Murphy et al. [2001] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 2001.
  • Nachum et al. [2018] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018.
  • Nachum et al. [2019] Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning, 2019.
  • Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. International Conference on Robotics and Automation, 2018.
  • Nagarajan and Kolter [2019] Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
  • Nair et al. [2015] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  • Precup et al. [2001] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, 2001.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • Rajeswaran et al. [2020] Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. arXiv preprint arXiv:2004.07804, 2020.
  • Ray et al. [2019] Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
  • Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics, 2011.
  • Roux [2016] Nicolas Le Roux. Efficient iterative policy optimization. arXiv preprint arXiv:1612.08967, 2016.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Siegel et al. [2020] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin A. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.
  • Singh et al. [1994] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings. Elsevier, 1994.
  • Singh et al. [1995] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, 1995.
  • Sohn et al. [2020] Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, and Craig Boutilier. BRPO: Batch residual policy optimization. arXiv preprint arXiv:2002.05522, 2020.
  • Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 1991.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Wang et al. [2019] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
  • Wu et al. [2019] Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning. arXiv preprint arXiv:1911.11361, 2019.
  • Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Appendix

A Proof of Proposition 1

We first consider $\epsilon_{\pi}$. The behavior cloning objective in its supremum form is

\begin{align*}
\epsilon_{\beta} &= \sup_{s\in\mathcal{D}}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}/2\right]-\mathcal{H}(\pi_{b}(\cdot|s)) \\
&= \sup_{s\in\mathcal{D}}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[-\log\pi_{\theta_{0}}(a|s)\right]-\mathcal{H}(\pi_{b}(\cdot|s))-\frac{1}{2}\log 2\pi \\
&= \sup_{s\in\mathcal{D}}D_{KL}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{0}}(\cdot|s)\big)-\frac{1}{2}\log 2\pi.
\end{align*}

We apply Pinsker's inequality to the true and estimated behavior policies to yield

\[\sup_{s}D_{TV}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{0}}(\cdot|s)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\beta}+\frac{1}{4}\log 2\pi}.\]

By the same Pinsker's inequality, we have

\[\sup_{s}D_{TV}\big(\pi_{\theta_{k}}(\cdot|s)\,\|\,\pi_{\theta_{k+1}}(\cdot|s)\big)\leq\sqrt{\delta/2}.\]

Therefore, by the triangle inequality, we have

\[\epsilon_{\pi}\leq\sup_{s}D_{TV}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{T}}(\cdot|s)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\beta}+\frac{1}{4}\log 2\pi}+T\sqrt{\frac{1}{2}\delta},\]

as desired.

We proceed similarly for $\epsilon_{m}$. The dynamics model loss is

\begin{align*}
\epsilon_{\phi} &= \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[\|s'-\hat{f}_{\phi}(s,a)\|_{2}^{2}/2\right]-\mathcal{H}(p(\cdot|s,a)) \\
&= \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[-\log p_{\phi}(s'|s,a)\right]-\mathcal{H}(p(\cdot|s,a))-\frac{1}{2}\log 2\pi \\
&= \sup_{s,a}D_{KL}\big(p(\cdot|s,a)\,\|\,p_{\phi}(\cdot|s,a)\big)-\frac{1}{2}\log 2\pi.
\end{align*}

We apply Pinsker's inequality to the true dynamics and the learned model to yield

\[\epsilon_{m}\leq\sup_{s,a}D_{TV}\big(p(\cdot|s,a)\,\|\,p_{\phi}(\cdot|s,a)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\phi}+\frac{1}{4}\log 2\pi},\]

as desired.

B Details of Experimental Settings

B.1 Implementation Details

For our baseline methods, we use the open-source implementations of SAC, BC, BCQ, and BRAC published by Wu et al. [61]. SAC and BRAC use (300, 300) Q-networks and (200, 200) policy networks. BC uses a (200, 200) policy network, and BCQ uses a (300, 300) Q-network, a (300, 300) policy network, and a (750, 750) conditional VAE. For online ME-TRPO, we utilize the codebase of the model-based RL benchmark [60]. BREMEN and online ME-TRPO use a policy consisting of two hidden layers with 200 units each; the dynamics model consists of two hidden layers with 1,024 units each. We use Adam [29] as the optimizer, with a learning rate of 0.001 for the dynamics model and 0.0005 for behavior cloning in BREMEN. In BREMEN and online ME-TRPO, we adopt a linear-feature value function to stabilize training. BREMEN in deployment-efficient settings takes about two to three hours per deployment on an NVIDIA TITAN V.
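
For reference, the architecture and optimizer settings above can be collected into a single configuration dictionary (a convenience restatement of this paragraph, not part of the released code); tuples are hidden-layer widths.

```python
# Hidden-layer widths and optimizer settings from Appendix B.1 (for reference only).
ARCHITECTURES = {
    "SAC / BRAC": {"q_network": (300, 300), "policy": (200, 200)},
    "BC": {"policy": (200, 200)},
    "BCQ": {"q_network": (300, 300), "policy": (300, 300), "conditional_vae": (750, 750)},
    "BREMEN / online ME-TRPO": {
        "policy": (200, 200),
        "dynamics_model": (1024, 1024),
        "value_function": "linear features",
        "optimizer": "Adam",
        "lr_dynamics_model": 1e-3,
        "lr_behavior_cloning": 5e-4,
    },
}
```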

To leverage neural networks as Dyna-style [58] dynamics models, we modify the reward and termination functions so that they do not depend on the internal physics engine, following the model-based benchmark codebase [60]; see Table 2. Note that the scores of the baselines (e.g., BCQ, BRAC) differ slightly from Wu et al. [61] due to this modification of the reward function. We re-ran each algorithm in our environments and confirmed appropriate convergence.

The maximum length of one episode is 1,000 steps, without any termination in Ant and HalfCheetah; the termination function is enabled in Hopper and Walker2d. The batch size of transitions for the policy update is 50,000 in BREMEN and ME-TRPO, following Kurutach et al. [31]. The batch size is 256 for BC and BRAC, and 100 for BCQ, also following Wu et al. [61].

Figure 4: Four standard MuJoCo benchmark environments used in our experiments: (a) Ant, (b) HalfCheetah, (c) Hopper, (d) Walker2d.
 
Environment Reward function Termination in rollouts
Ant $\dot{x}_{t}-0.1\|\bm{a}_{t}\|_{2}^{2}-3.0\times(z_{t}-0.57)^{2}+1$ False
HalfCheetah $\dot{x}_{t}-0.1\|\bm{a}_{t}\|_{2}^{2}$ False
Hopper $\dot{x}_{t}-0.001\|\bm{a}_{t}\|_{2}^{2}+1$ True
Walker2d $\dot{x}_{t}-0.001\|\bm{a}_{t}\|_{2}^{2}+1$ True
 
Table 2: Reward function and termination in rollouts in the experiments. We remove all contact information from observation of Ant, basically following Wang et al. [60].
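
As an illustration of how Table 2 decouples rewards from the simulator, the sketch below writes the reward functions and a Hopper-style termination check as plain Python; the observation indices for the x-velocity and torso height, and the termination thresholds, are illustrative assumptions that depend on the actual observation layout.

```python
import numpy as np

# Reward functions from Table 2 written without the physics engine, so they can be
# evaluated on imagined states. Observation indices are illustrative assumptions.

def ant_reward(obs, action, x_dot_idx=13, z_idx=0):
    x_dot, z = obs[x_dot_idx], obs[z_idx]
    return x_dot - 0.1 * np.sum(np.square(action)) - 3.0 * (z - 0.57) ** 2 + 1.0

def halfcheetah_reward(obs, action, x_dot_idx=8):
    return obs[x_dot_idx] - 0.1 * np.sum(np.square(action))

def hopper_reward(obs, action, x_dot_idx=5):      # Walker2d uses the same form (Table 2)
    return obs[x_dot_idx] - 0.001 * np.sum(np.square(action)) + 1.0

def hopper_done(obs, z_idx=0, angle_idx=1):
    # Illustrative termination check (thresholds assumed): stop when the torso falls or tilts.
    z, angle = obs[z_idx], obs[angle_idx]
    return not (np.isfinite(obs).all() and z > 0.7 and abs(angle) < 0.2)
```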

B.2 Hyper Parameters

In this section, we describe the hyper-parameters in both the deployment-efficient RL (Section B.2.1) and offline RL (Section B.2.2) settings. We run all of our experiments with five random seeds, and the results are averaged.

B.2.1 Deployment-Efficient RL

Table 3 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the maximum step size $\delta$ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor $\gamma$ and GAE $\lambda$, we follow Wang et al. [60].

 
Parameter Ant HalfCheetah Hopper Walker2d
Iteration per batch 2,000 2,000 6,000 2,000
Deployment 5 5 10 10
Total iteration 10,000 10,000 60,000 20,000
Rollouts length 250 250 1,000 1,000
Max step size $\delta$ 0.05 0.1 0.05 0.05
Discount factor $\gamma$ 0.99 0.99 0.99 0.99
GAE $\lambda$ 0.97 0.95 0.95 0.95
Stationary noise $\sigma$ 0.1 0.1 0.1 0.1
 
Table 3: Hyper-parameters of BREMEN in deployment-efficient settings.
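
Table 3 can equivalently be read as the per-environment configuration below (a restatement of the table), which also makes explicit that the total number of iterations equals the iterations per batch times the number of deployments.

```python
# Table 3 restated as per-environment configurations (deployment-efficient settings).
BREMEN_CONFIG = {
    "Ant":         dict(iters_per_batch=2000, deployments=5,  rollout_length=250,  delta=0.05, gae_lambda=0.97),
    "HalfCheetah": dict(iters_per_batch=2000, deployments=5,  rollout_length=250,  delta=0.10, gae_lambda=0.95),
    "Hopper":      dict(iters_per_batch=6000, deployments=10, rollout_length=1000, delta=0.05, gae_lambda=0.95),
    "Walker2d":    dict(iters_per_batch=2000, deployments=10, rollout_length=1000, delta=0.05, gae_lambda=0.95),
}
# Shared across environments: discount gamma = 0.99, stationary noise sigma = 0.1.
for env, cfg in BREMEN_CONFIG.items():
    print(env, "total iterations =", cfg["iters_per_batch"] * cfg["deployments"])
```
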
Number of Iterations for Policy Optimization

To achieve high deployment efficiency, the number of iterations for policy optimization between deployments is one of the important hyper-parameters for fast convergence. In the existing methods (BCQ, BRAC, SAC), we search over three values: {10,000, 50,000, 100,000}, and choose 10,000 in BCQ and BRAC, and 100,000 in SAC (Figure 5). For BREMEN, we also search over three values: {2,000, 4,000, 6,000}. Figure 6 shows the results of iteration search, and we choose 2,000 in Ant, HalfCheetah, and Walker2d, and 6,000 in Hopper.

Figure 5: Search on the number of iterations for SAC policy optimization between deployments. The number of transitions per data collection is 200K.
Figure 6: Search on the number of iterations for BREMEN policy optimization between deployments. The number of transitions per data collection is 200K.
Stationary Noise in BREMEN

To achieve effective exploration, a stochastic Gaussian policy is a good choice. We found that adding stationary Gaussian noise to the policy, both in the imaginary trajectories and during data collection, led to a notable improvement. The stationary Gaussian policy is written as

$$a_t = \tanh(\mu_\theta(s_t)) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$

Another choice is a learned Gaussian policy, which parameterizes not only $\mu_\theta$ but also $\sigma_\theta$. The learned Gaussian policy is written as

$$a_t = \tanh(\mu_\theta(s_t)) + \sigma_\theta(s_t) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$

We use the zero-mean Gaussian $\mathcal{N}(0, \sigma^2)$ and tune $\sigma$ on HalfCheetah in Figure 7, comparing the stationary and learned strategies. From this experiment, we found that stationary noise with a scale of 0.1 consistently performs well, and we therefore used it for all our experiments.
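The two sampling schemes compared above can be sketched as follows; here `mu` and `sigma_theta` stand for the policy-network outputs, and the NumPy-level formulation is an illustration rather than the released implementation.

```python
import numpy as np

def stationary_gaussian_action(mu, sigma=0.1):
    """Stationary noise: a fixed-scale Gaussian added to the tanh-squashed mean."""
    return np.tanh(mu) + np.random.normal(0.0, sigma, size=mu.shape)

def learned_gaussian_action(mu, sigma_theta, sigma=0.1):
    """Learned noise: a state-dependent scale sigma_theta(s) multiplies the sampled noise."""
    eps = np.random.normal(0.0, sigma, size=mu.shape)
    return np.tanh(mu) + sigma_theta * eps
```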

Figure 7: Search on the Gaussian noise parameter $\sigma$ in HalfCheetah. The number of transitions per data collection is 200K.
Other Hyper-parameters in the Existing Methods

For online ME-TRPO, we collect 3,000 steps of online interaction with the environment every 25 iterations and split these transitions into training and validation datasets at a 2-to-1 ratio for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split at a 1-to-1 ratio. In total, we run 12,500 policy-optimization iterations, which corresponds to 500 deployments of the policy. Note that we carefully tuned the hyper-parameters of online ME-TRPO, so its performance is improved over that reported in Wang et al. [60].

Table 4 and Table 5 show the tunable hyper-parameters of BCQ and BRAC, respectively. We follow Wu et al. [61] in choosing these values. In this work, BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.

 
Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 3e-05 3e-04 3e-06 3e-05
Perturbation range $\Phi$ 0.15 0.5 0.15 0.15
 
Table 4: Hyper-parameters of BCQ.
 
Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 1e-4 1e-3 3e-5 1e-5
Divergence penalty $\alpha$ 0.3 0.1 0.3 0.3
 
Table 5: Hyper-parameters of BRAC.

B.2.2 Offline RL

In the offline experiments, we apply the same hyper-parameters as in the deployment-efficient settings described above, except for the iterations per batch. Algorithm 2 gives pseudocode for BREMEN in the offline RL setting, where policies are updated only with one fixed batch dataset. The number of iterations $T$ is set to 6,250 for BREMEN and 500,000 for BC, BCQ, and BRAC.

Algorithm 2 BREMEN for Offline RL
Require: Offline dataset $\mathcal{D} = \{s_t, a_t, r_t, s_{t+1}\}$, initial parameters $\phi = \{\phi_1, \cdots, \phi_K\}$, $\beta$, number of policy optimization iterations $T$.
1: Train $K$ dynamics models $\hat{f}_\phi$ using $\mathcal{D}$ via Eq. 1.
2: Train the estimated behavior policy $\hat{\pi}_\beta$ using $\mathcal{D}$ by behavior cloning via Eq. 3.
3: Initialize the target policy $\pi_{\theta_0} = \mathrm{Normal}(\hat{\pi}_\beta, 1)$.
4: for policy optimization step $k = 1, \cdots, T$ do
5:    Generate imaginary rollouts.
6:    Optimize the target policy $\pi_\theta$ satisfying Eq. 4 with the rollouts.
7: end for
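As a reading aid, Algorithm 2 can be expressed as a high-level Python sketch. The callables passed in below are hypothetical stand-ins for the components defined in the main text (Eq. 1 for dynamics training, Eq. 3 for behavior cloning, and Eq. 4 for the trust-region update); this is not the released implementation.

```python
def bremen_offline(dataset, train_dynamics_model, behavior_clone, gaussian_policy,
                   imaginary_rollout, trpo_update, K=5, T=6250):
    """High-level sketch of Algorithm 2 with hypothetical helper callables."""
    # Line 1: train K dynamics models on the fixed offline dataset (Eq. 1).
    models = [train_dynamics_model(dataset, seed=i) for i in range(K)]
    # Line 2: estimate the behavior policy by behavior cloning (Eq. 3).
    pi_beta = behavior_clone(dataset)
    # Line 3: initialize the target policy as a Gaussian centered at the cloned policy.
    pi = gaussian_policy(mean=pi_beta, std=1.0)
    # Lines 4-7: repeated imaginary rollouts and conservative trust-region updates (Eq. 4).
    for _ in range(T):
        rollouts = imaginary_rollout(pi, models, dataset)
        pi = trpo_update(pi, rollouts)
    return pi
```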

C Additional Experimental Results

C.1 Performance on the Dataset with Different Noise

Following Wu et al. [61] and Kidambi et al. [28], we additionally compare BREMEN in offline settings to the other baselines (BC, BCQ, BRAC) on five datasets collected with different exploration noise; a rough sketch of the mixing scheme is given after the list. Each dataset also contains one million transitions.

  • eps1: 40% of the dataset is collected by the data-collection policy (a partially trained SAC policy) $\pi_b$, 40% is collected by an epsilon-greedy policy that takes a random action with probability $\epsilon = 0.1$, and 20% is collected by a uniformly random policy.

  • eps3: Same as eps1, except the epsilon-greedy policy uses $\epsilon = 0.3$: 40% of the dataset is collected by $\pi_b$, 40% by the epsilon-greedy policy, and 20% by a uniformly random policy.

  • gaussian1: 40% of the dataset is collected by the data-collection policy $\pi_b$, 40% by a policy that adds zero-mean Gaussian noise $\mathcal{N}(0, 0.1^2)$ to each action sampled from $\pi_b$, and 20% by a uniformly random policy.

  • gaussian3: Same as gaussian1, but with zero-mean Gaussian noise $\mathcal{N}(0, 0.3^2)$.

  • random: All of the dataset is collected by a uniformly random policy.
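The sketch below illustrates the 40/40/20 mixing scheme described above. Here `collect(env, policy, n)` is a hypothetical helper that rolls out the given policy in the real environment for `n` transitions, and the gym-style `action_space` attributes are assumptions; the exact collection procedure follows Wu et al. [61].

```python
import numpy as np

def build_noisy_dataset(env, pi_b, collect, n_total=1_000_000, noise="eps", scale=0.1):
    """40% behavior policy pi_b, 40% a noisy variant of it, 20% uniform random actions."""
    if noise == "eps":   # eps1 / eps3: random action with probability `scale`
        noisy = lambda s: env.action_space.sample() if np.random.rand() < scale else pi_b(s)
    else:                # gaussian1 / gaussian3: additive zero-mean Gaussian noise
        noisy = lambda s: pi_b(s) + np.random.normal(0.0, scale, size=env.action_space.shape)
    random_pi = lambda s: env.action_space.sample()

    return (collect(env, pi_b, int(0.4 * n_total))
            + collect(env, noisy, int(0.4 * n_total))
            + collect(env, random_pi, int(0.2 * n_total)))
```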

Table 6 shows that BREMEN also achieves performance competitive with the state-of-the-art model-free offline RL algorithms even with noisy datasets. The training curves for each experiment are shown in Section C.4.

Table 6: Comparison of BREMEN to the existing offline methods, namely BC, BCQ [16], and BRAC [61], in offline settings. Each cell shows the average cumulative reward and its standard deviation over 5 seeds. The maximum number of steps per episode is 1,000. Five different types of exploration noise are introduced during data collection: eps1, eps3, gaussian1, gaussian3, and random. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.
 
Noise: eps1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1077 2936 791 815
BC 1381±71 3788±740 266±486 1185±155
BCQ 1937±116 6046±276 800±659 479±537
BRAC 2693±155 7003±118 1243±162 3204±103
BRAC (max Q) 2907±98 7070±81 1488±386 3330±147
BREMEN (Ours) 3519±129 7585±425 2818±76 1710±429
ME-TRPO (offline) 1514±503 1009±731 1301±654 128±153
 
Noise: eps3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 936 2408 662 648
BC 1364±121 2877±797 519±532 1066±176
BCQ 1938±21 5739±188 1170±446 1018±1231
BRAC 2718±90 6434±147 1224±71 2921±101
BRAC (max Q) 2913±87 6672±136 2103±746 3079±110
BREMEN (Ours) 3409±218 7632±104 2803±65 1586±139
ME-TRPO (offline) 1843±674 5504±67 1308±756 354±329
 
Noise: gaussian1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1072 3150 882 1070
BC 1279±80 4142±189 31±16 1137±477
BCQ 1958±76 5854±498 475±416 608±416
BRAC 2905±81 7026±168 1456±161 3030±103
BRAC (max Q) 2910±157 7026±168 1575±89 3242±97
BREMEN (Ours) 2912±165 7928±313 1999±617 1402±290
ME-TRPO (offline) 1275±656 1275±656 909±631 171±119
 
Noise: gaussian3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1058 2872 781 981
BC 1300±34 4190±69 611±467 1217±361
BCQ 1982±97 5781±543 1137±582 258±286
BRAC 3084±180 3933±2740 1432±499 3253±118
BRAC (max Q) 2916±99 3997±2761 1417±267 3372±153
BREMEN (Ours) 3432±185 8124±145 1867±354 2299±474
ME-TRPO (offline) 1237±310 2141±872 973±243 219±145
 
Noise: random, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 470 -285 34 2
BC 989±10 -2±1 106±62 108±110
BCQ 1222±114 2887±242 206±7 228±12
BRAC 1057±92 3449±259 227±30 29±54
BRAC (max Q) 683±57 3418±171 224±37 26±50
BREMEN (Ours) 905±11 3627±193 270±68 254±6
ME-TRPO (offline) 2221±665 2701±120 321±29 262±13
 

C.2 Comparison among Different Numbers of Ensembles

To deal with the distribution shift during policy optimization, also known as model bias, we introduce dynamics model ensembles. We validate the performance of BREMEN with different numbers of dynamics models $K$. Figure 8 and Figure 9 show the performance of BREMEN with different numbers of ensemble members in the deployment-efficient and offline settings, respectively. Ensembles with more dynamics models resulted in better performance, due to the mitigation of distributional shift, except for $K = 10$; we therefore choose $K = 5$.
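The role of the ensemble during imaginary rollouts can be illustrated with a short sketch in which one of the $K$ learned models is sampled per simulated step, a common ME-TRPO-style scheme; the `predict` method is a hypothetical single-model API.

```python
import random

def imaginary_step(state, action, models):
    """One step of an imaginary rollout: randomly pick one of the K dynamics models
    to predict the next state, so no single model's bias dominates the rollout."""
    return random.choice(models).predict(state, action)
```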

Figure 8: Comparison of the number of dynamics models in deployment-efficient settings.
Figure 9: Comparison of the number of dynamics models in offline settings.

C.3 Implicit KL Control in Offline Settings

Similar to Section 5.3, we present offline RL experiments to better understand the effect of implicit KL regularization. In contrast to the implicit KL regularization of Eq. 4, the optimization objective of BREMEN with an explicit KL value penalty becomes

$$\theta_{k+1} = \operatorname*{arg\,max}_{\theta}\;\mathbb{E}_{s,a\sim\pi_{\theta_k},\hat{f}_{\phi_i}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}\Big(A^{\pi_{\theta_k}}(s,a)-\alpha D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|s)\,\|\,\hat{\pi}_{\beta}(\cdot|s)\big)\Big)\right] \qquad (6)$$
$$\text{s.t.}\quad\mathbb{E}_{s\sim\pi_{\theta_k}}\left[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|s)\,\|\,\pi_{\theta_k}(\cdot|s)\big)\right]\leq\delta,$$

where $A^{\pi_{\theta_k}}(s,a)$ is the advantage of $\pi_{\theta_k}$ computed using imaginary rollouts with the learned dynamics model and $\delta$ is the maximum step size. Note that BREMEN with the explicit KL penalty does not utilize behavior cloning initialization.
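For diagonal Gaussian policies, the penalty term $\alpha D_{\mathrm{KL}}(\pi_\theta(\cdot|s)\,\|\,\hat{\pi}_\beta(\cdot|s))$ in Eq. 6 has a closed form; a minimal sketch of the penalized advantage is given below, where the policy means and standard deviations are assumed to be available as arrays.

```python
import numpy as np

def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) for diagonal Gaussians, summed over dims."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return np.sum(np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5)

def penalized_advantage(adv, mu_pi, std_pi, mu_beta, std_beta, alpha):
    """Advantage with the explicit behavior-KL value penalty of Eq. 6."""
    return adv - alpha * diag_gaussian_kl(mu_pi, std_pi, mu_beta, std_beta)
```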

We empirically conclude that the explicit penalty $-\alpha D_{\mathrm{KL}}(\pi_\theta(\cdot|s)\,\|\,\hat{\pi}_\beta(\cdot|s))$ is unnecessary, and that the TRPO update with behavior-cloning initialization as implicit regularization is sufficient in the BREMEN algorithm. Figure 10 shows the KL divergence between the learned policies and the last deployed policies (second row) and the model errors, measured by the mean squared error between the predicted and true next states (bottom row). We find that the behavior-initialized policy with conservative KL trust-region updates stays close to the last deployed policy during improvement, even without an explicit KL penalty. The policy initialized with behavior cloning also tends to suppress the increase of model error, which implies that behavior initialization alleviates the effect of the distribution shift. In Walker2d, the model error of BREMEN is relatively large, which may relate to its poor performance with noisy datasets in Section C.1.

Figure 10: Average cumulative rewards (top row), corresponding KL divergence of the learned policies from the last deployed policy (second row), and model errors (bottom row) in offline settings with the 1M dataset (no noise). The behavior-initialized policy (purple line) suppresses the policy divergence and model error during training better than no initialization (red line) or the explicit KL penalty (green line).

C.4 Training Curves for Offline RL with Different Noises

In this section, we present the training curves for all of our experiments in offline settings. Figure 11 shows the results in Section 5.2. Figures 12, 13, 14, 15, and 16 show the results in Section C.1.

Figure 11: Performance in offline RL experiments (Table 1). Dataset size is 1M (top row), 100K (second row), and 50K (bottom row). Note that the x-axis is the number of policy-optimization iterations on a log scale.
Figure 12: Performance in offline RL experiments with $\epsilon$-greedy dataset noise, $\epsilon = 0.1$. Dataset size is 1M.
Figure 13: Performance in offline RL experiments with $\epsilon$-greedy dataset noise, $\epsilon = 0.3$. Dataset size is 1M.
Figure 14: Performance in offline RL experiments with Gaussian dataset noise $\mathcal{N}(0, 0.1^2)$. Dataset size is 1M.
Figure 15: Performance in offline RL experiments with Gaussian dataset noise $\mathcal{N}(0, 0.3^2)$. Dataset size is 1M.
Figure 16: Performance in offline RL experiments with completely random behaviors. Dataset size is 1M.

C.5 Deployment-Efficient RL Experiment with Different Reward Function

In addition to the main results in Section 5.1 (Figure 2), we also evaluate BREMEN in the deployment-efficient setting with a different reward function. We modified the HalfCheetah environment into one similar to the cheetah-run task in the DeepMind Control Suite (https://github.com/deepmind/dm_control/blob/master/dm_control/suite/cheetah.py). The reward function is defined as

$$r_t = \begin{cases} 0.1\,\dot{x}_t & (0 \leq \dot{x}_t \leq 10) \\ 1 & (\dot{x}_t > 10), \end{cases}$$

and the termination is turned off. Figure 17 shows the performance of BREMEN and the existing methods. BREMEN again shows better deployment efficiency than the other existing offline methods and online ME-TRPO, except for SAC, matching the trend of the main results.
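The piecewise reward above amounts to a saturating velocity reward; a small sketch follows, where `x_vel` is the torso forward velocity and the clipping at zero for negative velocities is our assumption, mirroring the linear-tolerance reward of dm_control's cheetah-run.

```python
def modified_cheetah_reward(x_vel):
    """Saturating velocity reward: 0.1 * x_vel up to a velocity of 10, then 1."""
    return max(0.0, min(0.1 * x_vel, 1.0))
```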

Figure 17: Performance in deployment-efficient RL experiments with a different reward function for HalfCheetah.