Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima*   Hiroki Furuta*
The University of Tokyo
{matsushima, furuta}@weblab.t.u-tokyo.ac.jp

Yutaka Matsuo
The University of Tokyo
matsuo@weblab.t.u-tokyo.ac.jp

Ofir Nachum
Google Research
ofirnachum@google.com

Shixiang Shane Gu
Google Research
shanegu@google.com

*Equal contribution.
Abstract

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that can effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.

1 Introduction

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks [3, 24, 42]. Virtually all of these demonstrations have relied on highly-frequent online access to the environment, with the RL algorithms often interleaving each update to the policy with additional experience collection of that policy acting in the environment. However, in many real-world applications of RL, such as health [40], education [37], dialog agents [26], and robotics [19, 27], the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data collection policies, we can substantially reduce these costs and risks.

Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency [46, 10, 20, 22, 34, 41], which measures the number of environment interactions incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when data efficiency is high, deployment efficiency can be low, since many on-policy and off-policy algorithms alternate data collection with each policy update [52, 34, 18, 22]. Such dependence on high-frequency policy deployments is best illustrated by recent work in offline RL [16, 26, 30, 33, 61], where baseline off-policy algorithms exhibit poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough for achieving high performance with data collected from a sub-optimal behavior policy, but often not from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch with minimal amounts of data and deployments.

Many existing model-free offline RL algorithms [33] are tuned and evaluated on large datasets (e.g., one million transitions). In order to develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller dataset sizes. We believe model-based RL is better suited to this setting due to its higher demonstrated sample efficiency than model-free RL [31, 43]. Although the combination of model-based RL and offline or limited-deployment settings seems straight-forward, we find this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors [16] similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions which lead it to regions of the state space where the dynamics model cannot predict properly, due to poor coverage of the dataset. This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by regularizing policy updates explicitly with a divergence from the observed data distribution [26, 30, 61], which, however, can overly limit policies’ expressivity [57].

In order to better approach these problems arising in limited deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on high-dimensional continuous control benchmarks and find that it achieves impressive deployment efficiency. BREMEN is able to learn successful policies with only 5-10 deployments, significantly outperforming existing off-policy and offline RL algorithms in this deployment-constrained setting. We further evaluate BREMEN on standard offline RL benchmarks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain.

Figure 1: Deployment efficiency is defined as the number of changes in the data-collection policy ($I$), which is vital for managing the costs and risks of new policy deployment. Online RL algorithms typically require many iterations of policy deployment and data collection, which leads to extremely low deployment efficiency. In contrast, most pure offline algorithms consider updating a policy from a fixed dataset without additional deployment and often fail to learn from a randomly initialized data-collection policy. Interestingly, most state-of-the-art off-policy algorithms are still evaluated in heavily online settings. For example, SAC [22] collects one sample per policy update, amounting to 100,000 to 1 million deployments for learning standard benchmark domains.

2 Preliminaries

We consider a Markov Decision Process (MDP) setting, characterized by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},p,r,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s'|s,a)$ is the transition probability distribution or dynamics, $r(s)$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. A policy $\pi$ is a function that determines the agent's behavior, mapping from states to probability distributions over actions. The goal is to obtain the optimal policy $\pi^{\ast}$ as

\[\pi^{\ast}=\operatorname*{arg\,max}_{\pi}\eta[\pi]=\operatorname*{arg\,max}_{\pi}\,\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right],\]

where $\eta[\pi]$ is the expectation of the discounted sum of rewards under the policy $\pi$. The transition probability $p(s'|s,a)$ is usually unknown, and in model-based RL it is estimated with a parameterized dynamics model $f_{\phi}$ (e.g., a neural network). For simplicity, we assume that the reward function $r(s)$ is known and that the reward can be computed for any arbitrary state, but we can easily extend to the unknown setting by predicting the reward with a parameterized function.

On-policy vs Off-policy, Online vs Offline   At a high level, most RL algorithms iterate many times between collecting a batch of transitions (deployments) and optimizing the policy (learning). If an algorithm discards data after each policy update, it is on-policy [52, 53]; if it accumulates data in a buffer $\mathcal{D}$, i.e., experience replay [35], it is off-policy [39, 34, 18, 20, 22, 16], because not all the data in the buffer come from the current policy. However, we consider all these algorithms to be online RL algorithms, since they involve many deployments during learning, ranging from hundreds to millions. On the other hand, in pure offline RL, one does not assume direct interaction and learns a policy from only a fixed dataset, which effectively corresponds to a single deployment allowed for learning. Classically, interpolating these two extremes were semi-batch RL algorithms [32, 56], which improve the policy through repetitions of collecting a large batch of transitions $\mathcal{D}=\{(s,a,s',r)\}$ and performing many or full policy updates. While such semi-batch RL algorithms also achieve good deployment efficiency, they have not been extensively studied with neural network function approximators or in off-policy settings with experience replay for scalable sample-efficient learning. In our work, we aim for both high deployment efficiency and sample efficiency by developing an algorithm that can solve tasks with minimal policy deployments as well as minimal transition samples.

3 Deployment Efficiency

Deploying a new policy for data collection can be associated with a number of costs and risks for many real-world applications like medicine or robotic control [40, 37, 19, 27, 42]. While there is an abundance of works on safety for RL [5, 15, 6, 49, 7], these methods often do not provide guarantees in practice when combined with neural networks and stochastic optimization. It is therefore necessary to validate each policy before deployment. Due to the cost associated with each deployment, it is desirable to minimize the number of distinct deployments needed during the learning process.

In order to focus research on these practical bottlenecks, we propose a novel measure of RL algorithms, namely, deployment efficiency, which counts how many times the data-collection policy is changed while improving from a random policy to one that solves the task. For example, if an RL algorithm operates by using its learned policy to collect transitions from the environment $I$ times, each time collecting a batch of $B$ new transitions, then the number of deployments is $I$, while the total number of samples collected is $I\times B$. The lower $I$ is, the more deployment-efficient the algorithm is; in contrast, sample efficiency looks at $I\times B$. Online RL algorithms, whether they are on-policy or off-policy, typically update the policy and acquire new transitions by deploying the newly updated policy at every iteration. This corresponds to performing hundreds to millions of deployments during learning on standard benchmarks [22], which is severely deployment-inefficient. On the other hand, the offline RL literature only studies the case of a single deployment. A deployment-efficient algorithm would stand in the middle of these two extremes and ideally learn a successful policy from scratch while deploying only a few distinct policies, as illustrated in Figure 1.
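
To make the two quantities concrete, the following minimal Python loop is an illustrative sketch only (not BREMEN itself); `collect_batch` and `improve_offline` are hypothetical stand-ins for real-environment data collection with a fixed policy and for offline policy improvement.

```python
# Generic deployment-constrained loop: I counts deployments, I * B counts environment samples.
# `collect_batch` and `improve_offline` are hypothetical stand-ins for illustration only.
import random


def collect_batch(policy, B):
    # Stand-in for running one *fixed* policy in the real environment for B steps.
    return [(random.random(), policy(random.random())) for _ in range(B)]


def improve_offline(policy, dataset):
    # Stand-in for arbitrarily many offline policy updates on the data collected so far.
    return policy


def run(I=5, B=1_000):  # the paper's deployment-efficient settings use B = 100k or 200k
    policy = lambda s: 0.0              # randomly initialized data-collection policy
    dataset = []
    for _ in range(I):                  # deployment efficiency counts these iterations
        dataset += collect_batch(policy, B)
        policy = improve_offline(policy, dataset)
    print(f"deployments I = {I}, environment samples I * B = {I * B}")
    return policy


run()
```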

Recent deep RL literature seldom emphasizes deployment efficiency, with few exceptions in specific applications [27] where such a learning procedure is necessary. Although current state-of-the-art algorithms on continuous control have substantially improved sample or data efficiency, they have not been optimized for deployment efficiency. For example, SAC [22], an efficient model-free off-policy algorithm, performs half a million to one million policy deployments during learning on MuJoCo [59] benchmarks. ME-TRPO [31], a model-based algorithm, requires far fewer, roughly 100-300, policy deployments, although this is still relatively high for practical settings (we determined the number of deployments from the original implementations; the frequency of data collection is a tunable hyper-parameter). In our work, we demonstrate successful learning on standard benchmark environments with only 5-10 deployments.

4 Behavior-Regularized Model-Ensemble

To achieve high deployment efficiency, we propose Behavior-Regularized Model-ENsemble (BREMEN). BREMEN incorporates Dyna-style [58] model-based RL, learning an ensemble of dynamics models in conjunction with a policy using imaginary rollouts from the ensemble and behavior regularization via conservative trust-region updates.

4.1 Imaginary Rollout from Model Ensemble

As in recent Dyna-style model-based RL methods [31, 60], BREMEN uses an ensemble of $K$ deterministic dynamics models $\hat{f}_{\phi}=\{\hat{f}_{\phi_{1}},\dots,\hat{f}_{\phi_{K}}\}$ to alleviate the problem of model bias. Each model $\hat{f}_{\phi_{i}}$ is parameterized by $\phi_{i}$ and trained by the following objective, which minimizes the mean squared error between the predicted next state $\hat{f}_{\phi_{i}}(s_{t},a_{t})$ and the true next state $s_{t+1}$ over a dataset $\mathcal{D}$:

\[\min_{\phi_{i}}\;\frac{1}{|\mathcal{D}|}\sum_{(s_{t},a_{t},s_{t+1})\in\mathcal{D}}\frac{1}{2}\left\|s_{t+1}-\hat{f}_{\phi_{i}}(s_{t},a_{t})\right\|_{2}^{2}. \quad (1)\]
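
For concreteness, the following is a minimal PyTorch-style sketch (not the authors' released implementation) of training the ensemble with the objective of Eq. 1; the hidden width of 1,024 and the learning rate follow Appendix B.1, while the data shapes and training schedule are assumptions.

```python
import torch
import torch.nn as nn


def make_dynamics_model(state_dim, action_dim, hidden=1024):
    # Deterministic model f_phi_i(s, a) -> s' with two hidden layers (Appendix B.1).
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )


def train_ensemble(models, dataset, epochs=5, lr=1e-3, batch_size=256):
    """Minimize the MSE of Eq. 1 independently for each ensemble member."""
    s, a, s_next = dataset  # tensors of shape (N, state_dim), (N, action_dim), (N, state_dim)
    for f in models:
        opt = torch.optim.Adam(f.parameters(), lr=lr)
        for _ in range(epochs):
            for idx in torch.randperm(s.shape[0]).split(batch_size):  # fresh shuffle per member/epoch
                pred = f(torch.cat([s[idx], a[idx]], dim=-1))
                loss = 0.5 * ((s_next[idx] - pred) ** 2).sum(dim=-1).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
    return models


# Example with K = 5 models on random placeholder data.
K, state_dim, action_dim = 5, 17, 6
models = [make_dynamics_model(state_dim, action_dim) for _ in range(K)]
data = (torch.randn(1024, state_dim), torch.randn(1024, action_dim), torch.randn(1024, state_dim))
train_ensemble(models, data, epochs=1)
```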

During training of a policy $\pi_{\theta}$, imagined trajectories of states and actions are generated sequentially, using a dynamics model $\hat{f}_{\phi_{i}}$ that is randomly selected at each time step:

\[a_{t}\sim\pi_{\theta}(\cdot|\hat{s}_{t}),\qquad \hat{s}_{t+1}=\hat{f}_{\phi_{i}}(\hat{s}_{t},a_{t})\quad\text{where}\quad i\sim\{1,\dots,K\}. \quad (2)\]
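
The rollout procedure of Eq. 2 can then be sketched as follows (again schematic Python, not the released code); the placeholder models and the noisy tanh-mean policy at the end are assumptions, with the noise scale taken from Appendix B.2.1.

```python
import random
import torch
import torch.nn as nn


@torch.no_grad()
def imaginary_rollout(models, policy, init_states, horizon):
    """Eq. 2: sample a_t from pi_theta, then step a model chosen uniformly at random each step."""
    states, actions = [init_states], []
    s = init_states
    for _ in range(horizon):
        a = policy(s)                      # a_t ~ pi_theta(.|s_t)
        f = random.choice(models)          # i ~ {1, ..., K}, re-drawn at every time step
        s = f(torch.cat([s, a], dim=-1))   # s_{t+1} = f_phi_i(s_t, a_t)
        states.append(s)
        actions.append(a)
    return torch.stack(states), torch.stack(actions)


# Usage with placeholder linear models and a noisy tanh-mean policy (sigma = 0.1).
state_dim, action_dim, K = 17, 6, 5
models = [nn.Linear(state_dim + action_dim, state_dim) for _ in range(K)]
mean_net = nn.Linear(state_dim, action_dim)
policy = lambda s: torch.tanh(mean_net(s)) + 0.1 * torch.randn(s.shape[0], action_dim)
states, actions = imaginary_rollout(models, policy, init_states=torch.randn(8, state_dim), horizon=250)
```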

4.2 Policy Update with Behavior Regularization

In order to manage the discrepancy between the true dynamics and the learned model caused by the distribution shift in batch settings, we propose to use iterative policy updates via a trust-region constraint, re-initialized with a behavior-cloned policy after every deployment. Specifically, after each deployment, we are given an updated dataset of experience transitions $\mathcal{D}$. With this dataset, we approximate the true behavior policy $\pi_{b}$ through behavior cloning (BC), utilizing a neural network $\hat{\pi}_{\beta}$ parameterized by $\beta$, where we implicitly assume a fixed variance, a common practice in BC [47]:

\[\min_{\beta}\;\frac{1}{|\mathcal{D}|}\sum_{(s_{t},a_{t})\in\mathcal{D}}\frac{1}{2}\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}. \quad (3)\]
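
A minimal sketch of this BC step and the subsequent Gaussian initialization is shown below; the tanh activations and full-batch updates are assumptions, while the hidden width (200) and learning rate (0.0005) follow Appendix B.1.

```python
import torch
import torch.nn as nn


def behavior_clone(states, actions, hidden=200, epochs=100, lr=5e-4):
    """Fit the BC mean network of Eq. 3; a fixed unit variance is assumed, as in the text."""
    pi_beta = nn.Sequential(
        nn.Linear(states.shape[-1], hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, actions.shape[-1]),
    )
    opt = torch.optim.Adam(pi_beta.parameters(), lr=lr)
    for _ in range(epochs):  # full-batch gradient steps, for brevity
        loss = 0.5 * ((actions - pi_beta(states)) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pi_beta


# pi_theta_0 = Normal(pi_beta, 1): copy the BC network as the mean and set the log-std to log(1) = 0.
pi_beta = behavior_clone(torch.randn(256, 17), torch.randn(256, 6))
```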

After obtaining the estimated behavior policy, we initialize the target policy $\pi_{\theta}$ as a Gaussian policy with mean from $\hat{\pi}_{\beta}$ and standard deviation of $1$. This BC initialization in conjunction with gradient-descent-based optimization may be seen as implicitly biasing the optimized $\pi_{\theta}$ to be close to the data-collection policy [44], and thus works as a remedy for the distribution shift problem [50]. To further bias the learned policy to be close to the data-collection policy, we opt to use a KL-based trust-region optimization [52]. Therefore, the optimization of BREMEN becomes

\[\theta_{k+1} = \operatorname*{arg\,max}_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{k}},\hat{f}_{\phi_{i}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{k}}(a|s)}A^{\pi_{\theta_{k}}}(s,a)\right] \quad (4)\]
\[\text{s.t.}\quad \mathbb{E}_{s\sim\pi_{\theta_{k}},\hat{f}_{\phi_{i}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot|s)\,\|\,\pi_{\theta_{k}}(\cdot|s)\right)\right]\leq\delta,\qquad \pi_{\theta_{0}}=\mathrm{Normal}(\hat{\pi}_{\beta},1),\]

where $A^{\pi_{\theta_{k}}}(s,a)$ is the advantage of $\pi_{\theta_{k}}$ computed using model-based rollouts in the learned dynamics model and $\delta$ is the maximum step size.
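
As a schematic illustration of Eq. 4, the sketch below enforces the KL constraint with a plain gradient step followed by a backtracking line search; the actual implementation uses TRPO's natural-gradient machinery [52], which is omitted here, and the Gaussian policy class is an assumed stand-in (output squashing is omitted for simplicity).

```python
import copy
import torch
import torch.nn as nn
import torch.distributions as D


class GaussianPolicy(nn.Module):
    """Gaussian policy with an MLP mean and a global log-std (initial std = 1, matching pi_theta_0)."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # std = 1 at initialization

    def dist(self, s):
        return D.Normal(self.mean_net(s), self.log_std.exp())


def trust_region_step(policy, old_policy, states, actions, advantages, delta=0.05, init_step=1.0):
    """One KL-constrained update of Eq. 4, enforced by backtracking instead of a natural gradient."""
    with torch.no_grad():
        old_logp = old_policy.dist(states).log_prob(actions).sum(-1)
    ratio = (policy.dist(states).log_prob(actions).sum(-1) - old_logp).exp()
    surrogate = (ratio * advantages).mean()
    grads = torch.autograd.grad(surrogate, list(policy.parameters()))

    backup = [p.detach().clone() for p in policy.parameters()]
    step = init_step
    for _ in range(10):  # halve the step until the empirical KL constraint is satisfied
        with torch.no_grad():
            for p, g, b in zip(policy.parameters(), grads, backup):
                p.copy_(b + step * g)  # ascend the surrogate objective
            kl = D.kl_divergence(policy.dist(states), old_policy.dist(states)).sum(-1).mean()
        if kl <= delta:
            return
        step *= 0.5
    with torch.no_grad():  # no feasible step found: revert to the previous iterate
        for p, b in zip(policy.parameters(), backup):
            p.copy_(b)


# Usage with placeholder rollout data.
policy = GaussianPolicy(17, 6)
old_policy = copy.deepcopy(policy)
s, a, adv = torch.randn(512, 17), torch.randn(512, 6), torch.randn(512)
trust_region_step(policy, old_policy, s, a, adv)
```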

The combination of BC for initialization and finite iterative trust-region updates serves as an implicit KL regularization, as discussed in Section 4.3. This is in contrast to many previous offline RL algorithms that augment the value function with a penalty of explicit KL divergence [54, 61] or maximum mean discrepancy [30]. Empirically, we found that our regularization technique outperforms the explicit KL penalty (see Section 5.3).

By recursively applying this offline procedure, BREMEN can be used for deployment-efficient learning as shown in Algorithm 1: starting from a randomly initialized policy, it alternates between collecting experience data and performing offline policy updates.

Algorithm 1 BREMEN for Deployment-Efficient RL
0:  Empty datasets $\mathcal{D}_{all}$ and $\mathcal{D}$; initial parameters $\phi=\{\phi_{1},\dots,\phi_{K}\}$, $\beta$; number of policy optimization steps $T$; number of deployments $I$.
1:  Randomly initialize the target policy $\pi_{\theta}$.
2:  for deployment $i=1,\dots,I$ do
3:     Collect $B$ transitions in the true environment using $\pi_{\theta}$; set $\mathcal{D}\leftarrow\{(s_{t},a_{t},r_{t},s_{t+1})\}$ and add them to the dataset, $\mathcal{D}_{all}\leftarrow\mathcal{D}_{all}\cup\mathcal{D}$.
4:     Train $K$ dynamics models $\hat{f}_{\phi}$ on $\mathcal{D}_{all}$ via Eq. 1.
5:     Train the estimated behavior policy $\hat{\pi}_{\beta}$ on $\mathcal{D}$ by behavior cloning via Eq. 3.
6:     Re-initialize the target policy $\pi_{\theta_{0}}=\mathrm{Normal}(\hat{\pi}_{\beta},1)$.
7:     for policy optimization $k=1,\dots,T$ do
8:        Generate imaginary rollouts via Eq. 2.
9:        Optimize the target policy $\pi_{\theta}$ under Eq. 4 using the rollouts.
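
Putting the pieces together, the following schematic loop mirrors Algorithm 1 and reuses the sketches from Sections 4.1 and 4.2 (assumed to be in scope); `true_env_collect` is a hypothetical stand-in for real-environment data collection, and the advantage estimate is a placeholder where the actual implementation uses GAE with a linear-feature value function (Appendix B.1).

```python
import copy
import torch


def bremen(true_env_collect, state_dim, action_dim, I=5, B=200_000, T=2_000, K=5,
           delta=0.05, horizon=250, sigma=0.1):
    """Schematic BREMEN outer loop (Algorithm 1), composed from the sketches above."""
    policy = GaussianPolicy(state_dim, action_dim)          # line 1: random initial policy
    D_all = None
    for _ in range(I):                                      # line 2: one iteration per deployment
        batch = true_env_collect(policy, B)                 # line 3: (s, a, s') tensors from the env
        D_all = batch if D_all is None else tuple(torch.cat(x) for x in zip(D_all, batch))
        models = train_ensemble(                            # line 4: fit the ensemble on D_all
            [make_dynamics_model(state_dim, action_dim) for _ in range(K)], D_all)
        pi_beta = behavior_clone(batch[0], batch[1])        # line 5: BC on the latest batch D
        policy = GaussianPolicy(state_dim, action_dim)      # line 6: pi_theta_0 = Normal(pi_beta, 1)
        policy.mean_net.load_state_dict(pi_beta.state_dict())
        for _ in range(T):                                  # lines 7-9: offline policy optimization
            old = copy.deepcopy(policy)
            noisy = lambda s: policy.dist(s).mean + sigma * torch.randn(s.shape[0], action_dim)
            states, actions = imaginary_rollout(models, noisy, batch[0][:50], horizon)
            adv = torch.randn(horizon, states.shape[1])     # placeholder: GAE advantages omitted
            trust_region_step(policy, old,
                              states[:-1].reshape(-1, state_dim),
                              actions.reshape(-1, action_dim),
                              adv.reshape(-1), delta=delta)
    return policy
```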

4.3 Implicit KL Control from a Mathematical Perspective

We can intuitively understand that behavior cloning initialization with trust-region updates works as a regularization against distributional shift, and this can be supported by theory. Following the notation of Janner et al. [25], we denote the generalization error of a dynamics model on the state distribution under the true behavior policy as $\epsilon_{m}=\max_{t}\mathbb{E}_{s\sim d^{\pi_{b}}_{t}}D_{TV}\big(p(s_{t+1}|s_{t},a_{t})\,\|\,p_{\phi}(s_{t+1}|s_{t},a_{t})\big)$, where $D_{TV}$ represents the total variation distance between the true dynamics $p$ and the learned model $p_{\phi}$. We also denote the distribution shift on the target policy as $\max_{s}D_{TV}(\pi_{b}\,\|\,\pi)\leq\epsilon_{\pi}$. A bound relating the true returns $\eta[\pi]$ and the model returns $\hat{\eta}[\pi]$ on the target policy is given in Janner et al. [25] as

\[\eta[\pi]\;\geq\;\hat{\eta}[\pi]-\left[\frac{2\gamma r_{max}(\epsilon_{m}+2\epsilon_{\pi})}{(1-\gamma)^{2}}+\frac{4r_{max}\epsilon_{\pi}}{(1-\gamma)}\right]. \quad (5)\]

This bound guarantees improvement under the true returns as long as the improvement under the model returns increases by more than the slack in the bound due to $\epsilon_{m}$ and $\epsilon_{\pi}$ [25, 33].

We may relate this bound to the specific learning procedure employed by BREMEN, which includes dynamics model learning, behavior cloning policy initialization, and conservative KL-based trust-region policy updates. To do so, we consider an idealized version of BREMEN, where the expectations over states in Equations 1, 3, and 4 are replaced with supremums and the dynamics model is set to have unit variance.

Proposition 1 (Policy and model error bound).

Suppose we apply the idealized BREMEN to a dataset $\mathcal{D}$, and define $\epsilon_{\beta},\epsilon_{\phi}$ in terms of the behavior cloning and dynamics model losses as

\[\epsilon_{\beta} := \sup_{s}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}/2\right]-\mathcal{H}(\pi_{b}(\cdot|s)),\]
\[\epsilon_{\phi} := \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[\|s'-\hat{f}_{\phi}(s,a)\|_{2}^{2}/2\right]-\mathcal{H}(p(\cdot|s,a)),\]

where $\mathcal{H}$ denotes the Shannon entropy. If one then applies $T$ KL-based trust-region steps of step size $\delta$ (Equation 4) using stochastic dynamics models with mean $\hat{f}_{\phi}$ and standard deviation 1, then

\[\epsilon_{\pi}\leq\sqrt{\tfrac{1}{2}\epsilon_{\beta}+\tfrac{1}{4}\log 2\pi}+T\sqrt{\tfrac{1}{2}\delta}\,;\qquad \epsilon_{m}\leq\sqrt{\tfrac{1}{2}\epsilon_{\phi}+\tfrac{1}{4}\log 2\pi}.\]
Proof.

See Appendix A. ∎

5 Experiments

We evaluate BREMEN in both deployment-efficient settings, where the algorithm must learn a policy from scratch via a limited number of deployments, and offline RL, where the algorithm is given only a single static dataset. We use four standard continuous control benchmarks for offline RL [30, 61], namely, Ant, HalfCheetah, Hopper, and Walker2d on the MuJoCo physics simulator [59]. See Appendix B and C for further details and results.

5.1 Evaluating Deployment Efficiency

We compare BREMEN to ME-TRPO, SAC, BCQ, and BRAC applied to limited deployment settings. To adapt the offline methods (BCQ, BRAC) to this setting, we simply apply them in a recursive fashion (recursive BCQ and BRAC also perform behavior-cloning-based policy initialization after each deployment): at each deployment iteration, we collect a batch of data with the most recent policy and then run the offline update on this dataset. As for SAC, we simply change the replay buffer so that it is updated only at the specified deployment intervals. For the sake of comparison, we align the number of deployments and the amount of data collected at each deployment (either 100,000 or 200,000 transitions) across all methods.

Figure 2 shows the results with 200,000 (top) and 100,000 (bottom) batched transitions per deployment. Regardless of the environments and the batch size per update, BREMEN achieves remarkable performance while existing online and offline RL methods struggle to make any progress in the limited deployment settings. As a point of comparison, we also include results for online SAC and ME-TRPO without limits on the number of deployments but using the same number of transitions.

Figure 2: Comparison of BREMEN with existing methods (ME-TRPO, SAC, BCQ, BRAC) under deployment constraints (5-10 deployments with batch sizes of 200k and 100k). The average cumulative rewards and their standard deviations over 5 random seeds are shown. Vertical dotted lines indicate where each policy deployment and data collection happens. BREMEN is able to learn successful policies with only 5-10 deployments, while the state-of-the-art off-policy (SAC), model-based (ME-TRPO), and recursively-applied offline RL algorithms (BCQ, BRAC) often struggle to make any progress. For completeness, we show ME-TRPO (online) and SAC (online), which are their original learning curves without deployment constraints, plotted with respect to samples normalized by the batch size. While SAC (online) substantially outperforms BREMEN in sample efficiency, it uses 1 deployment per sample, leading to 100k-500k deployments required for learning. Interestingly, BREMEN achieves even better performance than the original ME-TRPO (online), suggesting the effectiveness of implicit behavior regularization. For SAC and ME-TRPO under deployment-constrained evaluation, the batch size between policy deployments differs substantially from their standard settings, and we therefore performed an extensive hyper-parameter search on the relevant parameters, such as the number of policy updates between deployments, as discussed in Appendix B.2.1.

5.2 Evaluating Offline Learning

We also evaluate BREMEN on standard offline RL benchmarks following Wu et al. [61]. We first train online SAC up to a certain cumulative reward threshold, 4,000 in HalfCheetah and 1,000 in Ant, Hopper, and Walker2d, and collect offline datasets. We evaluate agents with an offline dataset of one million (1M) transitions, which is standard for BCQ and BRAC [61]. We then evaluate them on much smaller datasets of 50K and 100K transitions, 5-10% of the size used in prior works.

Table 1 shows that BREMEN can achieve performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. Moreover, BREMEN can also learn appropriately with 10-20 times smaller datasets, where BCQ and BRAC are unable to exceed even the BC baseline. As a result, our recursive BREMEN algorithm is not only deployment-efficient but also sample-efficient, and significantly outperforms the baselines.

Table 1: Comparison of BREMEN to existing offline methods on static datasets. Each cell shows the average cumulative reward and its standard deviation; the datasets contain 1M, 100K, and 50K transitions, respectively. The maximum number of steps per episode is 1,000. BRAC applies a primal form of KL value penalty, and BRAC (max Q) denotes its variant that samples multiple actions and takes the maximum according to the learned Q function.
 
1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4126 1128 1376
BC 1321±141 4281±12 1341±161 1421±147
BCQ [16] 2021±31 5783±272 1130±127 2153±753
BRAC [61] 2072±285 7192±115 1422±90 2239±1124
BRAC (max Q) 2369±234 7320±91 1916±343 2409±1210
BREMEN (Ours) 3328±275 8055±103 2058±852 2346±230
ME-TRPO (offline) [31] 1258±550 1804±924 518±91 211±154
 
100,000 (100K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4066 1128 1376
BC 1330±81 4266±21 1322±109 1426±47
BCQ 1363±199 3915±411 1129±238 2187±196
BRAC -157±383 2505±2501 1310±70 2162±1109
BRAC (max Q) -226±387 2332±2422 1422±101 2164±1114
BREMEN (Ours) 1633±127 6095±370 2191±455 2132±301
ME-TRPO (offline) 974±4 2±434 307±170 10±61
 
50,000 (50K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4138 1128 1376
BC 1270±65 4230±49 1249±61 1420±194
BCQ 1329±95 1319±626 1178±235 1841±439
BRAC -878±244 -597±73 1277±102 976±1207
BRAC (max Q) -843±279 -590±56 1276±225 903±1137
BREMEN (Ours) 1347±283 5823±146 1632±796 2280±647
ME-TRPO (offline) 938±32 -73±95 152±13 176±343
 

5.3 Evaluating Effectiveness of Implicit KL Control

In this section, we present an experiment to better understand the effect of BREMEN's implicit regularization. Figure 3 shows the KL divergence of learned policies from the last deployed policy. We compare BREMEN to variants that replace BC initialization with an explicit KL penalty on the value (conservative KL trust-region updates are still used). We find that the variants with an explicit KL penalty but without behavior initialization learn policies that move farther away from the last deployed policy than behavior-initialized policies. This suggests that the implicit behavior regularization employed by BREMEN is more effective as a conservative policy learning protocol.

Figure 3: Average cumulative rewards (top) and the corresponding KL divergence between the last deployed policy and the target policy (bottom) with batch size 200K in limited deployment settings. The behavior-initialized policy remains close to the last deployed policy during improvement without an explicit value penalty $-\alpha D_{\mathrm{KL}}(\pi_{\theta}\|\hat{\pi}_{\beta})$. The explicit penalty is controlled by a coefficient $\alpha$.

6 Related Work

Deployment Efficiency and Offline RL

Although we are not aware of any previous work that explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may consider previously proposed semi-batch RL algorithms [12, 32, 55, 51] or theoretical analyses of switching cost in tabular PAC-MDP settings [2, 21] as approaching this issue. More recently, a related but distinct problem known as offline RL has gained popularity [33, 61, 1]. These offline RL works consider the extreme case of a single deployment, and typically collect the static batch with a partially trained policy rather than a random policy. While offline RL has shown promising results for a variety of real-world applications, such as robotics [38], dialogue systems [26], or medical treatments [17], these algorithms struggle when learning a policy from scratch or when the dataset is small. Nevertheless, common themes of many offline RL algorithms – regularizing the learned policy toward the behavior policy [16, 26, 30, 54, 61] and utilizing ensembles to handle uncertainty [30, 61] – served as inspirations for the proposed BREMEN algorithm. A major difference of BREMEN from prior works is that the target policy is not explicitly forced to stay close to the estimated behavior policy through the policy update. Rather, BREMEN employs a more implicit regularization by initializing the learned policy with a behavior-cloned policy and then applying conservative trust-region updates. Another major difference is the application of model-based approaches to fully offline settings, which has not been extensively studied in prior works [33], except for two concurrent works by Kidambi et al. [28] and Yu et al. [62] that study pessimistic or uncertainty-penalized MDPs with guarantees – closely related to Liu et al. [36]. By contrast, our work shows that a simple technique already enables model-based offline algorithms to significantly outperform prior model-free methods, and is, to the best of our knowledge, the first to define and extensively evaluate deployment efficiency with recursive experiments.

Model-Based RL

There are many types of model-based RL algorithms [58, 11, 23]. A simple algorithmic choice is Dyna-style [58], which uses a parameterized dynamics model to estimate the true MDP transition function, stochastically mapping states and actions to next states. The dynamics model can then serve as a simulator of the environment during policy updates. Dyna-style algorithms often suffer from distributional shift, also known as model bias, which leads RL agents to exploit regions where the data is insufficient and can cause significant performance degradation. A variety of remedies have been proposed to alleviate model bias, such as the use of multiple dynamics models as an ensemble [8, 31, 25], meta-learning [9], an energy-based model regularizer [4], a game-theoretic framework [48], and explicit reward penalties for unknown states [28, 62]. Notably, we employ a subset of these remedies – model ensembles and trust-region updates [31] – in BREMEN. Compared to existing works, our work is notable for using BC initialization in conjunction with trust-region updates to alleviate the distribution shift of the learned policy from the dataset used to train the dynamics model.

7 Conclusion

In this work, we introduced deployment efficiency, a novel measure of RL performance that counts the number of changes in the data-collection policy during learning. To enhance deployment efficiency, we proposed Behavior-Regularized Model-ENsemble (BREMEN), a novel model-based offline algorithm with implicit KL regularization via appropriate policy initialization and trust-region updates. BREMEN shows impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments, as it can improve policies offline even when the batch size is 10-20 times smaller than in prior works. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works [45, 13, 14]. Most critically, we show that under deployment efficiency constraints, most prior algorithms – model-free or model-based, online or offline – fail to achieve successful learning. We hope our work can steer the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC [22] while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning.

Broader Impact

Deployment efficiency is a key concept for real-world applications of RL because excessive policy deployments may be harmful or costly in robotics, health care, dialogue agents, or education. However, in most prior deep RL literature and benchmarks, this metric is seldom mentioned, and its disregard is sometimes exploited for the best sample efficiency while effectively allowing 100k-1M deployments. Our proposed algorithm BREMEN achieves a deployment efficiency of 5-10 deployments for learning the standard OpenAI Gym MuJoCo benchmarks, which no other algorithm can match. While the final performance is sometimes worse than deployment-unconstrained SAC, we hope our definition and benchmark for deployment efficiency can motivate further research by the community. Other impact questions are not applicable to this paper.

On the other hand, BREMEN still requires a few online deployments, which may still involve some risks. Fully safe and efficient RL that can learn from scratch remains an open problem.

Acknowledgments

We thank Yusuke Iwasawa, Emma Brunskill, Lihong Li, Sergey Levine, and George Tucker for insightful comments and discussion.

References

  • Agarwal et al. [2019] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
  • Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. In Advances in Neural Information Processing Systems, 2019.
  • Barth-Maron et al. [2018] Gabriel Barth-Maron, Matthew Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.
  • Boney et al. [2019] Rinu Boney, Juho Kannala, and Alexander Ilin. Regularizing model-based planning with energy-based models. In Conference on Robot Learning, 2019.
  • Chow et al. [2015] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, 2015.
  • Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in neural information processing systems, 2018.
  • Chow et al. [2019] Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.
  • Clavera et al. [2018] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.
  • Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Deisenroth and Rasmussen [2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
  • Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
  • Espeholt et al. [2019] Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. SEED RL: Scalable and efficient deep-rl with accelerated central inference. arXiv preprint arXiv:1910.06591, 2019.
  • Eysenbach et al. [2018] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations, 2018.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.
  • Gottesman et al. [2018] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
  • Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, 2016.
  • Gu et al. [2017a] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, 2017a.
  • Gu et al. [2017b] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017b.
  • Guo and Brunskill [2015] Zhaohan Guo and Emma Brunskill. Concurrent pac rl. In AAAI Conference on Artificial Intelligence, 2015.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
  • Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
  • Jaques et al. [2019] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
  • Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL : Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.
  • Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations, 2018.
  • Lange et al. [2012] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement learning. Springer, 2012.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Lillicrap et al. [2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Lin [1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 1992.
  • Liu et al. [2019] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
  • Mandel et al. [2014] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multiagent Systems, 2014.
  • Mandlekar et al. [2019] Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321, 2019.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Murphy et al. [2001] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 2001.
  • Nachum et al. [2018] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018.
  • Nachum et al. [2019] Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning, 2019.
  • Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. International Conference on Robotics and Automation, 2018.
  • Nagarajan and Kolter [2019] Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
  • Nair et al. [2015] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  • Precup et al. [2001] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, 2001.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • Rajeswaran et al. [2020] Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. arXiv preprint arXiv:2004.07804, 2020.
  • Ray et al. [2019] Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
  • Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics, 2011.
  • Roux [2016] Nicolas Le Roux. Efficient iterative policy optimization. arXiv preprint arXiv:1612.08967, 2016.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Siegel et al. [2020] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin A. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.
  • Singh et al. [1994] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings. Elsevier, 1994.
  • Singh et al. [1995] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, 1995.
  • Sohn et al. [2020] Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, and Craig Boutilier. BRPO: Batch residual policy optimization. arXiv preprint arXiv:2002.05522, 2020.
  • Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 1991.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Wang et al. [2019] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
  • Wu et al. [2019] Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning. arXiv preprint arXiv:1911.11361, 2019.
  • Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Appendix

A Proof of Proposition 1

We first consider $\epsilon_{\pi}$. The behavior cloning objective in its supremum form is

\begin{align*}
\epsilon_{\beta} &= \sup_{s\in\mathcal{D}}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[\left\|a_{t}-\hat{\pi}_{\beta}(s_{t})\right\|_{2}^{2}/2\right]-\mathcal{H}(\pi_{b}(\cdot|s)) \\
&= \sup_{s\in\mathcal{D}}\mathbb{E}_{a\sim\mathcal{D}(\cdot|s)}\left[-\log\pi_{\theta_{0}}(a|s)\right]-\mathcal{H}(\pi_{b}(\cdot|s))-\frac{1}{2}\log 2\pi \\
&= \sup_{s\in\mathcal{D}}D_{KL}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{0}}(\cdot|s)\big)-\frac{1}{2}\log 2\pi.
\end{align*}

We apply Pinsker's inequality to the true and estimated behavior policies to yield

\[\sup_{s}D_{TV}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{0}}(\cdot|s)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\beta}+\frac{1}{4}\log 2\pi}.\]

By the same Pinsker's inequality, we have

\[\sup_{s}D_{TV}\big(\pi_{\theta_{k}}(\cdot|s)\,\|\,\pi_{\theta_{k+1}}(\cdot|s)\big)\leq\sqrt{\delta/2}.\]

Therefore, by the triangle inequality, we have

\[\epsilon_{\pi}\leq\sup_{s}D_{TV}\big(\pi_{b}(\cdot|s)\,\|\,\pi_{\theta_{T}}(\cdot|s)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\beta}+\frac{1}{4}\log 2\pi}+T\sqrt{\frac{1}{2}\delta},\]

as desired.

We proceed similarly for $\epsilon_{m}$. The dynamics model loss is

\begin{align*}
\epsilon_{\phi} &= \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[\|s'-\hat{f}_{\phi}(s,a)\|_{2}^{2}/2\right]-\mathcal{H}(p(\cdot|s,a)) \\
&= \sup_{s,a}\mathbb{E}_{s'\sim\mathcal{D}(\cdot|s,a)}\left[-\log p_{\phi}(s'|s,a)\right]-\mathcal{H}(p(\cdot|s,a))-\frac{1}{2}\log 2\pi \\
&= \sup_{s,a}D_{KL}\big(p(\cdot|s,a)\,\|\,p_{\phi}(\cdot|s,a)\big)-\frac{1}{2}\log 2\pi.
\end{align*}

We apply Pinsker's inequality to the true dynamics and the learned model to yield

\[\epsilon_{m}\leq\sup_{s,a}D_{TV}\big(p(\cdot|s,a)\,\|\,p_{\phi}(\cdot|s,a)\big)\leq\sqrt{\frac{1}{2}\epsilon_{\phi}+\frac{1}{4}\log 2\pi},\]

as desired.

B Details of Experimental Settings

B.1 Implementation Details

For our baseline methods, we use the open-source implementations of SAC, BC, BCQ, and BRAC published by Wu et al. [61]. SAC and BRAC use (300, 300) Q-networks and (200, 200) policy networks. BC uses a (200, 200) policy network, and BCQ uses a (300, 300) Q-network, a (300, 300) policy network, and a (750, 750) conditional VAE. For online ME-TRPO, we utilize the codebase of the model-based RL benchmark [60]. BREMEN and online ME-TRPO use a policy consisting of two hidden layers with 200 units each; the dynamics model consists of two hidden layers with 1,024 units each. We use Adam [29] as the optimizer, with a learning rate of 0.001 for the dynamics model and 0.0005 for behavior cloning in BREMEN. In BREMEN and online ME-TRPO, we adopt a linear-feature value function to stabilize training. BREMEN in deployment-efficient settings takes about two to three hours per deployment on an NVIDIA TITAN V.
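
For reference, the architecture and optimizer settings above can be collected into a single configuration dictionary (a convenience restatement of this paragraph, not part of the released code); tuples are hidden-layer widths.

```python
# Hidden-layer widths and optimizer settings from Appendix B.1 (for reference only).
ARCHITECTURES = {
    "SAC / BRAC": {"q_network": (300, 300), "policy": (200, 200)},
    "BC": {"policy": (200, 200)},
    "BCQ": {"q_network": (300, 300), "policy": (300, 300), "conditional_vae": (750, 750)},
    "BREMEN / online ME-TRPO": {
        "policy": (200, 200),
        "dynamics_model": (1024, 1024),
        "value_function": "linear features",
        "optimizer": "Adam",
        "lr_dynamics_model": 1e-3,
        "lr_behavior_cloning": 5e-4,
    },
}
```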

To leverage neural networks as Dyna-style [58] dynamics models, we modify the reward and termination functions so that they do not depend on the internal physics engine, following the model-based benchmark codebase [60]; see Table 2. Note that the scores of the baselines (e.g., BCQ, BRAC) differ slightly from Wu et al. [61] due to this modification of the reward function. We re-ran each algorithm in our environments and confirmed appropriate convergence.

The maximum length of one episode is 1,000 steps, without any termination in Ant and HalfCheetah; the termination function is enabled in Hopper and Walker2d. The batch size of transitions for the policy update is 50,000 in BREMEN and ME-TRPO, following Kurutach et al. [31]. The batch size is 256 for BC and BRAC, and 100 for BCQ, also following Wu et al. [61].

Figure 4: Four standard MuJoCo benchmark environments used in our experiments: (a) Ant, (b) HalfCheetah, (c) Hopper, (d) Walker2d.
 
Environment Reward function Termination in rollouts
Ant $\dot{x}_{t}-0.1\|\bm{a}_{t}\|_{2}^{2}-3.0\times(z_{t}-0.57)^{2}+1$ False
HalfCheetah $\dot{x}_{t}-0.1\|\bm{a}_{t}\|_{2}^{2}$ False
Hopper $\dot{x}_{t}-0.001\|\bm{a}_{t}\|_{2}^{2}+1$ True
Walker2d $\dot{x}_{t}-0.001\|\bm{a}_{t}\|_{2}^{2}+1$ True
 
Table 2: Reward function and termination in rollouts in the experiments. We remove all contact information from observation of Ant, basically following Wang et al. [60].
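
As an illustration of how Table 2 decouples rewards from the simulator, the sketch below writes the reward functions and a Hopper-style termination check as plain Python; the observation indices for the x-velocity and torso height, and the termination thresholds, are illustrative assumptions that depend on the actual observation layout.

```python
import numpy as np

# Reward functions from Table 2 written without the physics engine, so they can be
# evaluated on imagined states. Observation indices are illustrative assumptions.

def ant_reward(obs, action, x_dot_idx=13, z_idx=0):
    x_dot, z = obs[x_dot_idx], obs[z_idx]
    return x_dot - 0.1 * np.sum(np.square(action)) - 3.0 * (z - 0.57) ** 2 + 1.0

def halfcheetah_reward(obs, action, x_dot_idx=8):
    return obs[x_dot_idx] - 0.1 * np.sum(np.square(action))

def hopper_reward(obs, action, x_dot_idx=5):      # Walker2d uses the same form (Table 2)
    return obs[x_dot_idx] - 0.001 * np.sum(np.square(action)) + 1.0

def hopper_done(obs, z_idx=0, angle_idx=1):
    # Illustrative termination check (thresholds assumed): stop when the torso falls or tilts.
    z, angle = obs[z_idx], obs[angle_idx]
    return not (np.isfinite(obs).all() and z > 0.7 and abs(angle) < 0.2)
```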

B.2 Hyper Parameters

In this section, we describe the hyper-parameters in both the deployment-efficient RL (Section B.2.1) and offline RL (Section B.2.2) settings. We run all of our experiments with five random seeds, and the results are averaged.

B.2.1 Deployment-Efficient RL

Table 3 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the maximum step size $\delta$ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor $\gamma$ and GAE $\lambda$, we follow Wang et al. [60].

 
Parameter Ant HalfCheetah Hopper Walker2d
Iteration per batch 2,000 2,000 6,000 2,000
Deployment 5 5 10 10
Total iteration 10,000 10,000 60,000 20,000
Rollouts length 250 250 1,000 1,000
Max step size $\delta$ 0.05 0.1 0.05 0.05
Discount factor $\gamma$ 0.99 0.99 0.99 0.99
GAE $\lambda$ 0.97 0.95 0.95 0.95
Stationary noise $\sigma$ 0.1 0.1 0.1 0.1
 
Table 3: Hyper-parameters of BREMEN in deployment-efficient settings.
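
Table 3 can equivalently be read as the per-environment configuration below (a restatement of the table), which also makes explicit that the total number of iterations equals the iterations per batch times the number of deployments.

```python
# Table 3 restated as per-environment configurations (deployment-efficient settings).
BREMEN_CONFIG = {
    "Ant":         dict(iters_per_batch=2000, deployments=5,  rollout_length=250,  delta=0.05, gae_lambda=0.97),
    "HalfCheetah": dict(iters_per_batch=2000, deployments=5,  rollout_length=250,  delta=0.10, gae_lambda=0.95),
    "Hopper":      dict(iters_per_batch=6000, deployments=10, rollout_length=1000, delta=0.05, gae_lambda=0.95),
    "Walker2d":    dict(iters_per_batch=2000, deployments=10, rollout_length=1000, delta=0.05, gae_lambda=0.95),
}
# Shared across environments: discount gamma = 0.99, stationary noise sigma = 0.1.
for env, cfg in BREMEN_CONFIG.items():
    print(env, "total iterations =", cfg["iters_per_batch"] * cfg["deployments"])
```
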
Number of Iterations for Policy Optimization

To achieve high deployment efficiency, the number of iterations for policy optimization between deployments is one of the important hyper-parameters for fast convergence. In the existing methods (BCQ, BRAC, SAC), we search over three values: {10,000, 50,000, 100,000}, and choose 10,000 in BCQ and BRAC, and 100,000 in SAC (Figure 5). For BREMEN, we also search over three values: {2,000, 4,000, 6,000}. Figure 6 shows the results of iteration search, and we choose 2,000 in Ant, HalfCheetah, and Walker2d, and 6,000 in Hopper.

Figure 5: Search on the number of iterations for SAC policy optimization between deployments. The number of transitions per data collection is 200K.
Figure 6: Search on the number of iterations for BREMEN policy optimization between deployments. The number of transitions per data collection is 200K.
Stationary Noise in BREMEN

To achieve effective exploration, a stochastic Gaussian policy is a good choice. We found that adding stationary Gaussian noise to the policy, both in the imaginary trajectories and during data collection, led to a notable improvement. The stationary Gaussian policy is written as

$$a_t = \tanh(\mu_\theta(s_t)) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$

Another choice is a learned Gaussian policy, which parameterizes not only $\mu_\theta$ but also $\sigma_\theta$. The learned Gaussian policy is written as

$$a_t = \tanh(\mu_\theta(s_t)) + \sigma_\theta(s_t) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$

We use the zero-mean Gaussian $\mathcal{N}(0, \sigma^2)$ and tune $\sigma$ on HalfCheetah in Figure 7, comparing the stationary and learned strategies. From this experiment, we found that stationary noise with a scale of 0.1 consistently performs well, and we therefore used it for all our experiments.
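The two sampling schemes compared above can be sketched as follows; here `mu` and `sigma_theta` stand for the policy-network outputs, and the NumPy-level formulation is an illustration rather than the released implementation.

```python
import numpy as np

def stationary_gaussian_action(mu, sigma=0.1):
    """Stationary noise: a fixed-scale Gaussian added to the tanh-squashed mean."""
    return np.tanh(mu) + np.random.normal(0.0, sigma, size=mu.shape)

def learned_gaussian_action(mu, sigma_theta, sigma=0.1):
    """Learned noise: a state-dependent scale sigma_theta(s) multiplies the sampled noise."""
    eps = np.random.normal(0.0, sigma, size=mu.shape)
    return np.tanh(mu) + sigma_theta * eps
```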

Figure 7: Search on the Gaussian noise parameter $\sigma$ in HalfCheetah. The number of transitions per data collection is 200K.
Other Hyper-parameters in the Existing Methods

For online ME-TRPO, we collect 3,000 steps of online interaction with the environment every 25 iterations and split these transitions into training and validation datasets at a 2-to-1 ratio for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split at a 1-to-1 ratio. In total, we run 12,500 policy-optimization iterations, which corresponds to 500 deployments of the policy. Note that we carefully tuned the hyper-parameters of online ME-TRPO, so its performance is improved over that reported in Wang et al. [60].

Table 4 and Table 5 show the tunable hyper-parameters of BCQ and BRAC, respectively. We follow Wu et al. [61] in choosing these values. In this work, BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.

 
Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 3e-05 3e-04 3e-06 3e-05
Perturbation range $\Phi$ 0.15 0.5 0.15 0.15
 
Table 4: Hyper-parameters of BCQ.
 
Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 1e-4 1e-3 3e-5 1e-5
Divergence penalty $\alpha$ 0.3 0.1 0.3 0.3
 
Table 5: Hyper-parameters of BRAC.

B.2.2 Offline RL

In the offline experiments, we apply the same hyper-parameters as in the deployment-efficient settings described above, except for the iterations per batch. Algorithm 2 gives pseudocode for BREMEN in the offline RL setting, where policies are updated only with one fixed batch dataset. The number of iterations $T$ is set to 6,250 for BREMEN and 500,000 for BC, BCQ, and BRAC.

Algorithm 2 BREMEN for Offline RL
Require: Offline dataset $\mathcal{D} = \{s_t, a_t, r_t, s_{t+1}\}$, initial parameters $\phi = \{\phi_1, \cdots, \phi_K\}$, $\beta$, number of policy optimization iterations $T$.
1: Train $K$ dynamics models $\hat{f}_\phi$ using $\mathcal{D}$ via Eq. 1.
2: Train the estimated behavior policy $\hat{\pi}_\beta$ using $\mathcal{D}$ by behavior cloning via Eq. 3.
3: Initialize the target policy $\pi_{\theta_0} = \mathrm{Normal}(\hat{\pi}_\beta, 1)$.
4: for policy optimization step $k = 1, \cdots, T$ do
5:    Generate imaginary rollouts.
6:    Optimize the target policy $\pi_\theta$ satisfying Eq. 4 with the rollouts.
7: end for
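As a reading aid, Algorithm 2 can be expressed as a high-level Python sketch. The callables passed in below are hypothetical stand-ins for the components defined in the main text (Eq. 1 for dynamics training, Eq. 3 for behavior cloning, and Eq. 4 for the trust-region update); this is not the released implementation.

```python
def bremen_offline(dataset, train_dynamics_model, behavior_clone, gaussian_policy,
                   imaginary_rollout, trpo_update, K=5, T=6250):
    """High-level sketch of Algorithm 2 with hypothetical helper callables."""
    # Line 1: train K dynamics models on the fixed offline dataset (Eq. 1).
    models = [train_dynamics_model(dataset, seed=i) for i in range(K)]
    # Line 2: estimate the behavior policy by behavior cloning (Eq. 3).
    pi_beta = behavior_clone(dataset)
    # Line 3: initialize the target policy as a Gaussian centered at the cloned policy.
    pi = gaussian_policy(mean=pi_beta, std=1.0)
    # Lines 4-7: repeated imaginary rollouts and conservative trust-region updates (Eq. 4).
    for _ in range(T):
        rollouts = imaginary_rollout(pi, models, dataset)
        pi = trpo_update(pi, rollouts)
    return pi
```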

C Additional Experimental Results

C.1 Performance on the Dataset with Different Noise

Following Wu et al. [61] and Kidambi et al. [28], we additionally compare BREMEN in offline settings to the other baselines (BC, BCQ, BRAC) on five datasets collected with different exploration noise; a rough sketch of the mixing scheme is given after the list. Each dataset also contains one million transitions.

  • eps1: 40% of the dataset is collected by the data-collection policy (a partially trained SAC policy) $\pi_b$, 40% is collected by an epsilon-greedy policy that takes a random action with probability $\epsilon = 0.1$, and 20% is collected by a uniformly random policy.

  • eps3: Same as eps1, except the epsilon-greedy policy uses $\epsilon = 0.3$: 40% of the dataset is collected by $\pi_b$, 40% by the epsilon-greedy policy, and 20% by a uniformly random policy.

  • gaussian1: 40% of the dataset is collected by the data-collection policy $\pi_b$, 40% by a policy that adds zero-mean Gaussian noise $\mathcal{N}(0, 0.1^2)$ to each action sampled from $\pi_b$, and 20% by a uniformly random policy.

  • gaussian3: Same as gaussian1, but with zero-mean Gaussian noise $\mathcal{N}(0, 0.3^2)$.

  • random: All of the dataset is collected by a uniformly random policy.
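The sketch below illustrates the 40/40/20 mixing scheme described above. Here `collect(env, policy, n)` is a hypothetical helper that rolls out the given policy in the real environment for `n` transitions, and the gym-style `action_space` attributes are assumptions; the exact collection procedure follows Wu et al. [61].

```python
import numpy as np

def build_noisy_dataset(env, pi_b, collect, n_total=1_000_000, noise="eps", scale=0.1):
    """40% behavior policy pi_b, 40% a noisy variant of it, 20% uniform random actions."""
    if noise == "eps":   # eps1 / eps3: random action with probability `scale`
        noisy = lambda s: env.action_space.sample() if np.random.rand() < scale else pi_b(s)
    else:                # gaussian1 / gaussian3: additive zero-mean Gaussian noise
        noisy = lambda s: pi_b(s) + np.random.normal(0.0, scale, size=env.action_space.shape)
    random_pi = lambda s: env.action_space.sample()

    return (collect(env, pi_b, int(0.4 * n_total))
            + collect(env, noisy, int(0.4 * n_total))
            + collect(env, random_pi, int(0.2 * n_total)))
```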

Table 6 shows that BREMEN also achieves performance competitive with the state-of-the-art model-free offline RL algorithms even with noisy datasets. The training curves for each experiment are shown in Section C.4.

Table 6: Comparison of BREMEN to the existing offline methods, namely BC, BCQ [16], and BRAC [61], in offline settings. Each cell shows the average cumulative reward and its standard deviation over 5 seeds. The maximum number of steps per episode is 1,000. Five different types of exploration noise are introduced during data collection: eps1, eps3, gaussian1, gaussian3, and random. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.
 
Noise: eps1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1077 2936 791 815
BC 1381±71 3788±740 266±486 1185±155
BCQ 1937±116 6046±276 800±659 479±537
BRAC 2693±155 7003±118 1243±162 3204±103
BRAC (max Q) 2907±98 7070±81 1488±386 3330±147
BREMEN (Ours) 3519±129 7585±425 2818±76 1710±429
ME-TRPO (offline) 1514±503 1009±731 1301±654 128±153
 
Noise: eps3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 936 2408 662 648
BC 1364±121 2877±797 519±532 1066±176
BCQ 1938±21 5739±188 1170±446 1018±1231
BRAC 2718±90 6434±147 1224±71 2921±101
BRAC (max Q) 2913±87 6672±136 2103±746 3079±110
BREMEN (Ours) 3409±218 7632±104 2803±65 1586±139
ME-TRPO (offline) 1843±674 5504±67 1308±756 354±329
 
Noise: gaussian1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1072 3150 882 1070
BC 1279±80 4142±189 31±16 1137±477
BCQ 1958±76 5854±498 475±416 608±416
BRAC 2905±81 7026±168 1456±161 3030±103
BRAC (max Q) 2910±157 7026±168 1575±89 3242±97
BREMEN (Ours) 2912±165 7928±313 1999±617 1402±290
ME-TRPO (offline) 1275±656 1275±656 909±631 171±119
 
Noise: gaussian3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1058 2872 781 981
BC 1300±34 4190±69 611±467 1217±361
BCQ 1982±97 5781±543 1137±582 258±286
BRAC 3084±180 3933±2740 1432±499 3253±118
BRAC (max Q) 2916±99 3997±2761 1417±267 3372±153
BREMEN (Ours) 3432±185 8124±145 1867±354 2299±474
ME-TRPO (offline) 1237±310 2141±872 973±243 219±145
 
Noise: random, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 470 -285 34 2
BC 989±10 -2±1 106±62 108±110
BCQ 1222±114 2887±242 206±7 228±12
BRAC 1057±92 3449±259 227±30 29±54
BRAC (max Q) 683±57 3418±171 224±37 26±50
BREMEN (Ours) 905±11 3627±193 270±68 254±6
ME-TRPO (offline) 2221±665 2701±120 321±29 262±13
 

C.2 Comparison among Different Numbers of Ensembles

To deal with the distribution shift during policy optimization, also known as model bias, we introduce dynamics model ensembles. We validate the performance of BREMEN with different numbers of dynamics models $K$. Figure 8 and Figure 9 show the performance of BREMEN with different numbers of ensemble members in the deployment-efficient and offline settings, respectively. Ensembles with more dynamics models resulted in better performance, due to the mitigation of distributional shift, except for $K = 10$; we therefore choose $K = 5$.
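The role of the ensemble during imaginary rollouts can be illustrated with a short sketch in which one of the $K$ learned models is sampled per simulated step, a common ME-TRPO-style scheme; the `predict` method is a hypothetical single-model API.

```python
import random

def imaginary_step(state, action, models):
    """One step of an imaginary rollout: randomly pick one of the K dynamics models
    to predict the next state, so no single model's bias dominates the rollout."""
    return random.choice(models).predict(state, action)
```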

Figure 8: Comparison of the number of dynamics models in deployment-efficient settings.
Figure 9: Comparison of the number of dynamics models in offline settings.

C.3 Implicit KL Control in Offline Settings

Similar to Section 5.3, we present offline RL experiments to better understand the effect of implicit KL regularization. In contrast to the implicit KL regularization of Eq. 4, the optimization objective of BREMEN with an explicit KL value penalty becomes

$$\theta_{k+1} = \operatorname*{arg\,max}_{\theta}\;\mathbb{E}_{s,a\sim\pi_{\theta_k},\hat{f}_{\phi_i}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}\Big(A^{\pi_{\theta_k}}(s,a)-\alpha D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|s)\,\|\,\hat{\pi}_{\beta}(\cdot|s)\big)\Big)\right] \qquad (6)$$
$$\text{s.t.}\quad\mathbb{E}_{s\sim\pi_{\theta_k}}\left[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|s)\,\|\,\pi_{\theta_k}(\cdot|s)\big)\right]\leq\delta,$$

where $A^{\pi_{\theta_k}}(s,a)$ is the advantage of $\pi_{\theta_k}$ computed using imaginary rollouts with the learned dynamics model and $\delta$ is the maximum step size. Note that BREMEN with the explicit KL penalty does not utilize behavior cloning initialization.
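For diagonal Gaussian policies, the penalty term $\alpha D_{\mathrm{KL}}(\pi_\theta(\cdot|s)\,\|\,\hat{\pi}_\beta(\cdot|s))$ in Eq. 6 has a closed form; a minimal sketch of the penalized advantage is given below, where the policy means and standard deviations are assumed to be available as arrays.

```python
import numpy as np

def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) for diagonal Gaussians, summed over dims."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return np.sum(np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5)

def penalized_advantage(adv, mu_pi, std_pi, mu_beta, std_beta, alpha):
    """Advantage with the explicit behavior-KL value penalty of Eq. 6."""
    return adv - alpha * diag_gaussian_kl(mu_pi, std_pi, mu_beta, std_beta)
```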

We empirically conclude that the explicit penalty $-\alpha D_{\mathrm{KL}}(\pi_\theta(\cdot|s)\,\|\,\hat{\pi}_\beta(\cdot|s))$ is unnecessary, and that the TRPO update with behavior-cloning initialization as implicit regularization is sufficient in the BREMEN algorithm. Figure 10 shows the KL divergence between the learned policies and the last deployed policies (second row) and the model errors, measured by the mean squared error between the predicted and true next states (bottom row). We find that the behavior-initialized policy with conservative KL trust-region updates stays close to the last deployed policy during improvement, even without an explicit KL penalty. The policy initialized with behavior cloning also tends to suppress the increase of model error, which implies that behavior initialization alleviates the effect of the distribution shift. In Walker2d, the model error of BREMEN is relatively large, which may relate to its poor performance with noisy datasets in Section C.1.

Figure 10: Average cumulative rewards (top row), corresponding KL divergence of the learned policies from the last deployed policy (second row), and model errors (bottom row) in offline settings with the 1M dataset (no noise). The behavior-initialized policy (purple line) suppresses the policy divergence and model error during training better than no initialization (red line) or the explicit KL penalty (green line).

C.4 Training Curves for Offline RL with Different Noises

In this section, we present the training curves for all of our experiments in offline settings. Figure 11 shows the results in Section 5.2. Figures 12, 13, 14, 15, and 16 show the results in Section C.1.

Figure 11: Performance in offline RL experiments (Table 1). Dataset size is 1M (top row), 100K (second row), and 50K (bottom row). Note that the x-axis is the number of policy-optimization iterations on a log scale.
Figure 12: Performance in offline RL experiments with $\epsilon$-greedy dataset noise, $\epsilon = 0.1$. Dataset size is 1M.
Figure 13: Performance in offline RL experiments with $\epsilon$-greedy dataset noise, $\epsilon = 0.3$. Dataset size is 1M.
Figure 14: Performance in offline RL experiments with Gaussian dataset noise $\mathcal{N}(0, 0.1^2)$. Dataset size is 1M.
Figure 15: Performance in offline RL experiments with Gaussian dataset noise $\mathcal{N}(0, 0.3^2)$. Dataset size is 1M.
Figure 16: Performance in offline RL experiments with completely random behaviors. Dataset size is 1M.

C.5 Deployment-Efficient RL Experiment with Different Reward Function

In addition to the main results in Section 5.1 (Figure 2), we also evaluate BREMEN in the deployment-efficient setting with a different reward function. We modified the HalfCheetah environment into one similar to the cheetah-run task in the DeepMind Control Suite (https://github.com/deepmind/dm_control/blob/master/dm_control/suite/cheetah.py). The reward function is defined as

$$r_t = \begin{cases} 0.1\,\dot{x}_t & (0 \leq \dot{x}_t \leq 10) \\ 1 & (\dot{x}_t > 10), \end{cases}$$

and the termination is turned off. Figure 17 shows the performance of BREMEN and the existing methods. BREMEN again shows better deployment efficiency than the other existing offline methods and online ME-TRPO, except for SAC, matching the trend of the main results.
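The piecewise reward above amounts to a saturating velocity reward; a small sketch follows, where `x_vel` is the torso forward velocity and the clipping at zero for negative velocities is our assumption, mirroring the linear-tolerance reward of dm_control's cheetah-run.

```python
def modified_cheetah_reward(x_vel):
    """Saturating velocity reward: 0.1 * x_vel up to a velocity of 10, then 1."""
    return max(0.0, min(0.1 * x_vel, 1.0))
```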

Figure 17: Performance in deployment-efficient RL experiments with a different reward function for HalfCheetah.