In this project, we combine behavior cloning and world models, two techniques that are frequently used for their in-environment sample efficiency.
Building on IRIS
One key challenge in reinforcement learning is its sample inefficiency. Popular on-policy algorithms like REINFORCE or PPO frequently require millions of optimization steps and interactions with an environment, making them a subpar choice for computationally expensive environments or real-world rollouts.
Recent approaches in model-based RL, like Dreamer-v3, address this by learning a world model of the environment and training the actor largely on imagined rollouts inside that model, substantially reducing the number of real environment interactions required.
Simultaneously, some research has used behavior cloning (BC), which reframes decision-making as a supervised learning task, to avoid expensive and sample-inefficient reinforcement learning training runs altogether. Behavior cloning is also frequently used to initialize agents for further training, which can both improve final performance and shorten training time.
Combining these approaches leads to multiple potential benefits:
Firstly, during the initial epochs, the world model is typically trained on rollouts from an untrained actor. While this might lead to more exploration, the world model cannot learn about later stages of gameplay or the consequences of good actions until the actor starts improving. Using a BC actor may allow us to collect better and more representative trajectories for learning the world model.
Secondly, while BC is often a fast way to obtain a strong initialization, it frequently fails to generalize to states outside its training dataset. By continuing to train the actor with on-policy learning inside the world model, we can make it more robust and improve its performance.
We implement and evaluate this combined training using IRIS and three different Atari environments from the Atari100k benchmark. The Atari100k benchmark evaluates agents' performance after 100k in-environment steps, approximately equivalent to 2 hours of real gameplay time.
In this blog post, we will first outline our experiments learning world models in state space and why we abandoned this approach. Then, we will elaborate on our approach to behavior cloning and how we collected samples. We will then explain how our system is put together end-to-end and what steps were taken to improve on a naive BC initialization. Finally, we will discuss our results and takeaways.
We built our system around IRIS because of its relatively simple architecture. While Dreamer-v3 uses many complicated tricks and components to train its model, IRIS is primarily built out of three components: an autoencoder, a GPT-like decoder-only transformer that acts as the world model, and an actor network.
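As a rough sketch of how these pieces relate (layer sizes and internals below are simplified placeholders, not the actual IRIS code), the three components can be pictured like this:

```python
import torch.nn as nn

# Illustrative skeletons of the three IRIS components. The public implementation
# (a discrete VQ-VAE tokenizer, a GPT-style transformer, and a CNN/LSTM
# actor-critic) is considerably more involved; shapes here are placeholders.

class Tokenizer(nn.Module):
    """Autoencoder that compresses 64x64 RGB frames into a short sequence of
    discrete tokens and reconstructs frames from those tokens."""
    def __init__(self, vocab_size=512, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 6 * 6, latent_dim))
        self.codebook = nn.Embedding(vocab_size, latent_dim)     # discrete codes (VQ-style)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 64 * 3), nn.Sigmoid())    # stand-in for a conv decoder

class WorldModel(nn.Module):
    """Decoder-only transformer over interleaved frame tokens and actions that
    predicts the next frame tokens, the reward, and episode termination."""
    def __init__(self, vocab_size=512, d_model=256, num_actions=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + num_actions, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=4)  # applied with a causal mask
        self.next_token_head = nn.Linear(d_model, vocab_size)
        self.reward_head = nn.Linear(d_model, 3)                  # e.g. sign of the reward
        self.done_head = nn.Linear(d_model, 2)

class ActorCritic(nn.Module):
    """Policy (and value) network that acts on frames, both inside the world
    model's imagination and in the real environment."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(), nn.Flatten())
        self.policy_head = nn.Linear(64 * 6 * 6, num_actions)
        self.value_head = nn.Linear(64 * 6 * 6, 1)
```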
The original Atari2600 console represented each game’s state using 128 bytes in its RAM, a significantly smaller representation than pixel space observations. As an initial stage in our project, we wanted to investigate the feasibility of learning a world model using only the environment’s RAM output instead of images. By taking this approach, we could potentially cut down on the amount of compute needed and eliminate the autoencoder from IRIS’ architecture, thereby reducing the number of components that need to be trained in tandem.
Our experiments for this used RAM-only environments and matched the frame-stacking logic used in pixel-space IRIS to ensure the model has access to temporal information. We used expert rollouts in pixel space for our BC data collection and accessed the underlying ALE (Arcade Learning Environment) to collect the corresponding RAM observations.
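For reference, the RAM state can be read directly from the emulator; a minimal sketch using gymnasium and ale-py (the environment id is just an example) looks like this:

```python
import gymnasium as gym
import ale_py  # noqa: F401 -- registers the ALE/* environments
import numpy as np

# Minimal sketch of reading the 128-byte console RAM next to the usual pixel
# observations.
env = gym.make("ALE/Breakout-v5")
obs, _ = env.reset(seed=0)

ram = env.unwrapped.ale.getRAM()         # uint8 array of shape (128,)
assert ram.shape == (128,) and ram.dtype == np.uint8

# ALE can also return the RAM directly as the observation:
ram_env = gym.make("ALE/Breakout-v5", obs_type="ram")
ram_obs, _ = ram_env.reset(seed=0)       # the same 128-byte vector as the observation
```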
However, our experimental runs showed that when attempting to learn from RAM alone, we failed to learn a good world model and, consequently, a good agent.
Unfortunately, this result seems in line with other empirical findings on using Atari RAM observations when training traditional RL agents.
To build our system for BC pre-training, we need expert datasets. Since we are working on Atari environments, we used available pre-trained agents. We made this choice for simplicity; in a setting without a pre-trained expert, human demonstrations or other existing data sources could be used instead.
To be able to learn well using behavior cloning, we need to ensure that the training data covers as much of the state space as possible. We additionally need to avoid some common pitfalls, like including contradictory data (different expert actions given the same observation). And, since we intend to use this BC pre-training as initialization for downstream world model training, we need to match the shape of our collected observations to the shape of observations that IRIS uses.
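As one illustration of the contradictory-data check (a sketch rather than our exact pipeline, with hypothetical helper names), conflicting observation/action pairs can be detected by hashing observations and keeping only the majority action:

```python
from collections import Counter, defaultdict
import hashlib
import numpy as np

def filter_contradictions(dataset):
    """Drop ambiguous samples: if the same observation appears with several
    different expert actions, keep only the samples with the majority action.
    `dataset` is a list of (observation: np.ndarray, action: int) pairs."""
    def key(obs):
        return hashlib.sha1(np.ascontiguousarray(obs).tobytes()).hexdigest()

    actions_per_obs = defaultdict(Counter)
    for obs, action in dataset:
        actions_per_obs[key(obs)][action] += 1

    filtered = []
    for obs, action in dataset:
        majority_action, _ = actions_per_obs[key(obs)].most_common(1)[0]
        if action == majority_action:
            filtered.append((obs, action))
    return filtered
```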
While collecting the data, we incorporated improvements based on prior work in BC for Atari.
Unlike most current on-policy Atari agents, IRIS uses 64 x 64 observations with three color channels, which leads to additional complexity when collecting data through the gym environment.
We attempted to use experts from stable-baselines zoo, stable-baselines3 zoo, and Atari Agents before finally settling on AtariARI.
We overcame this hurdle by running two environments initialized with the same seed in parallel. We used the first environment to provide grayscale observations to the expert and collected color observations for our dataset from the second environment.
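A condensed sketch of this collection loop (using gymnasium; the wrapper settings, environment id, and expert interface are illustrative placeholders) could look like this:

```python
import gymnasium as gym
import ale_py  # noqa: F401 -- registers the ALE/* environments
from gymnasium.wrappers import AtariPreprocessing

# Paired-environment collection: both copies receive the same seed and the same
# actions (sticky actions disabled), so they stay in lockstep while exposing
# differently shaped observations.
GAME, SEED = "ALE/Breakout-v5", 1234

expert_env = AtariPreprocessing(
    gym.make(GAME, frameskip=1, repeat_action_probability=0.0),
    screen_size=84, grayscale_obs=True)        # what the pre-trained expert expects
record_env = AtariPreprocessing(
    gym.make(GAME, frameskip=1, repeat_action_probability=0.0),
    screen_size=64, grayscale_obs=False)       # 64x64 RGB frames, as IRIS expects

class RandomExpert:
    """Stand-in for a pre-trained expert policy."""
    def __init__(self, action_space):
        self.action_space = action_space
    def act(self, obs):
        return self.action_space.sample()

expert = RandomExpert(expert_env.action_space)  # replace with a real pre-trained agent

expert_obs, _ = expert_env.reset(seed=SEED)
record_obs, _ = record_env.reset(seed=SEED)

dataset, done = [], False
while not done:
    action = expert.act(expert_obs)              # grayscale view goes to the expert
    dataset.append((record_obs, action))         # color view goes into the BC dataset
    expert_obs, _, terminated, truncated, _ = expert_env.step(action)
    record_obs, *_ = record_env.step(action)     # replay the same action in the twin env
    done = terminated or truncated
```

Because both environments are seeded identically and receive identical actions, the recorded color frames correspond exactly to the states the expert acted in.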
When initially pretraining our agent using behavior cloning, we quickly reach train accuracies of up to 99%, but with significant overfitting: evaluation accuracy lags far behind. There are several possible reasons for this. Firstly, the environments are simple, but the state space is large, and despite the tricks used when collecting the train set, we may not have sufficient coverage for the model to generalize well. Secondly, there may be ambiguous observations where multiple moves are legitimate; for these, the model simply memorizes the action seen in the train set.
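For concreteness, the pre-training itself is plain supervised classification of expert actions; a condensed sketch of one training or evaluation pass (function and argument names are hypothetical, and the actor is assumed to return action logits) is:

```python
import torch
import torch.nn.functional as F

def bc_pass(actor, loader, optimizer=None):
    """One behavior-cloning pass: cross-entropy between the actor's action logits
    and the expert's actions. With optimizer=None the pass is evaluation-only,
    which is how the gap between train and evaluation accuracy can be tracked."""
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(optimizer is not None):
        for obs, expert_action in loader:        # obs: (B, 3, 64, 64), expert_action: (B,)
            logits = actor(obs)
            loss = F.cross_entropy(logits, expert_action)
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * obs.shape[0]
            correct += (logits.argmax(dim=-1) == expert_action).sum().item()
            seen += obs.shape[0]
    return total_loss / seen, correct / seen     # mean loss, accuracy
```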
We first implemented a naive version of BC initialization for IRIS. After completing our BC agent pre-training, we loaded the pre-trained agent’s weights and launched a standard IRIS training run.
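In code, this naive variant amounts to a one-time weight transfer before training starts; a sketch, assuming the BC policy mirrors the architecture of IRIS's actor-critic network (path and attribute names are placeholders for the objects in the IRIS training script):

```python
import torch

def initialize_actor_from_bc(agent, checkpoint_path="bc_actor.pt"):
    """Naive BC initialization: copy the pre-trained BC policy weights into the
    IRIS actor before the standard training run starts."""
    bc_state = torch.load(checkpoint_path, map_location="cpu")
    # strict=False: the BC checkpoint carries no trained value head, so any
    # value-function parameters keep their fresh initialization.
    missing, unexpected = agent.actor_critic.load_state_dict(bc_state, strict=False)
    return missing, unexpected
```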
However, there are additional considerations. Notably, the IRIS actor uses a value function for advantage estimation, and this value function is untrained at the start of IRIS training. The world model itself may not be performing well yet either. The combination of an untrained value function and a not-yet-reliable world model could lead to subpar gradient updates and a collapse of the pre-trained actor, similar to what we observed in our homework assignment 2.
Thus, we experimented with including an additional BC loss term based on our collected dataset as part of the actor’s policy loss. A weighting factor, alpha, determines the balance between the BC and imagination losses. Concretely, the policy loss becomes:
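$$\mathcal{L}_{\text{policy}} = (1 - \alpha)\,\mathcal{L}_{\text{imagination}} + \alpha\,\mathcal{L}_{\text{BC}},$$

where $\mathcal{L}_{\text{BC}}$ is the cross-entropy between the actor's action distribution and the expert actions in our dataset, $\mathcal{L}_{\text{imagination}}$ is the usual IRIS policy loss computed on imagined rollouts, and we write the weighting as a convex combination. The scheduling factors referenced in the results below anneal $\alpha$ over the course of training; illustrative versions (the exact constants are placeholders, not the values used in our runs) are:

```python
import math

# Illustrative schedules for the BC weight alpha.
def constant_alpha(epoch, alpha0=0.5):
    return alpha0

def cosine_alpha(epoch, alpha0=0.5, total_epochs=600):
    # decays smoothly from alpha0 to 0 over the course of training
    return alpha0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def exponential_alpha(epoch, alpha0=0.5, decay=0.99):
    return alpha0 * decay ** epoch
```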
In Table 1, we show the evaluation results of the different training configurations across environments, averaged over 64 trajectories. At 20k environment steps, the best BC configuration outperforms the IRIS baseline (No BC) in every environment. In Breakout, BC regularization with a constant scheduling factor performs best; in Demon Attack, naive BC outperforms the regularized versions; and in Ms Pacman, BC regularization with a cosine scheduling factor performs best.
| Environment | No BC | Naive BC | BC Regularization (Cosine) | BC Regularization (Exponential) | BC Regularization (Constant) |
|---|---|---|---|---|---|
| Breakout (20k steps) | 6.062 | 2.703 | 12.296 | 12.031 | 12.859 |
| Breakout (100k steps)¹ | 84 | - | - | - | - |
| Demon Attack (20k steps) | 179.453 | 769.921 | 283.20 | 146.718 | 184.921 |
| Demon Attack (100k steps)¹ | 2034 | - | - | - | - |
| Ms Pacman (20k steps) | 370.156 | 445.0 | 729.062 | 378.437 | 544.843 |
| Ms Pacman (100k steps)¹ | 999 | - | - | - | - |
Table 1: Evaluation results of the different training configurations across environments, averaged over 64 trajectories. Rows marked ¹ list IRIS performance after 100k environment steps for reference.
Figures 6–8 below compare the No BC, Naive BC, Cosine, Exponential, and Constant configurations.
Figure 6: Evaluation returns during training in the Breakout environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Figure 7: Evaluation returns during training in the Demon Attack environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Figure 8: Evaluation returns during training in the Ms Pacman environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Based on the results, we find that initializing the actor via behavior cloning before world model training can have a significant positive effect. In all three environments, we recorded improvements over the IRIS baseline while retaining the in-environment sample efficiency of both approaches. This holds for nearly all experiments using BC, though there is significant variability between runs.
Even though we only train for 20k in-environment steps, roughly equivalent to 24 minutes of real gameplay time and a fraction of the training time (one day vs. seven days), our agent makes substantial headway towards the IRIS 100k benchmark on Ms Pacman and progresses considerably faster than our own IRIS baseline in Demon Attack.
However, the training results do not clearly indicate whether adding a BC term to the policy loss (BC regularization) makes a significant difference: naive BC outperforms the regularized versions on Demon Attack, whereas regularization with a cosine-decayed factor performs best on Ms Pacman.
Furthermore, since we could not train our approach for a duration comparable to the IRIS 100k benchmark (7 days), we cannot conclude whether it leads to higher final performance or simply reaches a plateau sooner.
While our results are positive, there are several caveats. Firstly, modern approaches to behavior cloning often use architectures such as VAEs or Behavior Transformers (BeT) to capture multimodal expert data more faithfully. By contrast, most world-model approaches use simpler policy networks for on-policy learning, which limits how well we can capture the expert data.
World models also tend to rely on very specific tricks to work. In the case of IRIS, many context frames and resized, full-color observations are used, which sets it apart from most other Atari agents. Since we generated our own data, we could match these requirements. Still, in many scenarios where BC is applicable, we have to use whatever data is already available, which makes downstream world model training more difficult.
Finally, each environment has its own quirks, meaning our approach needs to be adjusted for every new environment we want to learn.