In this project, we combine behavior cloning and world models, two techniques that are frequently used for their in-environment sample efficiency.
Building on IRIS
One key challenge in reinforcement learning is its sample inefficiency. Popular on-policy algorithms like REINFORCE or PPO frequently require millions of optimization steps and interactions with an environment, making them a subpar choice for computationally expensive environments or real-world rollouts.
Recent approaches in model-based RL, like Dreamer-v3, address this by learning a world model of the environment and training the actor largely on imagined rollouts inside that model, substantially reducing the number of real environment interactions required.
Simultaneously, some research has used behavior cloning (BC), which reframes decision-making as a supervised learning task, to avoid expensive and sample-inefficient reinforcement learning training runs altogether. Behavior cloning is also frequently used to initialize agents for further training, which can both improve final performance and shorten training time.
Combining these approaches leads to multiple potential benefits:
Firstly, during the initial epochs, the world model is typically trained on rollouts from an untrained actor. While this might lead to more exploration, the world model cannot learn about later stages of gameplay or the consequences of good actions until the actor starts improving. Using a BC actor may allow us to collect better and more representative trajectories for learning the world model.
Secondly, while BC is often a fast way to obtain a strong initialization, it frequently fails to generalize to states outside its training dataset. By continuing to train the actor with on-policy learning inside the world model, we can make it more robust and improve its performance.
We implement and evaluate this combined training using IRIS and three different Atari environments from the Atari100k benchmark. The Atari100k benchmark evaluates agents' performance after 100k in-environment steps, approximately equivalent to 2 hours of real gameplay time.
In this blog post, we will first outline our experiments learning world models in state space and why we abandoned this approach. Then, we will elaborate on our approach to behavior cloning and how we collected samples. We will then explain how our system is put together end-to-end and what steps were taken to improve on a naive BC initialization. Finally, we will discuss our results and takeaways.
We built our system around IRIS because of its relatively simple architecture. While Dreamer-v3 uses many complicated tricks and components to train its model, IRIS is primarily built out of three components: an autoencoder, a GPT-like decoder-only transformer that acts as the world model, and an actor network.
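As a rough sketch of how these pieces relate (layer sizes and internals below are simplified placeholders, not the actual IRIS code), the three components can be pictured like this:

```python
import torch.nn as nn

# Illustrative skeletons of the three IRIS components. The public implementation
# (a discrete VQ-VAE tokenizer, a GPT-style transformer, and a CNN/LSTM
# actor-critic) is considerably more involved; shapes here are placeholders.

class Tokenizer(nn.Module):
    """Autoencoder that compresses 64x64 RGB frames into a short sequence of
    discrete tokens and reconstructs frames from those tokens."""
    def __init__(self, vocab_size=512, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 6 * 6, latent_dim))
        self.codebook = nn.Embedding(vocab_size, latent_dim)     # discrete codes (VQ-style)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 64 * 3), nn.Sigmoid())    # stand-in for a conv decoder

class WorldModel(nn.Module):
    """Decoder-only transformer over interleaved frame tokens and actions that
    predicts the next frame tokens, the reward, and episode termination."""
    def __init__(self, vocab_size=512, d_model=256, num_actions=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + num_actions, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=4)  # applied with a causal mask
        self.next_token_head = nn.Linear(d_model, vocab_size)
        self.reward_head = nn.Linear(d_model, 3)                  # e.g. sign of the reward
        self.done_head = nn.Linear(d_model, 2)

class ActorCritic(nn.Module):
    """Policy (and value) network that acts on frames, both inside the world
    model's imagination and in the real environment."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(), nn.Flatten())
        self.policy_head = nn.Linear(64 * 6 * 6, num_actions)
        self.value_head = nn.Linear(64 * 6 * 6, 1)
```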
The original Atari2600 console represented each game’s state using 128 bytes in its RAM, a significantly smaller representation than pixel space observations. As an initial stage in our project, we wanted to investigate the feasibility of learning a world model using only the environment’s RAM output instead of images. By taking this approach, we could potentially cut down on the amount of compute needed and eliminate the autoencoder from IRIS’ architecture, thereby reducing the number of components that need to be trained in tandem.
Our experiments for this used RAM-only environments and matched the frame-stacking logic used in pixel-space IRIS to ensure the model has access to temporal information. We used expert rollouts in pixel space for our BC data collection and accessed the underlying ALE (Arcade Learning Environment) to collect the corresponding RAM observations.
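For reference, the RAM state can be read directly from the emulator; a minimal sketch using gymnasium and ale-py (the environment id is just an example) looks like this:

```python
import gymnasium as gym
import ale_py  # noqa: F401 -- registers the ALE/* environments
import numpy as np

# Minimal sketch of reading the 128-byte console RAM next to the usual pixel
# observations.
env = gym.make("ALE/Breakout-v5")
obs, _ = env.reset(seed=0)

ram = env.unwrapped.ale.getRAM()         # uint8 array of shape (128,)
assert ram.shape == (128,) and ram.dtype == np.uint8

# ALE can also return the RAM directly as the observation:
ram_env = gym.make("ALE/Breakout-v5", obs_type="ram")
ram_obs, _ = ram_env.reset(seed=0)       # the same 128-byte vector as the observation
```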
However, our experimental runs showed that when attempting to learn from RAM alone, we failed to learn a good world model and, consequently, a good agent.
Unfortunately, this result seems in line with other empirical findings on using Atari RAM observations when training traditional RL agents.
To build our system for BC pre-training, we need expert datasets. Since we are working on Atari environments, we used available pre-trained agents. We made this choice for simplicity; in a setting without a pre-trained expert, human demonstrations or other existing data sources could be used instead.
To be able to learn well using behavior cloning, we need to ensure that the training data covers as much of the state space as possible. We additionally need to avoid some common pitfalls, like including contradictory data (different expert actions given the same observation). And, since we intend to use this BC pre-training as initialization for downstream world model training, we need to match the shape of our collected observations to the shape of observations that IRIS uses.
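As one illustration of the contradictory-data check (a sketch rather than our exact pipeline, with hypothetical helper names), conflicting observation/action pairs can be detected by hashing observations and keeping only the majority action:

```python
from collections import Counter, defaultdict
import hashlib
import numpy as np

def filter_contradictions(dataset):
    """Drop ambiguous samples: if the same observation appears with several
    different expert actions, keep only the samples with the majority action.
    `dataset` is a list of (observation: np.ndarray, action: int) pairs."""
    def key(obs):
        return hashlib.sha1(np.ascontiguousarray(obs).tobytes()).hexdigest()

    actions_per_obs = defaultdict(Counter)
    for obs, action in dataset:
        actions_per_obs[key(obs)][action] += 1

    filtered = []
    for obs, action in dataset:
        majority_action, _ = actions_per_obs[key(obs)].most_common(1)[0]
        if action == majority_action:
            filtered.append((obs, action))
    return filtered
```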
While collecting the data, we incorporated improvements based on prior work in BC for Atari.
Unlike most current on-policy Atari agents, IRIS uses 64 x 64 observations with three color channels, which leads to additional complexity when collecting data through the gym environment.
We attempted to use experts from stable-baselines zoo, stable-baselines3 zoo, and Atari Agents before finally settling on AtariARI.
We overcame this hurdle by running two environments initialized with the same seed in parallel. We used the first environment to provide grayscale observations to the expert and collected color observations for our dataset from the second environment.
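A condensed sketch of this collection loop (using gymnasium; the wrapper settings, environment id, and expert interface are illustrative placeholders) could look like this:

```python
import gymnasium as gym
import ale_py  # noqa: F401 -- registers the ALE/* environments
from gymnasium.wrappers import AtariPreprocessing

# Paired-environment collection: both copies receive the same seed and the same
# actions (sticky actions disabled), so they stay in lockstep while exposing
# differently shaped observations.
GAME, SEED = "ALE/Breakout-v5", 1234

expert_env = AtariPreprocessing(
    gym.make(GAME, frameskip=1, repeat_action_probability=0.0),
    screen_size=84, grayscale_obs=True)        # what the pre-trained expert expects
record_env = AtariPreprocessing(
    gym.make(GAME, frameskip=1, repeat_action_probability=0.0),
    screen_size=64, grayscale_obs=False)       # 64x64 RGB frames, as IRIS expects

class RandomExpert:
    """Stand-in for a pre-trained expert policy."""
    def __init__(self, action_space):
        self.action_space = action_space
    def act(self, obs):
        return self.action_space.sample()

expert = RandomExpert(expert_env.action_space)  # replace with a real pre-trained agent

expert_obs, _ = expert_env.reset(seed=SEED)
record_obs, _ = record_env.reset(seed=SEED)

dataset, done = [], False
while not done:
    action = expert.act(expert_obs)              # grayscale view goes to the expert
    dataset.append((record_obs, action))         # color view goes into the BC dataset
    expert_obs, _, terminated, truncated, _ = expert_env.step(action)
    record_obs, *_ = record_env.step(action)     # replay the same action in the twin env
    done = terminated or truncated
```

Because both environments are seeded identically and receive identical actions, the recorded color frames correspond exactly to the states the expert acted in.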
When initially pretraining our agent using behavior cloning, we quickly reach train accuracies of up to 99%, but with significant overfitting: evaluation accuracy lags far behind. There are several possible reasons for this. Firstly, the environments are simple, but the state space is large, and despite the tricks used when collecting the train set, we may not have sufficient coverage for the model to generalize well. Secondly, there may be ambiguous observations where multiple moves are legitimate; for these, the model simply memorizes the action seen in the train set.
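For concreteness, the pre-training itself is plain supervised classification of expert actions; a condensed sketch of one training or evaluation pass (function and argument names are hypothetical, and the actor is assumed to return action logits) is:

```python
import torch
import torch.nn.functional as F

def bc_pass(actor, loader, optimizer=None):
    """One behavior-cloning pass: cross-entropy between the actor's action logits
    and the expert's actions. With optimizer=None the pass is evaluation-only,
    which is how the gap between train and evaluation accuracy can be tracked."""
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(optimizer is not None):
        for obs, expert_action in loader:        # obs: (B, 3, 64, 64), expert_action: (B,)
            logits = actor(obs)
            loss = F.cross_entropy(logits, expert_action)
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * obs.shape[0]
            correct += (logits.argmax(dim=-1) == expert_action).sum().item()
            seen += obs.shape[0]
    return total_loss / seen, correct / seen     # mean loss, accuracy
```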
We first implemented a naive version of BC initialization for IRIS. After completing our BC agent pre-training, we loaded the pre-trained agent’s weights and launched a standard IRIS training run.
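In code, this naive variant amounts to a one-time weight transfer before training starts; a sketch, assuming the BC policy mirrors the architecture of IRIS's actor-critic network (path and attribute names are placeholders for the objects in the IRIS training script):

```python
import torch

def initialize_actor_from_bc(agent, checkpoint_path="bc_actor.pt"):
    """Naive BC initialization: copy the pre-trained BC policy weights into the
    IRIS actor before the standard training run starts."""
    bc_state = torch.load(checkpoint_path, map_location="cpu")
    # strict=False: the BC checkpoint carries no trained value head, so any
    # value-function parameters keep their fresh initialization.
    missing, unexpected = agent.actor_critic.load_state_dict(bc_state, strict=False)
    return missing, unexpected
```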
However, there are additional considerations. Notably, the IRIS actor uses a value function for advantage estimation, and this value function is untrained at the start of IRIS training. The world model itself may not be performing well yet either. The combination of an untrained value function and a not-yet-reliable world model could lead to subpar gradient updates and a collapse of the pre-trained actor, similar to what we observed in our homework assignment 2.
Thus, we experimented with including an additional BC loss term based on our collected dataset as part of the actor’s policy loss. A weighting factor, alpha, determines the balance between the BC and imagination losses. Concretely, the policy loss becomes:
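$$\mathcal{L}_{\text{policy}} = (1 - \alpha)\,\mathcal{L}_{\text{imagination}} + \alpha\,\mathcal{L}_{\text{BC}},$$

where $\mathcal{L}_{\text{BC}}$ is the cross-entropy between the actor's action distribution and the expert actions in our dataset, $\mathcal{L}_{\text{imagination}}$ is the usual IRIS policy loss computed on imagined rollouts, and we write the weighting as a convex combination. The scheduling factors referenced in the results below anneal $\alpha$ over the course of training; illustrative versions (the exact constants are placeholders, not the values used in our runs) are:

```python
import math

# Illustrative schedules for the BC weight alpha.
def constant_alpha(epoch, alpha0=0.5):
    return alpha0

def cosine_alpha(epoch, alpha0=0.5, total_epochs=600):
    # decays smoothly from alpha0 to 0 over the course of training
    return alpha0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def exponential_alpha(epoch, alpha0=0.5, decay=0.99):
    return alpha0 * decay ** epoch
```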
In Table 1, we show the evaluation results of the different training configurations across environments, averaged over 64 trajectories. At 20k environment steps, the best BC configuration outperforms the IRIS baseline (No BC) in every environment. In Breakout, BC regularization with a constant scheduling factor performs best; in Demon Attack, naive BC outperforms the regularized versions; and in Ms Pacman, BC regularization with a cosine scheduling factor performs best.
| Environment | No BC | Naive BC | BC Regularization (Cosine) | BC Regularization (Exponential) | BC Regularization (Constant) |
|---|---|---|---|---|---|
| Breakout (20k steps) | 6.062 | 2.703 | 12.296 | 12.031 | 12.859 |
| Breakout (100k steps)¹ | 84 | - | - | - | - |
| Demon Attack (20k steps) | 179.453 | 769.921 | 283.20 | 146.718 | 184.921 |
| Demon Attack (100k steps)¹ | 2034 | - | - | - | - |
| Ms Pacman (20k steps) | 370.156 | 445.0 | 729.062 | 378.437 | 544.843 |
| Ms Pacman (100k steps)¹ | 999 | - | - | - | - |
Table 1: Evaluation results of the different training configurations across environments, averaged over 64 trajectories. Rows marked ¹ list IRIS performance after 100k environment steps for reference.
Figures 6–8 below compare the No BC, Naive BC, Cosine, Exponential, and Constant configurations.
Figure 6: Evaluation returns during training in the Breakout environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Figure 7: Evaluation returns during training in the Demon Attack environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Figure 8: Evaluation returns during training in the Ms Pacman environment. Returns are plotted from epoch 50 onward, since actor training begins at epoch 50.
Based on the results, we find that initializing the actor via behavior cloning before world model training can have a significant positive effect. In all three environments, we recorded improvements over the IRIS baseline while retaining the in-environment sample efficiency of both approaches. This holds for nearly all experiments using BC, though there is significant variability between runs.
Even though we only train for 20k in-environment steps, roughly equivalent to 24 minutes of real gameplay time and a fraction of the training time (one day vs. seven days), our agent makes substantial headway towards the IRIS 100k benchmark on Ms Pacman and progresses considerably faster than our own IRIS baseline in Demon Attack.
However, the training results do not clearly indicate whether adding a BC term to the policy loss (BC regularization) makes a significant difference: naive BC outperforms the regularized versions on Demon Attack, whereas regularization with a cosine-decayed factor performs best on Ms Pacman.
Furthermore, since we could not train our approach for a duration comparable to the IRIS 100k benchmark (7 days), we cannot conclude whether it leads to higher final performance or simply reaches a plateau sooner.
While our results are positive, there are several caveats. Firstly, modern approaches to behavior cloning often use architectures such as VAEs or Behavior Transformers (BeT) to capture multimodal expert data more faithfully. By contrast, most world-model approaches use simpler policy networks for on-policy learning, which limits how well we can capture the expert data.
World models also tend to rely on very specific tricks to work. In the case of IRIS, many context frames and resized, full-color observations are used, which sets it apart from most other Atari agents. Since we generated our own data, we could match these requirements. Still, in many scenarios where BC is applicable, we have to use whatever data is already available, which makes downstream world model training more difficult.
Finally, each environment has its own quirks, meaning our approach needs to be adjusted for every new environment we want to learn.