I had a lot of fun competing in the Unity Obstacle Tower Challenge. I was at the top of the leaderboard for the majority of the competition, and for the entirety of Round 2. By the end of the competition, my agent was ranked at an average floor of 19.4, greater than the human baseline (15.6) in Juliani et al., and greater than my own personal average performance. This submission outranked all of the other submissions, by a very large margin in most cases.
So how did I do it? The simple answer is that I used human demonstrations in a clever way. There were a handful of other tricks involved as well, and this post will briefly touch on all of them. But first, I want to take a step back and describe how I arrived at my final solution.
Before I looked at the Obstacle Tower environment itself, I assumed that generalization would be the main bottleneck of the competition. This assumption mostly stemmed from my experience creating baselines for the OpenAI Retro Contest, where every model generalized terribly. As such, I started by trying a few primitive solutions that would inject as little information into the model as possible. These solutions included:
- Evolving a policy with CMA-ES
- Using PPO to train a policy that looked at tiny observations (e.g. 5×5 images)
- Using CEM to learn an open-loop action distribution that maximized rewards.
However, none of these solutions reached the fifth floor, and I quickly realized that a PPO baseline did better. Once I started tuning PPO, it quickly became clear that generalization was not the actual bottleneck. It turned out that the 100 training seeds were enough for standard RL algorithms to generalize fairly well. So, instead of focusing on generalization, I simply aimed to make progress on the training set (with a few exceptions, e.g. data augmentation).
My early PPO implementation, which was based on anyrl-py, was hitting a wall at the 10th floor of the environment. It never passed the 10th floor, even by chance, indicating that the environment posed too much of an exploration problem for standard RL. This was when I decided to take a closer look at the environment to see what was going on. It turned out that the 10th floor marked the introduction of the Sokoban puzzle, where the agent must push a block across a room to a square target marked on the floor. This involves taking a consistent set of actions for several seconds (on the order of ~50 timesteps). So well for traditional RL.
At this point, other researchers might have tried something like Curiosity-driven Exploration or Go-Explore. I didn’t even give these methods the time of day. As far as I can tell, these methods all have a strong inductive bias towards visually simple (often 2-dimensional) environments. Exploration is extremely easy with visually simple observations, and even simple image similarity metrics can be used with great success in these environments. On Obstacle Tower, however, observations depend completely on what the camera is pointed at, where the agent is standing in a room, etc. The agent can see two totally different images while standing in the same spot, and it can see two very similar images while standing in two totally different rooms. Moreover, the first instant of pushing a box for the Sokoban puzzle looks very similar to the final moment of pushing the same box. My hypothesis, then, was that traditional exploration algorithms would not be very effective in Obstacle Tower.
If popular exploration algorithms are out, how do we make the agent solve the Sokoban puzzle? There are two approaches I would typically try here: evolutionary algorithms, which explore in parameter space rather than algorithm space, and human demonstrations, which bypass the problem of exploration altogether. With my limited compute capabilities (one machine with one GPU), I decided that human demonstrations would be more practical, since evolution typically burns through a lot of compute to train neural networks on games.
To start myself off with human demonstrations, I created a simple tool to record myself playing Obstacle Tower. After recording a few games, I used behavior cloning (supervised learning) to fit a policy to my demonstrations. Behavior cloning started overfitting very quickly, so I stopped training early and evaluated the resulting policy. It was terrible, but it did perform better than a random agent. I tried fine-tuning this policy with PPO, and was pleased to see that it learned faster than a policy trained from scratch. However, it did not solve the Sokoban puzzle.
Fast-forward a bit, and behavior cloning + fine-tuning still hadn’t broken through the Sokoban puzzle, even with many more demonstrations. At around this time, I rewrote my code in PyTorch so that I could try other imitation learning algorithms more easily. And while algorithms like GAIL did start to push boxes around, I hadn’t seen them reliably solve the 10th floor. I realized that the problem might involve memory, since the agent would often run around in circles doing redundant things, and it had no way of remembering if it had just seen a box or a target.
So, how did I fix the agent’s memory problem? In my experience, recurrent neural networks in RL often don’t remember what you want them to, and they take a huge number of samples to learn to remember anything useful at all. So, instead of using a recurrent neural network to help my agent remember the past, I created a state representation that I could stack up for the past 50 timesteps and then feed to my agent as part of its input. Originally, the state representation was a tuple of (action, reward, has key) values. Even with this simple state representation, behavior cloning worked way better (the test loss reached a much lower point), and the cloned agent had a much better initial score. But I didn’t stop there, because the state representation still said nothing about boxes or targets.
To help the agent remember things that could be useful for solving the Sokoban puzzle, I trained a classifier to identify common objects like boxes, doors, box targets, keys, etc. I then added these classification outputs to the state tuple. This improved behavior cloning even more, and I started to see the behavior cloned agent solve the Sokoban puzzle fairly regularly.
Despite the agent’s improved memory, behavior cloning + fine-tuning was still failing to solve the Sokoban puzzle, and GAIL wasn’t much of an improvement. It seemed that, by the time the agent started reaching the 10th floor, it had totally forgotten how to push boxes! In my experience, this kind of forgetting in RL is often caused by the entropy bonus, which encourages the agent to take random actions as much as possible. This bonus tends to destroy parts of an agent’s pre-trained behavior that do not yield noticeable rewards right away.
This was about the time that prierarchy came in. In addition to my observation about the entropy bonus destroying the agent’s behavior, I also noticed that the behavior cloned agent took reasonable low-level actions, but it did so in ways that were unreasonable in a high-level context. For example, it might push a box all the way to the corner of a room, but it might be the wrong corner. Instead of using an entropy bonus, I wanted a bonus that would keep these low-level actions in tact, while allowing the agent to solve the high-level problems that it was struggling with. This is when I implemented the KL term that makes prierarchy what it is, with the behavior cloned policy as the prior.
Once prierarchy was in place, things were pretty much smooth sailing. At this point, it was mostly a matter of recording some more demonstrations and training the agent for longer (on the order of 500M timesteps). However, there were still a few other tricks that I used to improve performance:
- My actual submissions consisted of two agents. The first one solved floors 1-9, and the second one solved floors 10 and onward. The latter agent was trained with starting floors sampled randomly between 10 and 15. This forced it to learn to solve the Sokoban puzzle immediately, rather than perfecting floors 1-9 first.
- I used a reduced action space, mostly because I found that it made it easier for me to play the game as a human.
- My models were based on the CNN architecture from the IMPALA paper. In my experience with RL on video games, this architecture learns and generalizes better than the architecture used in the original Nature article.
- I used Fixup initialization to help train deeper models.
- I used MixMatch to train the state classifier with fewer labeled examples than I would have needed otherwise.
- For behavior cloning, I used traditional types of image data augmentation. However, I also used a mirroring data augmentation where images and actions were mirrored together. This way, I could effectively double the number of training levels, since every level came with its mirror image as well.
- During prierarchy training, I applied data augmentation to the Obstacle Tower environment to help with overfitting. I never actually verified that this was necessary, and it might not have been, but other contestants definitely struggled with overfitting more than I did.
- I added a small reward bonus for picking up time orbs. It’s unclear how much of an effect this had, since the agent still missed most of the time orbs. This is one area where improvement would definitely result in a better agent.
I basically checked out for the majority of Round 2: I stopped actively working on the contest, and for a lot of the time I wasn’t even training anything for it. Near the end of the contest, when other contestants started solving the Sokoban puzzle, I trained my agent a little bit more and submitted the new version, but it turned out not to have been necessary.
My code can be found on Github. I do not intend on changing the repository much at this point, since I want it to remain a reflection of the solution described in this post.