Competing in the Obstacle Tower Challenge

I had a lot of fun competing in the Unity Obstacle Tower Challenge. I was at the top of the leaderboard for the majority of the competition, and for the entirety of Round 2. By the end of the competition, my agent was ranked at an average floor of 19.4, greater than the human baseline (15.6) reported in Juliani et al., and greater than my own personal average performance. This submission outranked every other submission, in most cases by a very large margin.

So how did I do it? The simple answer is that I used human demonstrations in a clever way. There were a handful of other tricks involved as well, and this post will briefly touch on all of them. But first, I want to take a step back and describe how I arrived at my final solution.

Before I looked at the Obstacle Tower environment itself, I assumed that generalization would be the main bottleneck of the competition. This assumption mostly stemmed from my experience creating baselines for the OpenAI Retro Contest, where every model generalized terribly. As such, I started by trying a few primitive solutions that would inject as little information into the model as possible. These solutions included:

  • Evolving a policy with CMA-ES
  • Using PPO to train a policy that looked at tiny observations (e.g. 5×5 images)
  • Using CEM to learn an open-loop action distribution that maximized rewards.

However, none of these solutions reached the fifth floor, and I quickly realized that a PPO baseline did better. Once I started tuning PPO, it quickly became clear that generalization was not the actual bottleneck. It turned out that the 100 training seeds were enough for standard RL algorithms to generalize fairly well. So, instead of focusing on generalization, I simply aimed to make progress on the training set (with a few exceptions, e.g. data augmentation).

My early PPO implementation, which was based on anyrl-py, was hitting a wall at the 10th floor of the environment. It never passed the 10th floor, even by chance, indicating that the environment posed too much of an exploration problem for standard RL. This was when I decided to take a closer look at the environment to see what was going on. It turned out that the 10th floor marked the introduction of the Sokoban puzzle, where the agent must push a block across a room to a square target marked on the floor. This involves taking a consistent sequence of actions for several seconds (on the order of ~50 timesteps). So much for traditional RL.

At this point, other researchers might have tried something like Curiosity-driven Exploration or Go-Explore. I didn’t even give these methods the time of day. As far as I can tell, these methods all have a strong inductive bias towards visually simple (often 2-dimensional) environments. Exploration is extremely easy with visually simple observations, and even simple image similarity metrics can be used with great success in these environments. On Obstacle Tower, however, observations depend completely on what the camera is pointed at, where the agent is standing in a room, etc. The agent can see two totally different images while standing in the same spot, and it can see two very similar images while standing in two totally different rooms. Moreover, the first instant of pushing a box for the Sokoban puzzle looks very similar to the final moment of pushing the same box. My hypothesis, then, was that traditional exploration algorithms would not be very effective in Obstacle Tower.

If popular exploration algorithms are out, how do we make the agent solve the Sokoban puzzle? There are two approaches I would typically try here: evolutionary algorithms, which explore in parameter space rather than action space, and human demonstrations, which bypass the problem of exploration altogether. With my limited compute capabilities (one machine with one GPU), I decided that human demonstrations would be more practical, since evolution typically burns through a lot of compute to train neural networks on games.

To start myself off with human demonstrations, I created a simple tool to record myself playing Obstacle Tower. After recording a few games, I used behavior cloning (supervised learning) to fit a policy to my demonstrations. Behavior cloning started overfitting very quickly, so I stopped training early and evaluated the resulting policy. It was terrible, but it did perform better than a random agent. I tried fine-tuning this policy with PPO, and was pleased to see that it learned faster than a policy trained from scratch. However, it did not solve the Sokoban puzzle.

Fast-forward a bit, and behavior cloning + fine-tuning still hadn’t broken through the Sokoban puzzle, even with many more demonstrations. At around this time, I rewrote my code in PyTorch so that I could try other imitation learning algorithms more easily. And while algorithms like GAIL did start to push boxes around, I hadn’t seen them reliably solve the 10th floor. I realized that the problem might involve memory, since the agent would often run around in circles doing redundant things, and it had no way of remembering if it had just seen a box or a target.

So, how did I fix the agent’s memory problem? In my experience, recurrent neural networks in RL often don’t remember what you want them to, and they take a huge number of samples to learn to remember anything useful at all. So, instead of using a recurrent neural network to help my agent remember the past, I created a state representation that I could stack up for the past 50 timesteps and then feed to my agent as part of its input. Originally, the state representation was a tuple of (action, reward, has key) values. Even with this simple state representation, behavior cloning worked way better (the test loss reached a much lower point), and the cloned agent had a much better initial score. But I didn’t stop there, because the state representation still said nothing about boxes or targets.
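
To make this concrete, here is a rough sketch of what such a stacked history of (action, reward, has key) tuples might look like. This is not my actual code; the class and field names are made up, and the real feature layout differed, but the idea is the same: keep a fixed-size rolling window and flatten it into a vector that gets fed to the agent alongside the image observation.

import numpy as np
from collections import deque

HISTORY_LEN = 50  # how many past timesteps to remember

class StateHistory:
    """Rolling window of (action, reward, has_key) tuples, flattened into a
    fixed-size feature vector for the policy."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.entry_size = num_actions + 2  # one-hot action + reward + has_key
        self.buffer = deque(maxlen=HISTORY_LEN)
        self.reset()

    def reset(self):
        self.buffer.clear()
        for _ in range(HISTORY_LEN):
            self.buffer.append(np.zeros(self.entry_size, dtype=np.float32))

    def append(self, action, reward, has_key):
        entry = np.zeros(self.entry_size, dtype=np.float32)
        entry[action] = 1.0                    # one-hot encoding of the last action
        entry[self.num_actions] = reward       # last reward
        entry[self.num_actions + 1] = float(has_key)
        self.buffer.append(entry)

    def features(self):
        # Flat vector of shape (HISTORY_LEN * entry_size,), appended to the
        # observation before it reaches the agent.
        return np.concatenate(list(self.buffer))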

To help the agent remember things that could be useful for solving the Sokoban puzzle, I trained a classifier to identify common objects like boxes, doors, box targets, keys, etc. I then added these classification outputs to the state tuple. This improved behavior cloning even more, and I started to see the behavior cloned agent solve the Sokoban puzzle fairly regularly.

Despite the agent’s improved memory, behavior cloning + fine-tuning was still failing to solve the Sokoban puzzle, and GAIL wasn’t much of an improvement. It seemed that, by the time the agent started reaching the 10th floor, it had totally forgotten how to push boxes! In my experience, this kind of forgetting in RL is often caused by the entropy bonus, which encourages the agent to take random actions as much as possible. This bonus tends to destroy parts of an agent’s pre-trained behavior that do not yield noticeable rewards right away.

This was about the time that prierarchy came in. In addition to my observation about the entropy bonus destroying the agent’s behavior, I also noticed that the behavior cloned agent took reasonable low-level actions, but it did so in ways that were unreasonable in a high-level context. For example, it might push a box all the way to the corner of a room, but it might be the wrong corner. Instead of using an entropy bonus, I wanted a bonus that would keep these low-level actions intact, while allowing the agent to solve the high-level problems that it was struggling with. This is when I implemented the KL term that makes prierarchy what it is, with the behavior cloned policy as the prior.

Once prierarchy was in place, things were pretty much smooth sailing. At this point, it was mostly a matter of recording some more demonstrations and training the agent for longer (on the order of 500M timesteps). However, there were still a few other tricks that I used to improve performance:

  • My actual submissions consisted of two agents. The first one solved floors 1-9, and the second one solved floors 10 and onward. The latter agent was trained with starting floors sampled randomly between 10 and 15. This forced it to learn to solve the Sokoban puzzle immediately, rather than perfecting floors 1-9 first.
  • I used a reduced action space, mostly because I found that it made it easier for me to play the game as a human.
  • My models were based on the CNN architecture from the IMPALA paper. In my experience with RL on video games, this architecture learns and generalizes better than the architecture used in the original Nature DQN paper.
  • I used Fixup initialization to help train deeper models.
  • I used MixMatch to train the state classifier with fewer labeled examples than I would have needed otherwise.
  • For behavior cloning, I used standard kinds of image data augmentation. However, I also used a mirroring data augmentation where images and actions were mirrored together (see the sketch after this list). This way, I could effectively double the number of training levels, since every level came with its mirror image as well.
  • During prierarchy training, I applied data augmentation to the Obstacle Tower environment to help with overfitting. I never actually verified that this was necessary, and it might not have been, but other contestants definitely struggled with overfitting more than I did.
  • I added a small reward bonus for picking up time orbs. It’s unclear how much of an effect this had, since the agent still missed most of the time orbs. This is one area where improvement would definitely result in a better agent.
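
Here is a rough sketch of the mirroring augmentation mentioned above. The action names are placeholders (my real reduced action space was different), but the idea is the same: flip the observation along its width and swap the left/right actions so the pair stays consistent.

import numpy as np

# Hypothetical action indices; the real reduced action space differed.
LEFT, RIGHT, FORWARD, CAM_LEFT, CAM_RIGHT, JUMP = range(6)
MIRROR_ACTION = {
    LEFT: RIGHT, RIGHT: LEFT,
    CAM_LEFT: CAM_RIGHT, CAM_RIGHT: CAM_LEFT,
    FORWARD: FORWARD, JUMP: JUMP,
}

def mirror_sample(image, action):
    """Flip an (H, W, C) observation left-to-right and swap the matching
    actions, turning one demonstration frame into two training examples."""
    flipped = image[:, ::-1].copy()
    return flipped, MIRROR_ACTION[action]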

I basically checked out for the majority of Round 2: I stopped actively working on the contest, and for a lot of the time I wasn’t even training anything for it. Near the end of the contest, when other contestants started solving the Sokoban puzzle, I trained my agent a little bit more and submitted the new version, but it turned out not to have been necessary.

My code can be found on Github. I do not intend to change the repository much at this point, since I want it to remain a reflection of the solution described in this post.

Prierarchy: Implicit Hierarchies

At first blush, hierarchical reinforcement learning (HRL) seems like the holy grail of artificial intelligence. Ideally, it performs long-term exploration, learns over long time horizons, and has many other benefits. Furthermore, HRL makes it possible to reason about an agent’s behavior in terms of high-level symbols. However, in my opinion, pure HRL is too rigid and doesn’t capture how humans actually operate hierarchically.

Today, I’ll propose an idea that I call “the prierarchy” (a combination of “prior” and “hierarchy”). This idea is an alternative way to look at HRL, without the need to define rigid action boundaries or discrete levels of hierarchy. I am using the prierarchy in the Unity Obstacle Tower Challenge, so I will keep a few details close to my chest for now. However, I intend to update this post once the contest is completed.

Defining the Prierarchy

To understand the prierarchy framework, let’s consider character-level language models. When you run a language model on a sentence, there are some parts of the sentence where the model is very certain what the next token should be (low-entropy), and other parts where it is very uncertain (high-entropy). We can think of this as follows: the high-entropy parts of the sentence mark the starts of high-level actions, while the low-entropy parts represent the execution of those high-level actions. For example, the model likely has more entropy at the starts of words or phrases, while it has less entropy near the ends of words (especially long, unique words). As a consequence of this view, we can say that prompting a language model is the same thing as starting a high-level action and seeing how the model executes this high-level action.
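
To make this concrete, here is a tiny sketch of how you could measure that entropy, assuming a character-level model that gives you logits at every position (the function is illustrative, not from any particular library):

import torch
import torch.nn.functional as F

def per_token_entropy(logits):
    """Entropy of the next-character distribution at every position.

    logits: tensor of shape (seq_len, vocab_size) from a character-level LM.
    Returns a (seq_len,) tensor; spikes mark the positions where the model is
    uncertain, which the prierarchy view treats as the starts of high-level
    actions."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)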

Under the prierarchy framework, we initiate “high-level actions” by taking a sequence of low-probability, low-level actions which kick off a long sequence of high-probability, low-level actions. This assumes that we have some kind of prior distribution over low-level actions, possibly conditioned on observations from the environment. This prior basically encodes the lower-levels of the hierarchy that we want to control with a high-level policy.

The nice thing about the prierarchy is that we never have to make any specific assumptions about action boundaries, since the “high-level actions” aren’t real or concrete; they are just our interpretation of a sequence of continuous entropy values. This means, for one thing, that no low- or high-level policy has to worry about stopping conditions.

Practical Application

In order to implement the prierarchy, you first need a prior. A prior, in this case, is a distribution over low-level actions conditioned on previous observations. This is exactly a “policy” in RL. Thus, you can get a prior in any way you can get a policy: you can train a prior on a different (perhaps denser but misaligned) reward function; you can train a prior via behavior cloning; you can even hand-craft a prior using some kind of approximate solution to your problem. Whichever approach you take, you should make sure that the prior is capable of performing useful behaviors, even if these behaviors occur in the wrong order, for the wrong duration of time, etc.

Once you have a prior, you can train a controller policy fairly easily. First, initialize the policy with the prior, and then run a policy gradient algorithm with a KL regularizer KL(policy || prior) instead of an entropy bonus. This algorithm is basically saying: “do what the prior says, and inject information into it when you need to in order to achieve rewards”. Note that, if the prior is the uniform distribution over actions, then this is exactly equivalent to traditional policy gradient algorithms.
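
For discrete actions, the objective looks roughly like the sketch below. This is not my competition code; the names and the coefficient are placeholders, and all of the surrounding machinery (clipping, value function, advantage estimation) is omitted.

import torch
import torch.nn.functional as F

def prierarchy_loss(policy_logits, prior_logits, actions, advantages, kl_coeff=0.01):
    """Vanilla policy gradient loss with a KL(policy || prior) penalty in
    place of the usual entropy bonus.

    policy_logits, prior_logits: (batch, num_actions). The prior is the
    frozen behavior-cloned policy, so no gradient flows through it."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    prior_logp = F.log_softmax(prior_logits.detach(), dim=-1)

    # Policy gradient term: push up the log-probability of actions with
    # positive advantage.
    taken_logp = policy_logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * taken_logp).mean()

    # KL(policy || prior), computed exactly from the two categorical
    # distributions.
    kl = (policy_logp.exp() * (policy_logp - prior_logp)).sum(dim=-1).mean()

    return pg_loss + kl_coeff * kl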

If the prior is low entropy, then you should be able to significantly increase the discount factor. This is because the noise of policy gradient algorithms scales with the amount of information that is injected into the trajectories by sampling from the policy. Also, if you select a reasonable prior, then the prior is likely to explore much better than a random agent. This can be useful in very sparse-reward environments.

Let’s look at a simple (if not contrived) example. In RL, frame-skip is used to help agents take more consistent actions and thus explore better. It also helps for credit assignment, since the policy makes fewer decisions per span of time in the environment. It should be clear that frame-skip defines a prior distribution over action sequences, where every Nth action completely determines the next N-1 actions. My claim, which I won’t demonstrate today, is that you can achieve a similar effect by pre-training a recurrent policy to mimic the frame-skip action distribution, and then applying a policy gradient algorithm to this policy with an appropriately modified discount factor (e.g. 0.99 might become 0.99^0.25) and a KL regularizer against the frame-skip prior. Of course, it would be silly to do this in practice, since the prior is so easy to bake into the environment.

There are tons of other potential applications of this idea. For example, you could learn a better action distribution via an evolutionary algorithm like CMA-ES, and then use this as the prior. This way, the long-term exploration benefits of Evolution Strategies could be combined with the short-term exploitation benefits of policy gradient methods. One could also imagine learning a policy that controls a language model prior to write popular tweets, or one that controls a self-driving car prior for the purposes of navigation.

Conclusion

The main benefit of HRL is really just that it compresses sequences of low-level actions, resulting in better exploration and less noise in the high-level policy gradient. That same compression can be achieved with a good low-level prior, without the need to define explicit action boundaries.

As a final note, the prierarchy training algorithm looks almost identical to pre-training + fine-tuning, indicating that it’s not a very special idea at all. Nonetheless, it seems like a powerful one, and perhaps it will save us from wasting a lot of time on HRL.

Solving murder with Go

I recently saw a blog post entitled Solving murder with Prolog. It was a good read, and I recommend it. One interesting aspect of the solution is that it reads somewhat like the original problem. Essentially, the author translated the clues into Prolog, and got a solution out automatically.

This is actually an essential skill for programmers in general: good code should be readable, almost like plain English—and not just if you’re programming in Prolog. So today, I want to re-solve this problem in Go. As I go, I’ll explain how I optimized for readability. For the eager, here is the final solution.

Constants are very useful for readability. Instead of having a magic number sitting around, try to use an intelligible word or phrase! Let’s create some constants for rooms, people, and weapons:

const (
	Bathroom = iota
	DiningRoom
	Kitchen
	LivingRoom
	Pantry
	Study
)

const (
	George = iota
	John
	Robert
	Barbara
	Christine
	Yolanda
)

const (
	Bag = iota
	Firearm
	Gas
	Knife
	Poison
	Rope
)

The way I am solving the puzzle is as follows: we generate each possible scenario, check if it meets all the clues, and if so, print it out. To do this, we can store the scenario in a structure. Here’s how I chose to represent this structure, which I call Configuration:

type Configuration struct {
	People  []int
	Weapons []int
}

Here, we represent the solution by storing the person and weapon for each room. For example, to see the person in the living room, we check cfg.People[LivingRoom]. To see the weapon in the living room, we do cfg.Weapons[LivingRoom]. This will make it easy to implement clues in a readable fashion. For example, to verify the final clue, we simply check that cfg.Weapons[Pantry] == Gas, which reads just like the original clue.

Next, we need some way to iterate over candidate solutions. Really, we just want to iterate over possible permutations of people and weapons. Luckily, I have already made a library function to generate all permutations of N indices. Using this API, we can enumerate all configurations like so:

for people := range approb.Perms(6) {
	for weapons := range approb.Perms(6) {
		cfg := &Configuration{people, weapons}
		// Check configuration here.
	}
}

Note how this reads exactly like what we really want to do. We are looping through permutations of people and weapons, and we can see that easily. This is because we moved all the complex permutation-generation logic into its own function; it’s abstracted away behind a well-named function.

Now we just have to filter the configurations using all the available clues. We can make a function that checks if each clue is satisfied, giving us this main loop:

for people := range approb.Perms(6) {
	for weapons := range approb.Perms(6) {
		cfg := &Configuration{people, weapons}
		if !clue1(cfg) || !clue2(cfg) || !clue3(cfg) ||
			!clue4(cfg) || !clue5(cfg) ||
			!clue6(cfg) || !clue7(cfg) ||
			!clue8(cfg) || !clue9(cfg) {
			continue
		}
		fmt.Println("killer is:", cfg.People[Pantry])
	}
}

Next, before we implement the actual clues, let’s implement some helper functions. Well-named helper functions greatly improve readability. Note in particular the isMan helper. It’s very clear what it does, and it abstracts away an arbitrary-seeming 3.

func isMan(person int) bool {
	return person < 3
}

func indexOf(l []int, x int) int {
	for i, y := range l {
		if y == x {
			return i
		}
	}
	panic("unreachable")
}

Finally, we can implement all the clues in a human-readable fashion. Here is what they look like:

func clue1(cfg *Configuration) bool {
	return isMan(cfg.People[Kitchen]) && (cfg.Weapons[Kitchen] == Knife || cfg.Weapons[Kitchen] == Gas)
}

func clue2(cfg *Configuration) bool {
	if cfg.People[Study] == Barbara {
		return cfg.People[Bathroom] == Yolanda
	} else if cfg.People[Study] == Yolanda {
		return cfg.People[Bathroom] == Barbara
	} else {
		return false
	}
}

func clue3(cfg *Configuration) bool {
	bagRoom := indexOf(cfg.Weapons, Bag)
	if cfg.People[bagRoom] == Barbara || cfg.People[bagRoom] == George {
		return false
	}
	return bagRoom != Bathroom && bagRoom != DiningRoom
}

func clue4(cfg *Configuration) bool {
	if cfg.Weapons[Study] != Rope {
		return false
	}
	return !isMan(cfg.People[Study])
}

func clue5(cfg *Configuration) bool {
	return cfg.People[LivingRoom] == John || cfg.People[LivingRoom] == George
}

func clue6(cfg *Configuration) bool {
	return cfg.Weapons[DiningRoom] != Knife
}

func clue7(cfg *Configuration) bool {
	return cfg.People[Study] != Yolanda && cfg.People[Pantry] != Yolanda
}

func clue8(cfg *Configuration) bool {
	firearmRoom := indexOf(cfg.Weapons, Firearm)
	return cfg.People[firearmRoom] == George
}

func clue9(cfg *Configuration) bool {
	return cfg.Weapons[Pantry] == Gas
}

Most of these clues read almost like the original. I have to admit that the indexOf logic isn’t perfect. Let me know how/if you’d do it better!

What I Don’t Know

Starting this holiday season, I want to take some time every day to focus on a broader set of topics than just machine learning. While ML is extremely valuable, working on it day after day has given me tunnel vision. I think it’s important to remind myself that there’s more out there in the world of technology.

This post is going to be an aspirational one. I started by listing a ton of things I’m embarrassed to know nothing about. Then, in the spirit of self-improvement, I came up with a list of project ideas for each topic. I find that I learn best by doing, so I hope these projects will help me master areas where I have little or no prior experience. And who knows, maybe others will benefit from this list as well!

Networking

My knowledge of networking is severely lacking. First of all, I have no idea how the OS network stack works on either Linux or macOS. Also, I’ve never configured a complex network (e.g. for a datacenter), so I have a limited understanding of how routing works.

I am so ignorant when it comes to networking that I often struggle to formulate questions about it. Hopefully, my questions and project ideas actually make sense and turn out to be feasible.

Questions:

  • What APIs does your OS provide to intercept or manipulate network traffic?
  • What actually happens when you connect to a WiFi network?
  • What is a network interface? How is traffic routed through network interfaces?
  • How do VPNs work internally (both on the server and on the client)?
  • How flexible are Linux networking primitives?
  • How does iptables work on Linux? What does it actually do?
  • How do NATs deal with different kinds of traffic (e.g. ICMP)?
  • How does DNS work? How do DNS records propagate? How do custom nameservers (e.g. with NS records) work? How does something like iodine work?

Projects:

  • Re-implement something like Little Snitch.
  • Try using the Berkeley Packet Filter (or some other API) to make a live bandwidth monitor.
  • Implement a packet-level WiFi client that gets all the way up to being able to make DNS queries. I started this with gofi and wifistack, but never finished.
  • Implement a program that exposes a fake LAN with a fake web server on some fake IP address. This will involve writing your own network stack.
  • Implement a user-space NAT that exposes a fake “gateway” through a tunnel interface.
  • Re-implement iodine in a way that parallelizes packet transmission to be faster on satellite internet connections (like on an airplane).
  • Re-implement something like ifconfig using system calls.
  • Connect your computer to both Ethernet and WiFi, and try to write a program that parallelizes an HTTP download over both interfaces simultaneously.
  • Write a script to DoS a VPN server by allocating a ton of IP addresses.
  • Write a simple VPN-like protocol and make a server/client for it.
  • Try to set something up where you can create a new Docker container and assign it its own IP address from a VPN.
  • Try to implement a simple firewall program that hooks into the same level of the network stack as iptables.
  • Implement a fake DNS server and set up your router to use it. The DNS server could forward most requests to a real DNS server, but provide fake addresses for specific domains of your choosing. This would be fun for silly pranks, or for logging domains people visit.
  • Try to bypass WiFi paywalls at hotels, airports, etc.

Cryptocurrency

Cryptocurrencies are extremely popular right now. So, as a tech nerd, I feel kind of lame knowing nothing about them. Maybe I should fix that!

Questions:

  • How do cryptocurrencies actually work?
  • What kinds of network protocols do cryptocurrencies use?
  • What does it mean that Ethereum is a distributed virtual machine?
  • What computations are actually involved in mining cryptocurrencies?

Projects:

  • Implement a toy cryptocurrency.
  • Write a script that, without using high-level APIs, transfers some cryptocurrency (e.g. Bitcoin) from one wallet to another.
  • Write a small program (e.g. “Hello World”) that runs on the Ethereum VM. I honestly don’t even know if this is possible.
  • Try writing a Bitcoin mining program from scratch.

Source Control

Before OpenAI, I worked mostly on solitary projects. As a result, I only used a small subset of the features offered by source control tools like Git. I never had to deal with complex merge conflicts, rebases, etc.

Questions:

  • What are some complicated use-cases for git rebase?
  • How do code reviews typically work on large open source projects?
  • What protocol does git use for remotes? Is a Github repository just a .git directory on a server, or is there more to it than that?
  • What are some common/useful git commands besides git push, git pull, git add, git commit, git merge, git remote, git checkout, git branch, and git rebase? Also, what are some unusual/useful flags for the aforementioned commands?
  • How do you actually set up an editor to work with git add -p?
  • How does git store data internally?

Projects:

  • Try to write a script that uses sockets/HTTPS/SSH to push code to a Github repo. Don’t use any local git commands or APIs.
  • On your own repos, intentionally get yourself into source control messes that you have to figure your way out of.
  • Submit more pull requests to big open source projects.
  • Read a ton of git man pages.
  • Write a program from scratch that converts tarballs to .git directories. The .git directory would represent a repository with a single commit that adds all the files from the tarball.

Machine Learning

Even though I work in ML, I sometimes forget to keep tabs on the field as a whole. Here’s some stuff that I feel I should brush up on:

Questions:

  • How do SOTA object detection systems work?
  • How do OCR systems deal with variable-length strings in images?
  • How do neural style transfer and CycleGAN actually work?

Projects:

  • Make a program that puts boxes around people’s noses in movies.
  • Make a captcha cracker.
  • Make a screenshot-to-text system (easy to generate training data!).
  • Try to make something that takes MNIST digits and makes them look like SVHN digits.

Phones

For something so ubiquitous, phones are still a mystery to me. I’m not sure how easy it is to learn about phones as a user hacking away in his apartment, but I can always try!

Questions:

  • What does the SMS protocol actually look like? How is the data transmitted? Why do different carriers seem to have different character limits?
  • How does modern telephony work? How are calls routed?
  • Why is it so easy to spoof a phone number, and how does one do it?
  • How are Android apps packaged, and how easy is it to reverse engineer them?

Projects:

  • Try reverse engineering SMS apps on your phone. Figure out what APIs deal with SMS messages, and how messages get from the UI all the way to the cellular antenna. Try to encode and send an SMS message from as low a level as possible.
  • Get a job at Verizon/AT&T/Sprint/T-Mobile. This is probably not worth it, but telephony is one of those topics that seem pretty hard to learn about from the outside.

Misc. Tools

I don’t take full advantage of many of the applications I use. I could probably get a decent productivity boost by simply learning more about these tools.

Questions:

  • How do you use ViM macros?
  • What useful keyboard shortcuts does Atom provide?
  • How do custom go get URLs work?
  • How are man pages formatted, and how can I write a good man page?

Projects:

  • Use a ViM macro to un-indent a chunk of lines by one space.
  • For a day, don’t let yourself use the mouse at all in your text editor. Look up keyboard shortcuts all you want.
  • For a day, use your editor for all development (including running your code). Feasible with e.g. Hydrogen.
  • Make a go get service that allows for semantic versioning wildcards (e.g. go get myservice.org/github.com/unixpickle/somelib/0.1.x).
  • Write man pages for some existing open source projects. I bet everybody will thank you.

Adversarial Train/Test Splits

The MNIST dataset is split into 60K training images and 10K test images. Everyone pretty much assumes that the training images are enough to learn how to classify the test images, but this is not necessarily the case. So today, in the theme of being a contrarian, I’ll show you how to split the 70K MNIST images in an adversarial manner, such that models trained on the adversarial training set perform poorly on the adversarial test set.

Class imbalance is the most obvious way to make an adversarial split. If you put all the 2’s in the test set, then a model trained on the training set will miss all those 2’s. However, since this answer is fairly boring and doesn’t say very much, let’s assume for the rest of this post that we want class balance. In other words, we want the test set to contain 1K examples of each digit. This also means that, just by looking at the labels, you won’t be able to tell that the split is adversarial.

Here’s another idea: we can train a model on all 70K digits, then get the test set by picking the digits that have the worst loss. What we end up with is a test set that contains the “worst” 1K examples of each class. It turns out that this approach gives good results, even with fairly simple models.
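
Here is a rough sketch of the selection step, assuming you already have a per-example loss from a model trained on all 70K images. The function and argument names are mine, not the actual code:

import numpy as np

def adversarial_split(losses, labels, per_class=1000):
    """Put the `per_class` highest-loss examples of each digit into the test
    set and everything else into the training set."""
    test_idx = []
    for digit in range(10):
        cls = np.where(labels == digit)[0]
        worst = cls[np.argsort(losses[cls])[-per_class:]]  # highest-loss examples
        test_idx.extend(worst)
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
    return train_idx, test_idx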

When I tried the above idea with a small, single-layer MLP, the results were surprisingly good. The model, while extremely simple, gets 97% test accuracy on the standard MNIST split. On the adversarial split, however, it only gets 81% test accuracy. Additionally, if you actually look at the samples in the training set versus those in the test set, the difference is quite noticeable. The test set consists of a bunch of distorted, smudgy, or partially-drawn digits.

Example images from the adversarial train/test split: training samples (left), testing samples (right).

In a sense, the result I just described was obtained by cheating. I used a certain architecture to generate an adversarial split, and then I used the same architecture to test the effectiveness of that split. Perhaps I just made a split that’s hard for an MLP, but not hard in general. To truly demonstrate that the split is adversarial, I needed to test it on a different architecture. To this end, I tried a simple CNN that gets 99% accuracy on the standard test set. On the MLP-generated adversarial split, this model gets a whopping 87% test accuracy. So, it’s pretty clear that the adversarial split works across architectures.

There are plenty of directions to take this from here. The following approaches could probably be used to generate even worse splits:

  • Use an ensemble of architectures to select the adversarial test set.
  • Use some kind of iterative process, continually making the split worse and worse.
  • Try some kind of clustering algorithm to make a test set that is a fundamentally different distribution than the training set.

Maybe this result doesn’t mean anything, but I still think it’s cool. It shows that the way you choose a train/test split can matter. It also shows that, perhaps, MNIST is not as uniform as people think.

The code for all of this can be found here.

Decision Trees as RL Policies

In supervised learning, there are very good “shallow” models like XGBoost and SVMs. These models can learn powerful classifiers without an artificial neuron in sight. So why, then, is modern reinforcement learning totally dominated by neural networks? My answer: no good reason. And now I want to show everyone that shallow architectures can do RL too.

Right now, using absolutely no feature engineering, I can train an ensemble of decision trees to play various video games from the raw pixels. The performance isn’t comparable to deep RL algorithms yet, and it may never be for vision-based tasks (for good reason!), but it’s fairly impressive nonetheless.

A tree ensemble playing Atari Pong

So how exactly do I train shallow models on RL tasks? You might have a few ideas, and so did I. Today, I’ll just be telling you about the one that actually worked.

The algorithm itself is so simple that I’m almost kind of embarrassed by my previous (failed) attempts at tree-based RL. Essentially, I use gradient boosting with gradients from a policy gradient estimator. I call the resulting algorithm policy gradient boosting. In practice, I use a slightly more complex algorithm (a tree-based variant of PPO), but there is probably plenty of room for simplification.

With policy gradient boosting, we build up an ensemble of trees in an additive fashion. For every batch of experience, we add a few more trees to our model, making minor adjustments in the direction of the policy gradient. After hundreds or even thousands of trees, we can end up with a pretty good policy.
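
To be clear about what I mean, here is a simplified REINFORCE-style sketch of the idea in Python. My actual treeagent code is written in Go and uses a tree-based PPO variant, so treat this as an illustration of policy gradient boosting, not the real implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class TreePolicy:
    """Additive ensemble of regression trees mapping observations to action
    logits, grown by fitting each new tree to a policy-gradient signal."""

    def __init__(self, num_actions, step_size=0.1, max_depth=4):
        self.num_actions = num_actions
        self.step_size = step_size
        self.max_depth = max_depth
        self.trees = []

    def logits(self, obs):
        out = np.zeros((len(obs), self.num_actions))
        for tree in self.trees:
            out += self.step_size * tree.predict(obs)
        return out

    def act(self, obs):
        logits = self.logits(obs)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        actions = np.array([np.random.choice(self.num_actions, p=p) for p in probs])
        return actions, probs

    def update(self, obs, actions, advantages, probs):
        # REINFORCE gradient of log pi(a|s) w.r.t. the logits:
        # advantage * (one_hot(a) - probs).
        grad = -probs * advantages[:, None]
        grad[np.arange(len(actions)), actions] += advantages
        # Boosting step: fit a small tree to the gradient and add it to the
        # ensemble, nudging the logits in the ascent direction.
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(obs, grad)
        self.trees.append(tree)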

Now that I’ve found an algorithm that works pretty well, I want to figure out better hyper-parameters for it. I doubt that tree-based PPO is the best (or even a good) technique, and I doubt that my regularization heuristic is very good either. Yet, even with these somewhat random choices, my models perform well on very difficult tasks! This has convinced me that shallow architectures could really disrupt the modern RL landscape, given the proper care and attention.

All the code for this project is in my treeagent repository, and there are some video demonstrations up on this YouTube playlist. If you’re interested, feel free to contribute to treeagent on Github or send me a PM on Twitter.

Keeping Tabs On All My Neural Networks

When I’m out in public, I look at my watch a lot. It’s not because I’m nervous, or because I’m obsessed with the time. It’s because I’m checking on my neural networks.

At any given moment, I probably have four different machines crunching through ML tasks (e.g. training neural networks or downloading data). To keep tabs on all these machines, I use my own logging system called StatusHub. With StatusHub, I can use my phone, my watch, my tablet, or my laptop to see logs from every job across all of my machines. On my watch, I see a scrollable list that looks like this:

On my phone or laptop, I can see the same log through a web UI:

I can even view the log through the command-line, but I won’t bore you with a picture of that one.

Pushing log messages

You push log messages to a StatusHub server with the sh-log command. Without sh-log, I might train a neural network like so:

$ go run *.go
2017/07/03 18:32:57 done 4002029568 updates: cost=0.108497
2017/07/03 18:32:58 done 4002168832 updates: cost=0.114127
2017/07/03 18:32:59 done 4002308096 updates: cost=0.109726
...

As you can see, the program already produces log messages, but annoyingly they only go to standard error. To push the messages to StatusHub, we can simply use sh-log:

$ sh-log TrainEmbedding go run *.go
2017/07/03 18:32:57 done 4002029568 updates: cost=0.108497
2017/07/03 18:32:58 done 4002168832 updates: cost=0.114127
2017/07/03 18:32:59 done 4002308096 updates: cost=0.109726
...

In the above example, sh-log executes go run *.go and echoes the standard output/error to a StatusHub server (which is configured via environment variables). The first argument to sh-log is the service name, which helps to distinguish between different jobs in the log. If you look back at the screenshots from the beginning of this post, the service names should stand out right away.

The sh-log command also plays nicely with UNIX pipes. If you don’t provide a command for sh-log to run, it reads directly from standard input. For example, this is how I log information about my GPU:

$ nvidia-smi | head -n 9 | tail -n 1 | cut -b 3-77 | sed -e 's/\s\s*/ /g' | sh-log GPU
23% 35C P8 10W / 250W | 63MiB / 12186MiB | 0% Default

Viewing logs

The simplest way to view a log is via the StatusHub web UI or through the Android Wear application. However, StatusHub also ships with some commands for reading and manipulating logs.

To dump the log for a given service, there is the sh-dump command:

$ sh-dump tweetdump
...
-rw-r--r--  1 alex  staff  12141949085 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142001648 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142061169 Jul  3 19:10 tweets.csv
-rw-r--r--  1 alex  staff  12142116283 Jul  3 19:10 tweets.csv

You can also use the sh-stream command to view the output of a service in real time, for example:

$ sh-stream tweetdump
-rw-r--r--  1 alex  staff  12141949085 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142001648 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142061169 Jul  3 19:10 tweets.csv
-rw-r--r--  1 alex  staff  12142116283 Jul  3 19:10 tweets.csv
...

My favorite tool, though, is sh-avg. Using sh-avg, you can compute the averages of numerical fields over the last several log messages. For example, to average the results from the “TrainEmbedding” service:

$ sh-avg TrainEmbedding
size 10: cost=0.108642
size 20: cost=0.108811
size 50: cost=0.108578

You can also specify a particular average size (i.e. the number of log records to average):

$ sh-avg TrainEmbedding 32
size 32: cost=0.108722

If you want to be able to quickly see averages from your phone or smartwatch, you can set up a job to log the averages of another job:

$ while (true) do sh-log EmbedAvg sh-avg TrainEmbedding 30; sleep 30; done

As you can see, StatusHub allows you to be a command-line ninja with magical logging powers.

Going crazy

Once you have basic building blocks like sh-log and sh-stream, the possibilities are boundless.

With a pipe-based IRC client like ii, you can push chat logs to StatusHub in one terminal command. This makes it easy to keep tabs on IRC activity, even from devices without an IRC client (e.g. a smartwatch).

You could also pipe sh-stream into ii in order to send log messages to someone on IRC. This might not seem useful, but it actually could be. For example, say you want to be notified when a process finishes running. You could run this in one terminal:

$ ./some_long_task; sh-log Notify echo 'Task done!'

And then in some other terminal, perhaps on some other machine, run something like this:

$ sh-stream Notify | send-irc-messages

Using StatusHub yourself

The StatusHub repository has official installation instructions, but I thought I’d give a gist here as well. There are really three parts to a successful StatusHub installation:

  1. An sh-server process running on some internet-accessible machine. All of your devices should be able to connect to this machine over HTTP/HTTPS, either through a reverse proxy or via port forwarding.
  2. A set of jobs that do logging via the sh-log command. To have sh-log go to the correct server, you will need to set some environment variables.
  3. One or more devices from which you will consume logs. These devices simply need a browser, but you can also install the StatusHub commands and set up your environment variables accordingly.

Why I’m Remaking OpenAI Universe

OpenAI Universe was very exciting when it first came out. With Universe, it was suddenly possible to train AI to play Flash games, book flights online, and perform various other real-world tasks. Unfortunately, Universe never really took off in the AI world. I don’t know why others aren’t using Universe, but I can tell you why I’m not using it and why I am working hard to build an alternative.

Since Universe has a lot of moving pieces, I’ll give a basic overview of how it works. Every Universe environment runs an instance of Chrome inside a Docker container. An AI interacts with the environment using VNC, which can send keyboard and mouse inputs to Chrome and receive back screen images in real time. To train RL agents on Flash games, Universe uses OCR to read the score off the screen and send it back to the agent as a reward signal. A program in the container controls game flow by looking at the screen and waiting for menus and “game over” pages.

In my view, the biggest problem with Universe is that VNC and Flash need to run in real time. This means that any hiccups on your training machine (e.g. a CPU spike due to a software update) might suddenly change the frame rate at which your AI experiences its virtual environment. It also means that you can’t run Universe at all on a slow machine. This rules out many cloud-hosting instances, for example many EC2 t2 instances.

Since Flash games are hard to reverse-engineer, Universe has no choice but to look at the screen to figure out the state of a game. This approach isn’t very stable. Imagine a game where the “game over” screen fades in gradually, but the “Submit Highscore” button on said screen can be pressed before the fade is complete. If you are running a random agent (one which clicks randomly), it might hit “Submit Highscore” before Universe is able to detect the “game over” screen. The result is that a strange new screen will pop up instead of the “game over” screen, and the Universe infrastructure will crash. I talk more about this flaw in this unresolved Github issue. This is a real issue because most RL algorithms start by taking random actions to explore the environment.

On top of the problems I just mentioned, it seems that OpenAI has internally abandoned Universe. When OpenAI unveiled Universe six months ago, they said the following in their blog post:

In upcoming weeks, we’ll release our environment integration tools, so anyone can contribute new environment integrations.

I guess weeks turned into months turned into never. Those tools are still not available. In the same blog post, they also promised demonstration data:

We’re compiling a large dataset of human demonstrations on Universe environments, which will be released publicly.

That never happened either, which saddens me since I would have loved to get my hands on that data.

Enter μniverse, my alternative to OpenAI Universe. Unlike Universe, μniverse focuses on HTML5 games. Since HTML5 games are written in JavaScript, it is possible to inject new code into them. This makes it much easier (and more reliable) to extract the score and “game over” status from a game. Menu automation is also simpler for HTML5 games, since you can trigger event handlers directly without worrying about the UI. The biggest benefit of HTML5 games, though, is that you can spoof time.

All JavaScript applications use the same APIs for time: Date, setTimeout, requestAnimationFrame, etc. If you inject your own implementation of these APIs, you can completely control how fast time appears to pass. This is exactly what I do in μniverse. Using a time spoofing API, games in μniverse run as fast or as slow as you want them to. You have to explicitly tell the environment to advance time; it won’t happen under your nose. This means that the speed of your machine has no bearing on the AI’s perception of its environment.

In μniverse, it is easy to guarantee that all actions are “safe”. In other words, μniverse ensures that there is nothing the agent can do to escape or break its environment. For example, it is easy to disable things like the “Back to Main Menu” button present in many games. Universe attempted to do this by forbidding clicks at certain locations, but it is undocumented and not fully supported.

I’ve been adding games to μniverse for a few weeks now. Right now I’ve integrated 27 games, and I expect that number to keep rising (update Jun 26, 2017: up to 54 games now). As I go, I’ve been training AI agents on some of the new games. It’s very satisfying to see AI perform better than me at some games. Throughout this process, I have to say that μniverse has felt more stable than Universe (I’ve trained games using both systems). μniverse is also much faster in terms of environment boot times, but that’s just a bonus.

I’d like to close with a nerdy technical note. While μniverse still runs Chrome inside of Docker, it runs Chrome in the new headless mode. This means that VNC is not needed. Instead, μniverse uses the Chrome DevTools protocol to capture screenshots, trigger events, navigate to webpages, and inject JavaScript. Hooking directly into Chrome like this is simpler and much more stable. It feels like a very elegant solution, and it has proven to be so in practice.

The Meta-Learning Quest: Part 1

Over the course of billions of years, a crude meta-learning algorithm was able to produce the brain. This meta-learning algorithm, known as biological evolution, is slow and inefficient. Nevertheless, it produced organisms that can learn to solve complex problems in a matter of seconds. In essence, a slow and inefficient learning algorithm produced a fast and efficient one. That’s the beauty of meta-learning, and it’s what I believe will lead to strong AI.

Today, in 2017, meta-learning is still in its infancy. There are few papers published on the subject, and they all approach fairly easy problems. For more details on this, I have a YouTube video that describes the current state of meta-learning.

This is where my quest comes in. Meta-learning is far from producing strong AI, but I want to take it several steps closer.

The first step in my quest was to develop a meta-learning algorithm that could plausibly scale up to human-level intelligence. To this end, I developed a memory-augmented neural network (MANN) which I call sgdstore. Unlike other MANNs, sgdstore works by training a neural network dynamically and using the ever-changing network as a large memory bank. I have had great successes with sgdstore, and I believe it’s ready for harder challenges.

As a meta-learning algorithm, evolution had to produce adaptable creatures that could succeed in many diverse situations. If we want to use meta-learning to produce strong AI, we will need our own set of diverse challenges for meta-learning to solve. For this set of challenges, I am looking to OpenAI Universe. Universe has over 1,000 virtual environments, many of which are video games. If I could use meta-learning to teach an sgdstore model to play new video games quickly, that would be a huge step forward. After all, life isn’t very far from a video game.

By the way, the idea of using meta-learning on OpenAI Universe is not a brand-new idea. Ilya Sutskever gave a talk in 2016 about this very subject. However, while the idea is simple, there are many technical obstacles in the way. I suspect that Evolution Strategies was part of OpenAI’s approach to overcoming these obstacles. As I will describe, I am taking a slightly different path.

Among the many technical difficulties involved in meta-learning, one is particularly troublesome: memory consumption. My sgdstore model is trained via back-propagation, the traditional algorithm for training neural networks. As it runs, back-propagation needs to store intermediate values. This means that, as episodes of training get longer and longer, memory consumption grows worse and worse. To give you an idea, the meta-learning tasks with which I’ve tested sgdstore are on the order of 100 time-steps. If a video game is played at 10 frames-per-second, then 5 minutes of play would amount to 3,000 time-steps. Without any modification to my algorithm, this would be infeasible.

Luckily, there are algorithms to make back-propagation use asymptotically less memory. In the extreme, it’s possible to make memory consumption grow logarithmically with episode length (a huge win). I actually thought of that algorithm myself, but of course–knowing my luck–it was already known. I went ahead and implemented the algorithm in a Github repo called lazyrnn. I am confident that it will come in handy.
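
For the curious, the divide-and-conquer recomputation scheme looks roughly like this sketch. The toy cell is just a stand-in for the real recurrent model, and the names are mine; the point is that only one state per recursion level is kept, so memory grows logarithmically with episode length (at the cost of an extra log factor of compute).

import torch

# Toy recurrent cell; a stand-in for the real sgdstore/LSTM step.
hidden, in_dim, seq_len = 32, 16, 1024
W_h = torch.randn(hidden, hidden, requires_grad=True)
W_x = torch.randn(in_dim, hidden, requires_grad=True)

def step(state, x):
    return torch.tanh(state @ W_h + x @ W_x)

def forward_no_grad(state, inputs):
    # Run forward without storing intermediate activations.
    with torch.no_grad():
        for x in inputs:
            state = step(state, x)
    return state

def backward_segment(state, inputs, grad_out):
    """Backpropagate grad_out (the gradient w.r.t. the state after `inputs`)
    through the segment, returning the gradient w.r.t. `state`. Parameter
    gradients accumulate in .grad as a side effect. Memory usage grows with
    the recursion depth, i.e. O(log len(inputs))."""
    if len(inputs) == 1:
        state = state.detach().requires_grad_(True)
        out = step(state, inputs[0])
        out.backward(grad_out)
        return state.grad
    mid = len(inputs) // 2
    mid_state = forward_no_grad(state, inputs[:mid])
    grad_mid = backward_segment(mid_state, inputs[mid:], grad_out)
    return backward_segment(state, inputs[:mid], grad_mid)

# Usage: gradient of a dummy loss (sum of the final state) over 1024 steps.
inputs = [torch.randn(1, in_dim) for _ in range(seq_len)]
state0 = torch.zeros(1, hidden)
final = forward_no_grad(state0, inputs)
backward_segment(state0, inputs, torch.ones_like(final))
# W_h.grad and W_x.grad now hold the full BPTT gradients.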

As an aside, OpenAI’s recent variant of Evolution Strategies addresses the memory issue in a different way. By avoiding back-propagation, ES does not need more memory for longer episodes. However, there are other difficulties with ES that make me shy away from it. For one, it seems to depend greatly on network parameterization and weight decay, both things which I doubt will lend themselves nicely to Recurrent Neural Networks like sgdstore or LSTM.

Now that I have a way of training sgdstore on long episodes (lazyrnn), and I have a source of simulated experience (OpenAI Universe), I am left with one final task. Since the meta-learner will be performing reinforcement learning (learning from a reward signal), I need to implement a powerful RL algorithm. For this, I plan to use Trust Region Policy Optimization (TRPO). TRPO is pretty much the state of the art when it comes to RL. Today and yesterday, I have been implementing forward automatic-differentiation so that I can compute Fisher-vector products for TRPO.
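
For context, a Fisher-vector product can also be computed without forward-mode AD, using the double-backprop trick that many TRPO implementations rely on. In PyTorch-style autodiff (not my own framework, which is in Go), that looks roughly like this; all names here are mine:

import torch

def fisher_vector_product(kl, params, vector, damping=1e-2):
    """Compute F v, where F is the Fisher matrix of the policy.

    kl: the mean KL divergence between the current policy and a detached copy
        of itself, so its gradient is zero but its Hessian is the Fisher matrix.
    params: list of policy parameters.
    vector: flat tensor with one entry per parameter element."""
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_dot_v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * vector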

I hope to have TRPO implemented within the next few days, at which point I will begin my meta-learning experiments. With any luck, sgdstore will be able to learn something. From my experience, though, I doubt anything will work on the first try. I’m sure many challenges await me.


Slice Aliasing Is Nicer Than You Realize

Most languages handle array slicing in one of two ways: with copying or with aliasing. JavaScript copies:

var arr = [1, 2, 3];
var arr1 = arr.slice(1, 2);
arr[1] = 0;
console.log(arr1); // prints [ 2 ]

Go, on the other hand, aliases:

arr := []int{1, 2, 3}
arr1 := arr[1:2]
arr[1] = 0
fmt.Println(arr1) // prints [0]

Until yesterday, my ML framework’s vector library did not support aliasing. When you sliced a vector, you created a copy. I thought this would make things easier for a few reasons. For one, aliasing means that you have to worry about overlapping memory regions. Have you ever tried to implement your own version of memmove()? It’s not fun. However, after months of working with my framework, it became abundantly clear that aliasing would be nice to have.

In my view, the biggest merit of aliasing is performance. Suppose I have a highly-optimized function dot(x, y) which computes the dot product of two vectors (i.e. arrays of numbers). If I want to use this to compute a matrix-vector product, I can call dot once for each row of the matrix. But wait: how do I extract each row from the matrix? If this is Go, I can create a reference to the row in constant time without copying any memory. If this is JavaScript, on the other hand, I have to use slice() to make a copy of the row. That’s obviously worse for performance, especially with GC overhead.

Suppose I wanted to optimize the dot function in JavaScript by adding offset arguments, such as dot(x, xOffset, y, yOffset). Now I could use the offset arguments instead of slicing. But wait, I just added two more arguments to my function! And what’s more, I have to add offset arguments to every similar math function I want to optimize. As someone who maintains a vector manipulation library, that seems like a lot of extra clutter.

As a note, Java suffers from exactly what I described above. Since there’s no array aliasing, the Arrays class is cluttered with method variants like fill(short[] a, int fromIndex, int toIndex, short val). If Java arrays worked like Go slices, you wouldn’t need those pesky index arguments.

You could avoid the extra arguments by wrapping every array in a type that looks something like this:

class SlicedVector {
  int      start;
  int      length;
  double[] backingArray; // or whatever element type you need
}

If all your APIs took these structures instead of arrays, then you could slice for free. But now you’ve implemented aliasing! You would have all the same problems you’d have with aliasing: overlapping array arguments, weird bugs relating to mutability, etc. So why not just use aliasing from the start?

Overall, it’s surprising how little I thought about aliasing before creating my own vector library. Usually it’s not an issue. It was only yesterday, when I profiled a program and found that 30% of its overhead was coming from vector slicing, that I finally decided in favor of aliasing.