The Meta-Learning Quest: Part 1

Over the course of billions of years, a crude meta-learning algorithm was able to produce the brain. This meta-learning algorithm, known as biological evolution, is slow and inefficient. Nevertheless, it produced organisms that can learn to solve complex problems in a matter of seconds. In essence, a slow and inefficient learning algorithm produced a fast and efficient one. That’s the beauty of meta-learning, and it’s what I believe will lead to strong AI.

Today, in 2017, meta-learning is still in its infancy. There are few papers published on the subject, and they all approach fairly easy problems. For more details on this, I have a YouTube video that describes the current state of meta-learning.

This is where my quest comes in. Meta-learning is far from producing strong AI, but I want to take it several steps closer.

The first step in my quest was to develop a meta-learning algorithm that could plausibly scale up to human-level intelligence. To this end, I developed a memory-augmented neural network (MANN) which I call sgdstore. Unlike other MANNs, sgdstore works by training a neural network dynamically and using the ever-changing network as a large memory bank. I have had great successes with sgdstore, and I believe it’s ready for harder challenges.

As a meta-learning algorithm, evolution had to produce adaptable creatures that could succeed in many diverse situations. If we want to use meta-learning to produce strong AI, we will need our own set of diverse challenges for meta-learning to solve. For this set of challenges, I am looking to OpenAI Universe. Universe has over 1,000 virtual environments, many of which are video games. If I could use meta-learning to teach an sgdstore model to play new video games quickly, that would be a huge step forward. After all, life isn’t very far from a video game.

By the way, the idea of using meta-learning on OpenAI Universe is not a brand-new idea. Ilya Sutskever gave a talk in 2016 about this very subject. However, while the idea is simple, there are many technical obstacles in the way. I suspect that Evolution Strategies was part of OpenAI’s approach to overcoming these obstacles. As I will describe, I am taking a slightly different path.

Among the many technical difficulties involved in meta-learning, one is particularly troublesome: memory consumption. My sgdstore model is trained via back-propagation, the traditional algorithm for training neural networks. As it runs, back-propagation needs to store intermediate values. This means that, as episodes of training get longer and longer, memory consumption grows worse and worse. To give you an idea, the meta-learning tasks with which I’ve tested sgdstore are on the order of 100 time-steps. If a video game is played at 10 frames-per-second, then 5 minutes of play would amount to 3,000 time-steps. Without any modification to my algorithm, this would be infeasible.

Luckily, there are algorithms to make back-propagation use asymptotically less memory. In the extreme, it’s possible to make memory consumption grow logarithmically with episode length (a huge win). I actually thought of that algorithm myself, but of course–knowing my luck–it was already known. I went ahead and implemented the algorithm in a Github repo called lazyrnn. I am confident that it will come in handy.

As an aside, OpenAI’s recent variant of Evolution Strategies addresses the memory issue in a different way. By avoiding back-propagation, ES does not need more memory for longer episodes. However, there are other difficulties with ES that make me shy away from it. For one, it seems to depend greatly on network parameterization and weight decay, both things which I doubt will lend themselves nicely to Recurrent Neural Networks like sgdstore or LSTM.

Now that I have a way of training sgdstore on long episodes (lazyrnn), and I have a source of simulated experience (OpenAI Universe), I am left with one final task. Since the meta-learner will be performing reinforcement learning (learning from a reward signal), I need to implement a powerful RL algorithm. For this, I plan to use Trust Region Policy Optimization (TRPO). TRPO is pretty much the state of the art when it comes to RL. Today and yesterday, I have been implementing forward automatic-differentiation so that I can compute Fisher-vector products for TRPO.

I hope to have TRPO implemented within the next few days, at which point I will begin my meta-learning experiments. With any luck, sgdstore will be able to learn something. From my experience, though, I doubt anything will work on the first try. I’m sure many challenges await me.