Prierarchy: Implicit Hierarchies

At first blush, hierarchical reinforcement learning (HRL) seems like the holy grail of artificial intelligence. Ideally, it performs long-term exploration, learns over long time horizons, and has many other benefits. Furthermore, HRL makes it possible to reason about an agent’s behavior in terms of high-level symbols. However, in my opinion, pure HRL is too rigid and doesn’t capture how humans actually operate hierarchically.

Today, I’ll propose an idea that I call “the prierarchy” (a combination of “prior” and “hierarchy”). This idea is an alternative way to look at HRL, without the need to define rigid action boundaries or discrete levels of hierarchy. I am using the prierarchy in the Unity Obstacle Tower Challenge, so I will keep a few details close to my chest for now. However, I intend to update this post once the contest is completed.

Defining the Prierarchy

To understand the prierarchy framework, let’s consider character-level language models. When you run a language model on a sentence, there are some parts of the sentence where the model is very certain what the next token should be (low-entropy), and other parts where it is very uncertain (high-entropy). We can think of this as follows: the high-entropy parts of the sentence mark the starts of high-level actions, while the low-entropy parts represent the execution of those high-level actions. For example, the model likely has more entropy at the starts of words or phrases, and less entropy near the ends of words (especially long, unique words). As a consequence of this view, prompting a language model is the same thing as initiating a high-level action and watching the model carry it out.
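To make this concrete, here is a small sketch (Python/NumPy) of how you might read per-character entropies off a language model. The `char_lm` in the usage comment is a placeholder for whatever character-level model you have, not part of any specific library.

```python
import numpy as np

def next_char_entropies(logits):
    """Per-position entropy (in nats) of a character-level model's
    next-character distributions. `logits` has shape (seq_len, vocab_size)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

# Hypothetical usage, where `char_lm` maps a string to per-position logits:
#   entropies = next_char_entropies(char_lm("the quick brown fox"))
# High-entropy positions can be read as the starts of "high-level actions";
# the low-entropy stretches that follow are their execution.
```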

Under the prierarchy framework, we initiate “high-level actions” by taking a sequence of low-probability, low-level actions that kicks off a long sequence of high-probability, low-level actions. This assumes that we have some kind of prior distribution over low-level actions, possibly conditioned on observations from the environment. This prior essentially encodes the lower levels of the hierarchy that we want to control with a high-level policy.

The nice thing about the prierarchy is that we never have to make any specific assumptions about action boundaries: the “high-level actions” aren’t real or concrete; they are just our interpretation of a sequence of continuous entropy values. This means, for one thing, that neither the low-level nor the high-level policy has to worry about stopping conditions.

Practical Application

In order to implement the prierarchy, you first need a prior. A prior, in this case, is a distribution over low-level actions conditioned on previous observations. This is exactly a “policy” in RL. Thus, you can get a prior in any way you can get a policy: you can train a prior on a different (perhaps denser but misaligned) reward function; you can train a prior via behavior cloning; you can even hand-craft a prior using some kind of approximate solution to your problem. Whichever method you choose, you should make sure that the prior is capable of performing useful behaviors, even if those behaviors occur in the wrong order, for the wrong duration of time, etc.
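As a concrete instance of the behavior-cloning option, here is a minimal sketch in PyTorch. The observation size, action count, network, and the source of the (observation, action) batches are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

# Assumed setup: flat 64-dim observations, 8 discrete actions.
prior = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-3)

def behavior_clone_step(obs_batch, action_batch):
    """One supervised step toward a prior: maximize the log-likelihood of
    demonstrated (or scripted) actions given observations."""
    logits = prior(obs_batch)                       # (batch, num_actions)
    loss = nn.functional.cross_entropy(logits, action_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```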

Once you have a prior, you can train a controller policy fairly easily. First, initialize the policy with the prior, and then run a policy gradient algorithm with a KL regularizer KL(policy || prior) in place of an entropy bonus. This algorithm is basically saying: “do what the prior says, and inject information into it when you need to in order to achieve rewards”. Note that, if the prior is the uniform distribution over actions, then this is exactly equivalent to a traditional policy gradient algorithm with an entropy bonus.
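Here is a rough sketch of that objective for discrete actions, using a REINFORCE-style estimator in PyTorch. The KL coefficient, advantage estimates, and logit shapes are placeholders, not a prescription:

```python
import torch

def kl_regularized_pg_loss(policy_logits, prior_logits, actions, advantages,
                           kl_coef=0.01):
    """Policy-gradient loss with KL(policy || prior) in place of an entropy bonus.
    With a uniform prior, the KL term equals negative entropy plus a constant,
    recovering the usual entropy-regularized objective."""
    policy_logp = torch.log_softmax(policy_logits, dim=-1)   # (batch, num_actions)
    prior_logp = torch.log_softmax(prior_logits, dim=-1).detach()
    chosen_logp = policy_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages.detach() * chosen_logp).mean()
    kl = (policy_logp.exp() * (policy_logp - prior_logp)).sum(dim=-1).mean()
    return pg_loss + kl_coef * kl
```

Since the policy is initialized with the prior’s weights, the KL term starts at zero and only grows as the controller injects information to earn reward.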

If the prior has low entropy, then you should be able to significantly increase the discount factor. This is because the noise of policy gradient algorithms scales with the amount of information that is injected into the trajectories by sampling from the policy. Also, if you select a reasonable prior, then the prior is likely to explore much better than a random agent, which can be useful in very sparse-reward environments.
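One way to make “the amount of information injected” concrete is to measure the per-step KL from the policy to the prior; summing it over a trajectory gives the total information the controller adds on top of the prior. A small sketch, reusing the same kind of logits as above:

```python
import torch

def injected_information(policy_logits, prior_logits):
    """Per-step KL(policy || prior) in nats: how much the controller deviates
    from the prior at each step, plus the trajectory total."""
    policy_logp = torch.log_softmax(policy_logits, dim=-1)
    prior_logp = torch.log_softmax(prior_logits, dim=-1)
    per_step = (policy_logp.exp() * (policy_logp - prior_logp)).sum(dim=-1)
    return per_step, per_step.sum()
```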

Let’s look at a simple (if not contrived) example. In RL, frame-skip is used to help agents take more consistent actions and thus explore better. It also helps for credit assignment, since the policy makes fewer decisions per span of time in the environment. It should be clear that frame-skip defines a prior distribution over action sequences, where every Nth action completely determines the next N-1 actions. My claim, which I won’t demonstrate today, is that you can achieve a similar effect by pre-training a recurrent policy to mimic the frame-skip action distribution, and then applying a policy gradient algorithm to this policy with an appropriately modified discount factor (e.g. 0.99 might become 0.99^0.25) and a KL regularizer against the frame-skip prior. Of course, it would be silly to do this in practice, since the prior is so easy to bake into the environment.
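To spell out the discount adjustment: with frame-skip N, one decision spans N environment frames, so a per-decision discount of 0.99 corresponds to a per-frame discount of 0.99^(1/N). A tiny sanity check, assuming N = 4:

```python
# Per-decision discount under frame-skip 4, and the equivalent per-frame discount.
gamma_per_decision = 0.99
N = 4
gamma_per_frame = gamma_per_decision ** (1.0 / N)   # 0.99 ** 0.25 ≈ 0.9975
# Four frames at ~0.9975 discount as much as one decision at 0.99:
assert abs(gamma_per_frame ** N - gamma_per_decision) < 1e-12
```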

There are tons of other potential applications of this idea. For example, you could learn a better action distribution via an evolutionary algorithm like CMA-ES, and then use this as the prior. This way, the long-term exploration benefits of Evolution Strategies could be combined with the short-term exploitation benefits of policy gradient methods. One could also imagine learning a policy that controls a language model prior to write popular tweets, or one that controls a self-driving car prior for the purposes of navigation.

Conclusion

The main benefit of HRL is really just that it compresses sequences of low-level actions, resulting in better exploration and less noise in the high-level policy gradient. That same compression can be achieved with a good low-level prior, without the need to define explicit action boundaries.

As a final note, the prierarchy training algorithm looks almost identical to pre-training + fine-tuning, indicating that it’s not a very special idea at all. Nonetheless, it seems like a powerful one, and perhaps it will save us from wasting a lot of time on HRL.