The goal of this essay is to help you understand EfficientZero, a reinforcement learning agent that obtains better-than-human median performance on a set of 26 Atari games after just two hours of real-time experience playing each game.

Specifically, it gets 116% of human median performance on the data-limited Atari 100k benchmark. The previously-best algorithm only reached about 41% of median human performance, so this is a reasonably large leap.

Chart stolen from paper

Chart of performance

The benchmark is called 100k because agents only interact with the environment for 100,000 steps -- about two hours. Note also that the human benchmarks were also set after the humans in question had about two hours of experience on the game. So EfficientZero seems to -- at least on this set of games -- exceed humans in sample efficiency specifically.

This is particularly impressive when you recall that, going into this, the agents were entirely ignorant of anything whatsoever about the world. The networks of their brains were initialized with random values. Humans manage to do pretty well on Atari after two hours, but we've pretrained in the actual world for many years, which lets us apply analogies from this world, experience from other video games, and so on. This agent managed to do comparably well with none of these advantages.

(Granted, the 100k benchmark focuses on Atari environments which are relatively easy to make progress in, because it was meant to be used for sample-efficiency benchmarks. It excludes extremely-difficult-to-explore environments like Montezuma's Revenge, where the first reward is quite hard to get.)

So. After reading this, you should understand how EfficientZero works, and how the changes in EfficientZero improve upon its predecessors. You should also know what further improvements to it are likely in the coming 3 to 24 months.

My target reader is someone who can program, or is at least broadly technically literate. They should be reasonably familiar with the basic principles of supervised machine learning, but they do not need to be particularly up-to-date with reinforcement learning. Reading this will nevertheless be heavy going if you aren't at least a little familiar with reinforcement learning; expect to have to reread some stuff.

I have aggressively and in some cases silently cut content that is not necessary for understanding the contributions of EfficientZero. For instance, in my explanation of how EfficientZero's predecessor MuZero works, I have ignored major elements of MuZero that EfficientZero does not modify. I have also only explained the version of MuZero ("MuZero Reanalyse") that EfficientZero builds upon. And finally in some places I'm confused about why techniques work as well as they do; I've noted them rather than try to smooth them over.

I'll divide this into five sections.

Reinforcement Learning Primer: A quick refresher on standard terminology and methods in RL, just for context. You can skip this, if you're remotely familiar with RL. What is the kind of problem that RL solves?
Historical background: EfficientZero is a relatively natural advance in model-based RL, given prior advances in model-free RL and elsewhere. It's useful to look at the content of these other advances, in order to understand EfficientZero and the conditions of EfficientZero's existence.
MuZero: MuZero is the 2019 algorithm which extended the Go-playing algorithm AlphaGo to single-player, non-zero sum games like Atari. How does MuZero work?
EfficientZero: EfficientZero is basically three independent modifications to MuZero stacked on top of each other. What do these three independent changes do?
Conclusion. What kind of work is likely to follow EfficientZero? Does EfficientZero provide any evidence about "how hard" it is to improve reinforcement learning right now?

1: Reinforcement Learning Primer

Alright, a quick review of the basics. If you follow RL at all, you've seen all the information below so many times at the start of various papers that it's stuck painfully to your retina. So feel free to skip.

(If you want a longer version, Sutton and Barto's "Reinforcement Learning: An Introduction" is the standard, excellent introductory text.)

The basic abstraction of a reinforcement learning problem is as follows.

There is an agent and an environment. The agent and an environment interact over a series of episodes.

Each episode has some finite number of timesteps, counting from 0 to . These episodes each correspond to a thing like "a game of chess" or "a level of a video game" or "an attempt to put a product in a box in a warehouse" and so on. They're always considered to be causally isolated from each other, except for the fact that an agent can learn from one to the next.

At every timestep $t$ within such an episode, the environment emits some observation and reward to the agent, $o_{t}$ and $r_{t}$ . The observation is what the agent perceives; it's what the agent sees, hears, or feels; we usually represent it with a big matrix or tensor of numbers. The reward, on the other hand, is a single number that's either positive or negative, corresponding roughly to pleasure or pain.

(The observations usually need to be rolled up into a state $s_{t}$ that contains all the relevant information for acting. This is often produced by stacking the last 4 to 10 observations — frames of video in a video game, for instance — on top of each other and saying "This is probably enough." Sometimes, the state is instead produced by a sequence-to-fixed-length-vector neural network.)

In response to the observation / state, the agent emits some action, $a_{t}$ . The environment processes the action, and emits the next observation and reward, $o_{t + 1}$ and $r_{t + 1}$ . And so on until the episode ends.

Image from aforementioned Sutton and Barto; every RL article is contractually obligated to have some version of this image in it

Positive reward is good, obviously, and higher positive reward is better.

The agent doesn't want to maximize the immediate per-frame reward, though. If you chose an action which gets you 100 reward in the next frame, but 0 reward for each of a thousand frames afterwards, that's obviously inferior to an action which would get you 0 reward in the next frame, but 100 reward for each of a thousand frames afterwards. The marshmallow now might be inferior to the two marshmallows later. Maybe.

So instead the agent wants to maximize the sum of future expected reward, which is called the return, which I'll abbreviate $G$ because otherwise "return" and "reward" will get confused and also because everyone else does it. This is the total of $r_{t} + r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} . . .$ until the end of the episode.

(The rewards that sum to make the return $G$ are normally discounted according to how far in the future they are.

But in my first draft of this essay, the time-discount variable just cluttered up many subsequent formulas, which I am trying to keep simple; also, it provided no EfficientZero-specific insight. So I'll falsely pretend for the rest of the essay that EfficientZero uses an undiscounted reward, and present the math accordingly -- including in my definitions for state-value and action-value functions, which is honestly a pretty sketchy decision on my part.)

Ahem. The total return mentioned above, $r_{0} + r_{1} + r_{2} + r_{3} + . . . r_{T}$ can be more compactly defined using a sum notation.

$Return = G = T \sum t = 0 r_{t}$

Return, then, is what the RL agent wishes to maximize.

In order to do this, the agent must have a policy. The policy of an agent is just whatever makes decisions for it.

You can think of an agent’s policy as a mapping from a state to an action, or from a history of observations to an action, or from a state to an action distribution. It is usually represented by the symbol $π$ . The internals of the policy could be anything: it could include a learned model of the dynamics of the environment, it might use the state-value function defined below, it could be entirely random, and so on.

For all of what follows, I'll assume that the agent is taking discrete actions; i.e., it will choose one element out of a finite, predefined set of possible actions to perform. For instance, in a side-scrolling video game, the set might include go left, go right, jump, and attack.

Specifically, we'll represent the output of a policy as a distribution over actions. Thus, if an agent could do one of go left, go right, or be still, a policy might output [0.33, 0.33, 0.33] to indicate that it is ambivalent between the actions, or [0.9, 0.05, 0.05] to indicate it has a strong preference for one or another. Then when actually acting, the agent samples randomly from this policy distribution.

Because of this, you can think as a policy as a function taking a state and an action, and returning a probability -- $π (s_{t}, a_{t}) \to R$ -- or as a function taking a state and returning something like an array -- $π (s_{t}) = [0.2, 0.6, 0.2]$ .

Policies, of course, can be better or worse. The optimal policy is one for which from any state no other policy can lead to greater return, in expectation.

I introduce some more terms below. But these terms are somewhat less important. They are things that are useful for determining how to maximize return. But they're ultimately only useful if they help us maximize it. They are calculated on the way to maximizing reward by many current algorithms -- but a perfect intelligence coded by someone with a textbook from the future might not explicitly use them at all, because by then we might have discovered better abstractions.

Indeed, some ways of addressing the RL problem don't use them at all today, such as evolutionary methods or methods derived from supervised learning. For our purposes, however, they're relatively important.

Two of these things are a policy's state-value function and action-value function.

The state value function, $V_{π} (s_{t})$ , returns the expected total future return, given a particular policy, from a particular state.

Intuitively it's something like "given where I am, and given how I act, am I likely to be rewarded or punished?" The state-value function for a game of chess, given that you're an average player playing against an average player, is probably high if you're 8 points worth of pieces ahead; it's probably low if you're 8 points of pieces behind.

On the other hand, the state-value function only exists relative to a particular policy. If Magnus Carlsen is dropped into a game of chess against me where he starts 8 points behind, his state-value function for that game would still correctly return a high value, because he's just that much better at chess than me.

You can define it compactly thus, where E stands for expected value:

$V_{π} (s_{t}) = E [G | s_{t}, π] = E [T \sum i = t r_{i} | s_{t}, π]$ .

To repeat myself: It answers the question, given that you are following such and such a policy from such and such a state, how much reward do you expect?

The action-value function is a very similar concept. It returns the expected future return, given a particular policy, from a particular state and from a particular action.

Or, in short, it's exactly the same as the state-value function except instead of drawing the next action from the distribution of $π$ , the next action is already specified.

$Q_{π} (s_{t}, a_{t}) = E [G | s_{t}, a_{t}, π] = E [T \sum i = t r_{i} | s_{t}, a_{t}, π]$

For example: Suppose I am riding a bike. The action-value function, given the state "riding a bike" and the action "suddenly twisting the bike handlebars all the way to the left", probably returns a lower number than if it is given the state "riding a bike" and the action "riding like a normal human being," because twisting the handlebars is likely to make me crash and crashing the bike will cause me pain.

Note that, importantly, if we have the action-value function for the optimal policy, then we have everything we need to act perfectly.

From each state, we can just check each possible action against the action-value function $q (s_{t}, a_{t})$ , and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.

Knowing the state value function for the optimal policy does not automatically let you take the best action. Can you see why that is? Think about it for a moment. What else would you need?

In order to choose the best action from the state value function, you would need a model. A model is a function with pretty much the same type signature as the environment -- it takes a state $s_{k}$ and an action $a_{t}$ and returns the estimated future state $s_{k}$ and / or a reward $r_{t}$ . If you have a good model, then you can use the state value function to choose the best action, by running it over the predicted states following from different actions and choosing the action associated with the best state. A model is what lets you map a state to other states though actions, and then pick the action that generates a state with highest expected value.

(Like how in Chess, your gut feel about the goodness of the position of the board corresponds to a state-value function. You consider a few moves you could make, and maybe a few trees of moves several levels down the game tree, and then choose the move that seems to lead to the best state-value function.)

If a reinforcement learning agent uses a model it is model-based; if it doesn’t, it is model-free.

Model-based agents are attractive for several reasons. They let you plan from among several imagined futures, which seems like something an actually intelligent agent would be able to do. But in fact they're pretty hard to put together well.

One reason for this is that if you train a model in the naive way -- just predict some observation $o_{t + 1}$ given a history of observations -- most of what it learns is irrelevant for planning. Predicting every future pixel of a video feed on a self driving car -- including the clouds, the waving tree branches, the little argumentative marital drama going on in the car in front of you -- means that you're spending the vast majority of your model's capacity on things that are irrelevant for actually planning out the route of the self-driving car. You want some way to only predict parts of the world that matter, but it's hard to put that into math.

Ok. Given the above understanding of terms, we're now a little better equipped to understand the historical background for EfficientZero, and EfficientZero itself.

2: Historical Background

Here's a brief illustration of some prior results that influenced EfficientZero. (Note that this is by no means a complete summary.)

Relevant Papers

There are three basic points about what came before EfficientZero that I want to make here.

First, EfficientZero is mostly a continuation of the AlphaGo / AlphaZero / MuZero line of model-based learning research.

You almost certainly remember this line of research, because in 2016 AlphaGo handily defeated many-times-world-champion Lee Sedol at Go, which had until that date evaded computer dominance. (And AlphaGo was of course memorably enshrined as maaaaaybe smoke-under-the-door for AGI.)

AlphaGo was initially trained to imitate human Go games, however. Its initial policy was trained to predict the moves of expert human Go players, given a board history. Shortly after AlphaGo, however, DeepMind created an agent which was trained entirely off games it played against itself, without human data to imitate -- and it was therefore named AlphaZero. This algorithm also obtained superhuman performance at Chess and Shogi.

Both AlphaGo and AlphaZero, however, were limited in one particular and important way: they had to possess hard-coded models of the game.

As model-based reinforcement learning algorithms, AlphaGo and AlphaZero both had a predictive model of the environment. This model told them exactly what the next state of the environment would be conditional on some action. This model was itself hard-coded into them, and not learned. AlphaGo and AlphaZero used neural networks to determine which branches within the game tree constructed by this model were promising to explore. They learned, for instance, state-value functions to evaluate how likely they were to win from various hypothetical positions, by training on who eventually won in their self-play games. But what branches in the tree were actually possible was hard-coded by the game model.

Of course, it's very easy to code a predictive model of an environment for a deterministic, two player game of perfect information. But it does greatly limit the domain to which such an algorithm can be applied.

MuZero extended AlphaZero to domains where it does need to learn its model, and thereby extended the applicability of this kind of research. That is, MuZero starts entirely ignorant about what the dynamics of the world are. It learns to predict the next state of the world, conditional on some action, by encountering that world. (Or at least, it learns to predict aspects of the world; more on that later.) And it uses its model of the world to build the transitions over edges in its tree search.

MuZero was able to do well at Go, Chess, Shogi, and Atari; it did not lose its faculty for Go by gaining the ability to play Atari. (It in fact did better at Go than AlphaZero, which is honestly still kinda confusing for me.)

And EfficientZero is simply MuZero with three further modifications, which I will discuss later.

That's the "primary" research lineage for EfficientZero, if you had to choose one.

But it's not the only line of research that EfficientZero draws from. In the blue in the image above, you can see an important line of model-free research that precedes EfficientZero.

With the exception of Alpha* based research, model-free algorithms are generally the ones which have been used in big prestige projects like Starcraft or Dota2.

In model-free RL agents, the generalizing power of a neural network is used to make the agent do the-kind-of-actions-that-lead-to-more-rewards-in-similar-situations-in-the-past, rather than helping the agent choose actions that lead to the best outcomes in a series of imagined future trajectories.

For instance, one common way that model-free RL agents learn is by starting off by learning the aforementioned action-value function, $Q (s_{t}, a_{t})$ for the random policy. That is, the agent does random things; and the agent tracks locations in the world where particular random actions cause it pleasure or pain. Pressing your hand down on a glowing stove causes pain, for instance.

But by acting greedily with respect to this action-value function, then, and doing the kinds of action that cause pleasure and avoiding those that cause pain, such agents can improve their own policy. You can remember that pressing your hand down on a glowing stove causes pain, and not do that.

This in turn alters the true values for the action-value function, which the agent is learning, and so on and so forth, backwards and forwards, with pleasure and pain moving backward through the agent's anticipations; until eventually, in theory, the agent arrives at the optimal policy. This iteration between learning an action-value function for a bad policy and using it to improve the policy is called policy iteration, and is fundamental to a lot of reinforcement learning.

The specific 2021 model-free paper from which EfficientZero draws is "Data-Efficient Reinforcement Learning with Self-Predictive Representations" (henceforth SPR) which achieved the prior state-of-the-art data efficiency result on Atari 100k, while learning an action-value function.

How did SPR do this?

While learning the action-value function, pretty much all model-free algorithms have to learn a "representation" of the world as an intermediate step. This representation (hopefully) summarizes the relevant features of an agent's observations, to let the agent make decisions easily; when driving a car, your representation of the world should make "Is there a car driving towards me in the wrong lane" a readily-accessible fact. (Although one can hope that you didn't acquire the representation through model-free learning, which would be both hazardous and expensive.)

What SPR did was use feature-learning techniques from supervised learning to make the representation of the world learned by $Q (s_{t}, a_{t})$ also predictive of the representation of the world in subsequent time-steps. The way you think about the world now, should probably provide some hints about the way you'll think about the world tomorrow. As the paper says, "we start with the intuition that encouraging state representations to be predictive of future states given future actions should improve... data efficiency." This turned out to be true; SPR had a 55% better score than any prior agent on the 100k Atari task.

Are you confused? Doesn't that sound something like model-based learning, to anticipate the future? Good job noticing that, if so! SPR borrows somewhat from model-based learning, in that its state representation is predictive of future state representations. Or, slightly more precisely, its state representation is predictive of an (initially) random projection of the future state representation.

This takes us to the second point: the three-word summary of EfficientZero is "MuZero plus SPR".

This is a slight oversimplification. EfficientZero makes three large changes, of which only the largest is from SPR. And this largest change is not absolutely identical between the two. But nevertheless, if we allow lossy compression, saying that EfficientZero is "MuZero + SPR" is a pretty close summary of the thing.

So this is the history of EfficientZero -- it has MuZero as it's most recent ancestor on its model-based side, and SPR as it's most recent ancestor on its model-free side. So let's take a look at how MuZero works.

3: MuZero: How It Works

According to Julian Schrittweiser, one of the chief authors of MuZero, the mu in the title stands for at least three things.

First, it can be pronounced like Japanese for "dream", which makes sense because MuZero learns by imagining / dreaming about past experience. Second, it can be pronounced like the Greek character μ, which makes sense because that character frequently stands for a model. And finally, it can be pronounced like Japanese for "void" or "nothing", which makes sense because it receives no rules about the environment before it begins to learn.

(This last use for mu, of course, is the same as that in the first koan of "The Gateless Gate," a classic work of Zen Buddhism. Here, mu is given as the answer to the question "Does a dog have the Buddha nature?" The applicability to AI is left as exercise to the reader.)

Anyhow. My explanation of MuZero will proceed as follows.

First, I'm going to describe how the neural network within MuZero works, supposing (falsely) that it happened to start out perfectly trained. This will help us understand the intended semantics.

Second, I'm going to explain how MuZero can use an (imperfectly trained) neural network to produce policy distributions and value functions better than those that the neural network produces by itself. The method by which MuZero does this is called Monte-Carlo Tree Search (MCTS), although this name is quite misleading.

And third and finally, I'll explain how MCTS and the neural network interact with the environment to produce training data, which can be used to train the neural network towards the optimal policy.

(Note that in all of the below, although I refer to "MuZero" for brevity, I'm universally referring to the much, much more sample-efficient version "MuZero Reanalyse." This was introduced in the same paper as MuZero and later expanded further by DeepMind.)

3.1: The Neural Network

For all this section, I'll pretend that the neural network MuZero uses is already trained. This will help us understand the semantics for it.

There are three sub-networks that make up MuZero's neural network. Ultimately, these are all just pieces of one big, end-to-end differentiable and end-to-end trained network. But it's easiest to understand it as broken into several bits. (In all that follows I'll subscript the components of the neural network with $θ$ to indicate that all these pieces are defined by the parameters of the network, $θ$ .)

The first is the representation function. It takes the current and prior observations, and emits a state. This state is the input for the other two networks.

If we call the representation function network $h_{θ}$ , then you can picture it thus:

H-Theta Image

So $h_{θ} (o_{t}, o_{t - 1}, o_{t - 2}, o_{t - 2}, . . .) \to s_{k}$ . (Here and henceforth, generally k = t, but k is used to index states within the neural network's imagined view of the environment rather than observations in the environment).

Don't think right now about what $s_{k}$ stands for, or what it compresses, or what it's useful for predicting. It's an empty vessel, which we assume to have been trained via back-propagation so that it gives us all the information necessary for the other two sub-parts of the network to do their task.

I'll note in advance: one of the things that makes MuZero stand out relative to other model-based neural networks is that in many others, the network is trained so that $s_{k}$ will have information useful, for instance, for correctly predicting future observations. This is, again, what many model-based video-game playing agents do: they predict a future action-conditioned frame of video from prior frames of video. There is no loss in MuZero, however, which dictates that this be the case.

What are these other tasks, though?

Well, the second network is the dynamics network. It takes a state $s_{k}$ , a hypothetical action, and then returns another hypothetical state $s_{k + 1}$ and reward estimated for that action $r_{k + 1}$ .

If we call the dynamics network $g_{θ}$ , then you can represent it graphically thus.

G-Theta Image

So $g_{θ} (s_{k}, a_{k}) \to s_{k + 1}, r_{k + 1}$ . This function has to be executed for every action whose effect you wish to predict.

Note that frequently one would call the dynamics network with different actions, to evaluate alternative possible trajectories. In such a case, the single branch shown above would separate into multiple, different $s_{k} s$ for the same k, and look more like a tree than a single trunk.

So now we know at least one thing the representation $s_{k}$ must contain information about; it must contain enough information about the environment to let us predict future rewards, conditional on different actions.

If the network is perfectly trained, then, if you get rewards of 0, 0, and 1 after doing the actions move left, move left, and move left in the actual environment from some particular state, then you should also get imagined rewards of 0, 0, and 1 from the neural network after doing the same actions in the imagined environment from a particular state. Note though, that while doing these actions in the actual environment would also let you see new observations each frame; the neural network is not trained to predict its observations, but only to predict the reward.

Nevertheless, if the dynamics network gave us perfectly accurate results, then we could use it to plan our actions.... sorta. Here's how:

From our current state $s_{k}$ , we could look at the reward immediately following every different action we could do. We would then have many different $s_{k + 1}$ s, together with the $r_{k + 1}$ s associated with them. And we could then expand each $s_{k + 1}$ to get all possible $s_{k + 2}$ and $r_{k + 2}$ , and so on and so forth.

Obviously, this breadth-first search gets pretty unmanageable pretty quickly, but it would at least let us choose actions leading to higher reward within very shallow depth.

But this isn't terribly useful. We often want to plan to receive rewards hundreds or thousands of steps into the future, and the search space becomes completely unmanageable if we try to plan with only the dynamics function.

So we need a final network: the prediction network.

The prediction network takes the aforementioned state $s_{k}$ and outputs the results of the state-value function $V_{π} (s_{k})$ and a distribution over actions for a policy $π (s_{k})$ . Ultimately, we want the policy to be the optimal policy and the state-value function to be the correct value for the optimal policy.

If we call the prediction network $f_{θ}$ , this makes the final piece of our situation look like this:

Totality-Theta

So $f_{θ} (s_{k}) \to V (s_{k}), π (s_{k})$ . You execute this part of the neural net for every state whose value function and action you are interested in.

This can let us plan much more easily. If the network is perfectly trained, we could use either the state-value function or the policy to perform the perfect action in any situation.

The policy distribution straightforwardly gives you the best action.

The value function is scarcely harder to use. We can simply run the dynamics function once for each possible action, and then, for each state the dynamics function spits out, we compute the state value function for that state. Then we just choose the action which has the highest state value associated with it.

Now we know the three outputs of the network -- the reward, the value function, and the distribution over actions. We know what values we want each of these to take. We want the reward to take the true value that it would take in the environment, conditioned on prior states and actions. And we want the value function and distribution to take the value corresponding to the optimal policy, conditioned on the same thing.

But how can we get the right data to train this network, if we drop the assumption that it is not already entirely trained?

3.2: Monte Carlo Tree Search

One thing to note about the above is that, in an imperfectly trained neural network, the values the network described above emits are not necessarily consistent.

What do I mean by this? Well, imagine a scenario where the network has produced a particular state, $s_{k}$ , from the representation function. And imagine furthermore that the policy function gives an even distribution over the two actions available, left and right.

But if you apply the dynamics function to the network, suppose that the dynamics function predicts that you get 1000 reward in the next state if you take action left, and continue to get this reward forever after, and the dynamics function predicts that you get -1000 reward in the next state if you take action right, and continue to get that reward forever after. So the even distribution of the policy at state $s_{k}$ is inconsistent with the reward predicted by the model at the next state. The two parts of the neural network don't form a consistent picture of the world together.

At the very highest level of abstraction, Monte Carlo Tree Search as used by MuZero just uses the neural network to produce policy and state value estimates that are more internally consistent.

Due to how the network is trained, as we'll see in 3.3, the policies produced are also better, but fundamentally that's not what's going on. If you screwed up the training data for the MuZero network very badly, well, MCTS would continue to produce more consistent values for the network over time, even if the values were very wrong.

In detail, Monte-Carlo Tree Search in MuZero produces:

State value estimates $V (s_{k})$ that are more consistent with later-in-time state-value estimates, policies, and rewards. (Rewards and state-value estimates ultimately play a larger role than policies in arriving at the right estimate.)
Policy estimates $π (s_{k})$ that are also more consistent with later-in-time state-value estimates, policies, and rewards. (As in the prior parenthetical.)
(MCTS does not produce a new target for the predicted reward.)

I'm actually going to skip lightly over how MCTS does this, because it remains ~~entirely~~ mostly unchanged through MuZero and EfficientZero.

Note that Monte Carlo Tree Search differs from both the familiar depth-first-search and breadth-first search in programming. Generally speaking, it will expand the best nodes in the game tree available first, using the value function and the policy from the neural network to figure out which are best. This means that it is much, much better at handling large state spaces, and why it was used as a policy improvement operator for a game like Go with a large state space. So by skipping it I don't mean to imply that it isn't important; just that it's relatively unimportant for the changes that EfficientZero makes, and that this essay is already way too long.

(You can read a summary of how MCTS works -- with gifs! -- on DeepMind's page on MuZero.)

One further thing to note.

The family of algorithms currently called "MCTS" originally involved -- as the name "Monte Carlo" suggests -- randomness. In short, originally, the algorithm tried to evaluate whether a game-state in a board game was good or bad by looking at whether you were likely to win or lose if you assume random or semi-random moves from this state to the end of the game.

MuZero's Monte-Carlo Tree Search... does not involve such rollouts to the end of the game. It only uses the neural network to evaluate the goodness of states. There's actually no randomness involved at all. So I think it's a little misnamed.

Anyhow, given that MCTS gives you a more consistent policy and value function, there are two things you can do.

You could use the new policy distribution to choose the best action, while actually choosing what to do in a real environment. MuZero does this.
You could use the more consistent policy distribution / value function as targets for the neural network, to train it to be better. MuZero also does this. So let's now turn to MuZero's training.

3.3: MuZero's Training

Let's suppose we have an untrained neural network, with the architecture described above. And let's suppose it has experienced several episodes in our environment, and saved the trajectories from these episodes. How can we train our neural network?

Well, the trajectories through the environment are just a series of observations, rewards, and actions.

$o_{0}, r_{0}, a_{0}, o_{1}, r_{1}, a_{1}, o_{2}, r_{2}, a_{2} . . . o_{T}, r_{T}$

It turns out to be relatively straightforward to train MuZero given such a history.

We choose a timestamp $t$ in the episode, and, using the representation network, generate a state $s_{k}$ for that timestamp.

We then generate subsequent states $s_{k + n}$ using the actual actions $a_{t}$ that the agent took at corresponding time-steps.

Having done so, then we can train the reward $r_{k + n}$ to simply match the environment's actual reward at time $s_{t + n}$ .

Training the value function is only a little more complicated.

If you remember, the state value function is just the expected return beneath a policy:

$V (s_{t}) = E [R | s_{t}, π = E [r_{t + 1} + r_{t + 2} + . . . . r_{t + N} | s_{t}, π$

This means that if we gathered enough data from the environment, with the policy, we could just move the value for the state-value function towards the sum of future rewards from the environment. Over a long enough time frame, the average return from each state, given that you're using a particular policy, will definitionally converge towards the state-value function value for that state and policy.

Simply training off entire rewards like this, though, tends to be somewhat slow empirically. We can improve our performance by bootstrapping: training the value-function to move towards a target that we create by summing together a short chain of rewards with the current value function at the end of the chain.

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + V (s_{t + 4})$

Here, we move $V (s_{t})$ towards the value on the right, which is constructed out of a series of actual rewards plus the current estimate of the state-value on a future state. Even if the state-value function starts off entirely randomly, this target will incorporate experienced rewards and over time cause $V$ to converge to the right value.

Finally, the target for the policy for the neural network is simply the improved / more consistent policy distribution generated for each state by Monte Carlo Tree Search. Here we lean heavily on MCTS; while the targets for the rewards and the value function are generated with data directly from the experienced episode trajectory, the target for the policy only indirectly uses this experience via the changes to the neural network.

So over time, training looks something like this.

At first, everything the network does is random. But even from random actions, the neural network can learn to predict the rewards $r_{t}$ it encounters in the environment. The rewards in the environment also begin to move $V (s_{k})$ towards an accurate value for a random policy.
But MCTS will now return a policy distribution improved by learned values for $r_{t}$ and $V (s_{k})$ . This MCTS-improved policy is used in action selection, so the action selection is now better than random. And it is also used as a target for training the policy distribution, so the policy distribution moves towards a better-than-random policy.
This means that the targets for $V (s_{k})$ will, in turn, change again, because the state-value function is policy-specific. The better the policy, the higher the correct value for $V (s_{k})$ . The state-value function no longer predicts the state-value for a random policy, but for a better-than random policy.
This loop continues, where improved estimates for the state-value function and rewards propagate through MCTS into the policy. Altogether, these improved estimates also improve the depth to which MCTS can search (although I haven't gone into how this detail works).
Because the targets for $V (s_{k})$ and $π$ are generated with MCTS, even old historical data can be relevant, because the network essentially imagines what the right values would be if it had experienced them with it's current estimates.

Over time, this process converges on (hopefully) the optimal policy.

4: EfficientZero: What It Changes

EfficientZero makes three independent changes on top of MuZero, which when taken together push it into better-than-human performance. If you drop any one of them, median performance on the Atari 100k drops to worse-than-human. They are, nevertheless, independent, and we can discuss them separately.

(EfficientZero also makes a few other changes to MCTS and network architecture; some of the networks within it are smaller than those in MuZero, for instance. The authors of the paper think these changes don't matter too much, so let's hope that they're right.)

4.1: Self Supervised Consistency Loss

Let's start off with a problem that MuZero shares with many model-free learning methods, even though MuZero is itself model-based: namely, it doesn't begin to learn until it receives non-uniform reward values.

All of the values that MuZero's network predicts -- the one-step reward, the state-value, the policy -- relate to the rewards from the environment. On one hand, this is good -- it means that the representations learned will necessarily be suitable for predicting the state-value and the policy, unlike representations learned over the course of predicting future observations. But on the other hand, this necessitates slow learning: it means that until the agent begins to get non-uniform rewards, the agent cannot learn.

This is obviously problematic in situations where non-zero rewards are rare. And it also presumably hurts performance even when rewards are relatively common.

To solve this problem, MuZero adds a new training target for the neural network. You can approximate it by saying that the state $s_{k}$ should be predictive of $s_{k + 1}$ .

The intuition is that the way you think about the world now should include anticipations of the way that you'll think about the world in the future. This is, as I mentioned in the section on historical background, pretty much exactly the same as intuition and implementation behind "Self Predictive Representations".

If you want to be slightly more technical: the state $s_{k}$ , fed through an action conditioned transformation network, should be predictive of an (initially random) projection from $s_{k + 1}$ . The network structure used to define this target is structured like the SimSiam feature-learning network, except it learns an equivalence across temporal steps and image augmentations rather than only an equivalence across image augmentations.

Stolen once more from the EfficientZero paper:

Self-Predictive Representations

So, essentially, the above pulls a transformation of the current state, fed through an action-conditioned dynamics network, close to a transformation from the future state. The way that the network in fact thinks about $s_{k + 1}$ should be something it can anticipate from how it thinks about $s_{k}$ ; the loss for training the neural network is produced from the difference between these two transformations. This loss is fed into the backpropagation through time when the network trains, along with the other three losses native to MuZero.

If we remove this from EfficientZero, overall median performance drops from 1.16 times better than human to a pathetic 0.34 of human performance; this is by far the largest drop.

Honestly, I'm... still puzzling over why this technique works so well.

Despite the just-so stories mentioned above about how intelligence surely involves prediction, I'm dubious about my ability to retrodict these results.

Here's part of the problem, as I see it. The basic architecture here dictates that the state representation should be, when training starts, predictive of an initially random projection of future states. But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I'm not sure why one turned out to be true rather than the other.

This ambiguity of why technical changes work is a persistent problem for deep learning. Batch normalization is a ubiquitous deep learning technique, but the story about why it works in the paper where it was published currently seems false. Its hard to make further progress even in capabilities when your understanding of the situation is fuzzy, let alone in interpretability.

This is the most important part of the paper and although the actual implementation details make perfect sense, I'm still not 100% sure why it works.

4.2: Value Prefix

Imagine that a knife is dropping towards your hand, blade down, from about five feet up. In the dark, so you cannot see it.

You anticipate, under the circumstances, some pain if you don’t act. You aren't sure exactly when the pain is going to occur; you just have a rough timeframe of “in the next second or so”. Which is all that you need for acting, assuming your hand isn't trapped in place.

A lot of RL algorithms, though, try to predict the exact moment that the pain occurs. This is difficult. MuZero tries to predict this, for instance -- the model tries to predict $r_{t}$ at precisely $t$ steps into the future. This is wasting computational power, in some sense -- you don't need to know exactly when some particular reward happens. And according to the authors, the uncertainty about the exact frame where pleasure or pain occurs impacts search with the MCTS, which in MuZero strongly depends on the model's predictions of when pleasure or pain occurs.

On the intuition that anticipating the pain (or pleasure) is important, but anticipating the exact timing of the pain (or pleasure) might not be, EfficientZero changes this somewhat. Rather than predicting the exact reward for a particular state in a particular timestep, $r_{t}$ , it instead tries to predict the cumulative reward for several states over a window of timesteps, $r_{t} + . . . . r_{n}$ . This learned value the paper calls the value prefix, and it is used, rather than the reward, in MCTS.

If you drop the value prefix learning from EfficientZero, median performance drops from 1.16x to 0.55x of the human median.

4.3: Off-Policy Correction

This third improvement does less than the other two. Dropping it from the model only lowers performance from 1.16x to 0.86x human median performance. Which, I should note, is still an enormous gain in sample-efficiency.

The motivation behind this addition is nevertheless the clearest of all three, I think. I'm most confident that the story about why it works is the actual reason it works.... which is a little depressing, but here goes anyhow.

As mentioned in the description of MuZero, when training the state-value part of the neural network, we construct targets for it from a combination of actual experienced rewards and predicted state-values on a later state.

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} . . . . + V (s_{t + n})$

Here, the target for $V (s_{k})$ is constructed from the actual rewards received by the agent,when the agent was acting in the environment over the remembered trajectory from t + 1, t + 2, etc. The estimated state value of $s_{t + n}$ is then added to the summed rewards so far, as part of the standard reinforcement learning bootstrapping.

Here's the problem. Suppose that this remembered trajectory comes from early on, when the agent was stupider.

If that's the case, then the trajectory here is likely sub-optimal. The current agent wouldn't take the same actions that the past agent took. That is, the rewards above are taken from an experienced trajectory that looks like this:

$s_{t}, r_{t}, a_{t}, s_{t + 1}, a_{t + 1}, r_{t + 1}, s_{t + 2}, a_{t + 2}, r_{t + 2} . . .$

In this experienced trajectory, then, the states and rewards experienced obviously depend on the actual actions taken. Which means that the sequence of rewards is likely not the same that the agent (in its more trained state) would take, just as the final state $s_{k + n}$ might not be the final state the agent (in its more trained state) would find itself in.

If the agent were in that situation now, it would take different actions; but it's still being updated towards a target as if it would have blindly done the same thing over and over.

The fix here is remarkably simple: the older an episode is, the shorter the imagined sequence of rewards off of which you bootstrap should be.

If the trajectory that you imagine, giving you the sequence of rewards shown above, is relatively recent, then you'd still probably make the same decisions now. So you can sample from a relatively long trajectory. Your update might look like this:

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} + V (s_{t + 5})$

On the other hand, if the trajectory that you imagine is relatively old, then you'd likely make different decisions now. So you should sample from a relatively short trajectory. Your update might look like this:

$V (s_{t}) \to r_{t + 1} + V (s_{t + 2})$

Doing so increases the accuracy of the value-target when learning from older saved trajectories. MuZero, when in sample-efficient mode, saves a lot of trajectories to learn from.

5: Conclusions

I'm a fan of how EfficientZero is three independent additions to MuZero piled on top of each other. Publishing this rather than the least publishable unit is quite laudable. At least, it's laudable from the perspective where we are concerned with academic Goodharting, rather than from the perspective where we are concerned with short AGI horizons.

There are a few expectations I take away from this work, generally.

First, I expect this work to be quickly surpassed and quickly built upon.

Other papers already implicitly suggest ways to improve EfficientZero. For instance, two days after EfficientZero came out, DeepMind released a paper on generalization with self-supervised world models. While this paper focuses on generalization rather than sample efficiency, it compares what is almost exactly the self-predictive representations of EfficientZero with several other related techniques.

One of these techniques, explicit contrastive learning, performs notably better than the self-predictive representations. ~~I expect that if you swapped out SPR with constrastive learning in EfficientZero, you'd get an immediate if modest bump in sample efficiency.~~ (Edit: I'm no longer expect this, see Ankesh's comment below.)

That's a very low hanging fruit, but there are others out there, easy to imagine with just a little effort, which are only a small step away in the high-dimensional space of RL design algorithms. In some cases you just need to take another step in the same direction. Not all of these imaginable alterations will work, but some of them will. A world where one paper contributes 3 useful improvements to a top model-based learning method, and where none of those improvements involves several large steps in design-space, is probably not a world where such improvements are rare.

And the compute requirements for EfficientZero were relatively modest; the thing holding people back from attempting to take these further steps is, honestly, probably neither the need for compute, nor the ability to think of potential improvements, but the engineering talent needed for performing the experiments quickly. And potentially, the lack of interest in sample-efficiency; the ease with which we can scale up RL algorithms to use billions of frames from multiple actors on different machines has probably contributed to a lack of research in this direction.

So, I forecast significant improvements to this within the timeframe of six to twenty-four months, in private if not in public. I expect at least 25% gain in sample-efficiency towards the start of this time period, and at least a 100% gain in sample-efficiency towards the end. I also expect attempts to extend these results to the entire Atari testbed of 57 games, with some success. I also expect attempts to get these results on larger, more interesting video games.

I expect that these efforts involving larger video games will work well. A 2020 paper showed that the MCTS family of algorithms in some sense approximated policy-optimization. Policy optimization has been used for large projects such as Dota2, so I expect that further EfficientZero work will also be successful in this area.

Second, it seems extremely likely that over the next one to four years, we'll see a shift away from sample-efficiency on these single-game test-beds, and on to sample efficiency in multi-task domains.

What makes me confident about this? Well, EfficientZero is very roughly 500 times more sample-efficient than the original Atari-playing Deep-Q-Network paper of 2015, which came out six years ago. If in six more years, we increased sample efficiency by another factor of 500, then the algorithms would be able to go from complete ignorance of the world to human-equivalent scores after less than thirty seconds of experience playing the game. This seems quite probably impossible, even for a superintelligence - you need more time with many games to figure out the rules alone.

I don't think this (necessarily) means that we'll have super-intelligent artificial intelligence in six years. I think instead it means that -- after further improvements mentioned above -- we'll hit diminishing returns on sample-efficiency in Atari, just like image classification hit diminishing returns on ImageNet. The only space where there will really be room to make interesting progress will be on bigger tasks, so interesting research will focus on these bigger tasks.

So sample-efficiency will turn to more complex domains, whether they be things like Starcraft or multi-task testbeds. MuZero-based methods might have some difficulty in some of these. In multi-task domains, for instance, MuZero will have problems because it learns state representations explicitly adapted for the rewards that it is receiving. Learning more reward-neutral representations could be challenging -- although I expect the challenge to be superable.

Third, and finally, I think this work is moderate to strong evidence that even without major conceptual breakthroughs, we're nowhere near the top of possible RL performance.

To repeat myself: a world where one paper contributes 3 useful improvements to a top model-based learning method, and where none of those improvements involves several large steps in design-space, looks to me very different than a world where these improvements are rare and hard to find. It seems to me likely that there are more improvements to be found without enormous Copernican shifts, and without the development of new back-propagation-level insights.

AI ALIGNMENT FORUM
AF