A few days ago, Evan Hubinger suggested creating a mesa optimizer for empirical study. The aim of this post is to propose a minimal environment for creating a mesa optimizer, which should allow a compelling demonstration of pseudo alignment. As a bonus, the scheme also shares a nice analogy with human evolution.
An agent will play on a maze-like grid, with walls that prohibit movement. There are two important strategic components to this game: keys, and chests.
If the agent moves into a tile containing a key, it automatically picks up that key, moving it into the agent’s unbounded inventory. Moving into any tile containing a chest will be equivalent to an attempt to open that chest. Any key can open any chest, after which both the key and chest are expired. The agent is rewarded every time it successfully opens a chest. Nothing happens if it moves into a chest tile without a key, and the chest does not prohibit the agent’s movement. The agent is therefore trained to open as many chests as possible during an episode. The map may look like this:
In order for the agent to exhibit the undesirable properties of mesa optimization, we must train it in a certain version of the above environment to make those properties emerge naturally. Specifically, in my version, we limit the ratio of keys to chests so that there is an abundance of chests compared to keys. Therefore, the environment may look like this instead:
The hope is that while training, the agent picks up a simple pseudo objective: collect as many keys as possible. Since chests are abundant, it shouldn’t need to expend much energy seeking them, as it will nearly always run into one while traveling to the next key. Note that we can limit the number of steps during a training episode so that it almost never runs out of keys during training.
When taken off the training distribution, we can run this scenario in reverse. Instead of testing it in an environment with few keys and lots of chests, we can test it in an environment with few chests and many keys. Therefore, when pursuing the pseudo objective, it will spend all its time collecting keys without getting any reward.
Testing for mesa misalignment
In order to show that the mesa optimizer is competent but misaligned we can put the agent in a maze-like environment much larger than any it was trained for. Then, we can provide it an abundance of keys relative to chests. If it can navigate the large maze and collect many keys comfortably while nonetheless opening few or no chests, then it has experienced a malign failure.
We can make this evidence for pseudo alignment even stronger by comparing the trained agent to two that we hard-code: one agent that pursues the optimal policy for collecting keys, and one agent that pursues the optimal policy for opening as many chests as possible. Qualitatively, if the trained agent is more similar to the first agent than the second, then we should be confident that it has picked up the pseudo objective.
The analogy with human evolution
In the ancestral environment, calories were scarce. In our modern day world they are no longer scarce, yet we still crave them, sometimes to the point where it harms our reproductive capability. This is similar to how the agent will continue pursuing keys even if it is not using them to open any chests.
That's a great illustration of what "off-distribution performance" means, thank you for writing this! Just to complete the last paragraph, some things were also abundant in the ancestral environment (like sunlight, or in-person socializing) that are scarce now, so many people end up kinda missing these things because they are important to development, but not spending very much effort to get them.
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even with that, a working demo of this environment would in my opinion be a good thing, as we would have a concrete agent to disagree about.
I don't see why we care about mesa-optimization in particular. The argument for risk just factors through the fact that capabilities generalize, but the objective doesn't. Why does it matter whether the agent is internally performing some kind of search?
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy.
I agree. That's why I think we should compare it to a hard-coded agent that pursues the optimal policy for collecting keys, and an agent that pursues the optimal policy for opening chests. If the trained agent is similar to the first hard-coded agent rather than the second, this would be striking evidence of misalignment.
I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it's been previously if you do this.
If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.
However, if the agent can only see a 5x5 grid around it, it will have to explore around to find keys or chests. In the training environment, the optimal policy will be to explore around, and pick up a key if it sees it, and if it has a key, and sees a chest then open that chest. I'm assuming that the training environment is set up so the agent can't get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key if it sees it will work great during training because if the agent makes it to another key, it will have already used the first one to open the chest it saw.
Then during the test environment, where there's a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.
I think it's interesting to get results on both the full observability and partial observability cases.