This is independent research. To make further posts like this possible, please consider supporting me.

Epistemic status: This is my highly imperfect understanding of a lot of detailed thinking by Abram, Evan, John, and others.


  • This is a brief post summarizing my learnings from reading Abram’s Gradations of Inner Alignment Obstacles

  • The first section is about mesa-optimizers in general

  • The second section is about mesa-searchers in comparison to mesa-controllers

  • The third section is about compression, explicit goals, consistency, and power

  • The fourth section is about whether GPT-3 is deceptive and the connection to Ajeya Cotra’s recent proposal.

  • The fifth section is about the lottery ticket hypothesis


First, a quick bit of background on the inner optimizer paper.

When we search over a large space of algorithms for one that performs well on a task, we may find one that, if deployed, would not just perform well on the task but would also exert unexpected influence over the world we live in. For example, while designing a robot vacuum we might search over policies that process observations from the robot’s sensors and control the motor outputs, evaluating each according to how effectively it vacuums the floor. But a sufficiently exhaustive search over a sufficiently expressive hypothesis space could turn up a policy that internally builds up an understanding of the environment being vacuumed and its human inhabitants, and chooses the actions that it expects to accomplish an objective. The question is then: is this internal goal actually the goal of cleaning the floor? And more broadly: what kind of capacity does this robot vacuum have to influence the future? Will it, for example, build a sufficiently sophisticated model of houses and humans that it discovers that it can achieve its objective by subtly dissuading humans from bringing messy food items into carpeted areas at all? If so, then by searching over policies to control the motor outputs of our robot vacuum we have accidentally turned up a policy that exerts a much more general level of influence over the future than we anticipated, and we might be concerned about the direction that this method of designing robot vacuums is heading in.

So we would like to avoid unwittingly deploying mesa-optimizers into the world. We could do that either by developing a detector for mesa-optimizers so that we can check the policies we find during search, or by developing a training process that does not produce mesa-optimizers in the first place.
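To make the idea of "searching over policies" concrete, here is a minimal sketch of an outer search for the robot vacuum example. Everything in it is an assumption made for illustration: the one-dimensional corridor, the single-bit observation (is the current cell dirty?), the three actions, and the names `evaluate` and `search` are hypothetical and do not come from the inner optimizer paper.

```python
from itertools import product

ACTIONS = ("suck", "left", "right")

def evaluate(policy, dirt, steps=20):
    """Return how many cells the policy cleans, starting at the left end."""
    dirt = list(dirt)              # copy so every policy sees the same floor
    pos, cleaned = 0, 0
    for _ in range(steps):
        action = policy[dirt[pos]]  # observation: is the current cell dirty?
        if action == "suck":
            if dirt[pos]:
                dirt[pos] = False
                cleaned += 1
        elif action == "left":
            pos = max(pos - 1, 0)
        else:  # "right"
            pos = min(pos + 1, len(dirt) - 1)
    return cleaned

def search(dirt):
    """Outer search: exhaustively score every observation -> action table."""
    best_policy, best_score = None, -1
    for on_clean, on_dirty in product(ACTIONS, repeat=2):
        policy = {False: on_clean, True: on_dirty}
        score = evaluate(policy, dirt)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score

floor = [True, False, True, True, False, True]  # True marks a dirty cell
best_policy, best_score = search(floor)
print(best_policy, best_score)
```

This hypothesis space holds only nine lookup tables, so nothing surprising can turn up in it. The concern in this section is about searches over spaces expressive enough to contain policies that build internal models and pursue objectives of their own.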

Mesa-controllers and mesa-searchers

Next, a bit of background from a previous post by Abram:

Abram is concerned not just about policies that influence the future via explicit internal search, such as a policy that evaluates possible plans, evaluates the likely consequences of each one, and picks the one whose consequences match an internal goal, but also policies that exert goal-directed influence over the future in any other way. For example, a thermostat consistently brings a room to a particular temperature. It does not accomplish that by evaluating the possible actions it might take and picking the one that seems most likely to bring the room to a desired temperature. But it does in fact bring the room to a consistent temperature. In the example above, when we search over policies for our robot vacuum, we might turn up one that does not perform an explicit internal search over possible plans, but does push the state of the world towards some target state in the same way that a thermostat pushes the temperature of a room towards a target temperature. Such a system might exert a level of influence over the future that was not expected by the engineers who designed the system and thus might pose dangers.

So we might want to broaden our goal from avoiding mesa-searchers to avoiding mesa-controllers, of which mesa-searchers are a special case.
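The distinction between the two kinds of policy can be sketched in a few lines. This is a toy illustration under assumptions of my own: the world model `step`, the half-degree effect of each action, and the function names are hypothetical and are not taken from Abram's posts.

```python
def step(temp, action):
    """Toy world model: each action shifts the temperature by half a degree."""
    return temp + 0.5 * action

def controller_policy(temp, target=20.0):
    """Thermostat-style control: a fixed rule with no lookahead."""
    if temp < target:
        return 1.0   # heat
    if temp > target:
        return -1.0  # cool
    return 0.0       # hold

def searcher_policy(temp, target=20.0):
    """Search-style control: simulate each action, pick the best predicted outcome."""
    actions = (-1.0, 0.0, 1.0)
    return min(actions, key=lambda a: abs(step(temp, a) - target))

def run(policy, temp, steps=100):
    """Apply the policy repeatedly and return the final temperature."""
    for _ in range(steps):
        temp = step(temp, policy(temp))
    return temp

for start in (5.0, 20.0, 35.0):
    print(start, run(controller_policy, start), run(searcher_policy, start))
```

Both policies bring the room to the same target from every starting temperature tried here, yet only one of them contains anything resembling an internal search. Judged by their effect on the world alone, they are indistinguishable, which is the reason for widening the category from mesa-searchers to mesa-controllers.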

Compression and explicit goals

Now we come to the content of Gradations of Inner Alignment Obstacles.

The problem with avoiding all mesa-controllers, as Richard Ngo points out, is that almost any competent policy could be seen as a mesa-controller since most tasks involve pushing the world towards some state or other, so avoiding all mesa-controllers would mean avoiding all competent policies. For example, when we search over policies for our robot vacuum, we hope to find one that is efficient in its floor-vacuuming pattern but that does not build sophisticated models of houses or humans, nor exert influence in surprising ways over the world. But even a benign robot vacuum policy should still push the state of the world towards one in which the floor is clean, so could still be considered a mesa-controller. So we need a more precise definition of what kind of policies we are trying to avoid.

Abram suggests that what we really want to avoid is policies that explicitly represent their objectives. This is because policies that steer the world in different directions under different conditions are not particularly worrisome, whereas policies that consistently push the world towards the same objective from anywhere within a wide region of configuration space are powerful and therefore dangerous. And in order to push the world towards a consistent objective, Abram hypothesizes that such policies will need explicit internal representations of their objective.

So the picture that emerges seems to be:

  • Optimization leads to compression. Powerful outer-optimization processes solving complex problems within a space-constrained hypothesis class will select policies that somehow represent solutions compactly.

  • Compression could lead to explicit objectives. One way that compression might happen is by forming policies with an explicit internal representation of some objective. This includes mesa-controllers as well as mesa-searchers as a special case.

  • Explicit objectives lead to consistency. A policy with an explicit internal objective might push the world towards that objective from anywhere within a wide region of configuration space.

  • Consistency is power. This capacity for goal-directed behavior over a wide range of world configurations is the kind of behavior we should work to either detect or avoid in machine learning systems.

Deception and GPT-3

Abram then goes on to apply this framework to the question of whether GPT-3 is currently deceiving us. My understanding of his reasoning is as follows.

  • We hope that our machine learning systems are not deliberately producing the kind of mistakes they think we want to see.

  • Such behavior would be both suboptimal, in that unnecessary mistakes are being produced, and also dangerous, in that our machine learning systems are capable not just of giving the correct answer, but also of picking a convincing set of mistakes.

  • It does not matter whether these mistakes are being deliberately optimized for convincingness, which would be an instance of mesa-search, or merely produced as a by-product of the training process, which would be an instance of mesa-control, because in either case we may underestimate the power of the policies we have discovered and therefore might deploy a dangerous policy.

  • We do not currently know to what extent GPT-3 produces the kind of mistakes it thinks we want to see.

  • We do know that GPT-3 can produce very shallow mistakes that it thinks we want to see. For example, given a prompt containing spelling errors, GPT-3 will produce text that also contains spelling errors, despite having the capacity, in general, to produce text without spelling errors. Yes, this is expected given the way GPT-3 was trained -- that’s why we call this mesa-control.

  • But given prompts containing subtle reasoning errors, will GPT-3 produce text that contains further subtle reasoning errors? And is GPT-3 capable, in general, of producing text without subtle reasoning errors? We do not at present know the answers to these questions.

  • If we discovered that GPT-3 is, in fact, capable of producing text without subtle reasoning errors, and beyond that is capable of producing subtle reasoning errors when it thinks that’s what we want to see, then this would be bad news, since it would tell us that GPT-3 is more powerful and more deceptive than we thought.

  • One way we could find out whether GPT-3 is powerful and deceptive is to perform further fine-tuning of the parameters constituting the GPT-3 model, using a training signal that more directly punishes subtle reasoning errors than the one GPT-3 was originally trained on. If this greatly improved the quality of the text produced by GPT-3 given only a small amount of fine-tuning, then we might reason that GPT-3 was capable of producing high quality text all along but was deliberately producing the kind of mistakes it thought we wanted from it.

  • Abram conjectures that Ajeya Cotra’s recent proposal to fund work to align narrowly superhuman models such as GPT-3 might involve work that generalizes this kind of inquiry.

  • Abram points out that positive results in this direction would give us valuable but disheartening news, since it would tell us that our models are already more powerful and more deceptive than we previously knew.

The lottery ticket hypothesis

The lottery ticket hypothesis, as originally articulated by Frankle and Carbin, is as follows:

Dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

That is, we can think of a dense neural network as a weighted combination of a huge number of sparse subnetworks. When we randomly initialize our dense neural network at the beginning of training, we are also randomly initializing each of this huge number of sparse subnetworks. The lottery ticket hypothesis says that there is some as-yet-unknown property of the initialization of sparse subnetworks that makes them trainable, and that by initializing this huge number of sparse subnetworks each with independent random initialization we are buying a huge number of lottery tickets, hoping to get at least one that has this trainability property. If this hypothesis is true then we might view the entire training process as essentially being about identifying which subnetworks have this trainability property and training them, while down-weighting all the other subnetworks.
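The mechanics of "training a subnetwork in isolation" can be sketched with one-shot magnitude pruning: train the dense model, keep the largest-magnitude weights, reset the survivors to their initial values, and retrain only those. The sketch below makes strong simplifying assumptions that are mine and not Frankle and Carbin's: it uses a linear model on a synthetic regression task instead of a feed-forward network, and the names `train`, `mse`, and `ticket_mask` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (an assumption): only 3 of 20 input features matter.
n_features = 20
true_w = np.zeros(n_features)
true_w[[2, 7, 11]] = [3.0, -2.0, 1.5]
X = rng.normal(size=(200, n_features))
y = X @ true_w

def train(w_init, mask, steps=500, lr=0.05):
    """Gradient descent on mean squared error; masked-out weights stay at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad * mask
    return w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_init = rng.normal(scale=0.1, size=n_features)  # the random "lottery draw"

# 1. Train the dense model.
w_dense = train(w_init, np.ones(n_features))

# 2. Keep the 3 largest-magnitude trained weights and prune the rest.
keep = np.argsort(np.abs(w_dense))[-3:]
ticket_mask = np.zeros(n_features)
ticket_mask[keep] = 1.0

# 3. Reset the survivors to their initial values and retrain them in isolation.
w_ticket = train(w_init, ticket_mask)

print("kept weights:", sorted(keep.tolist()))
print("dense loss:", mse(w_dense), "ticket loss:", mse(w_ticket))
```

Because this toy problem is convex, the initial values barely matter here, so the sketch shows only the procedure for extracting and retraining a sparse subnetwork. It does not demonstrate the hypothesis itself, which is a claim about initialization in non-convex networks.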

Abram points out that this is concerning from the perspective of deception and mesa-optimizers since there may be deceptive subnetworks within any randomly initialized dense network, as suggested by the feasibility of data poisoning backdoor attacks. This is bad news for approaches to avoiding deception that work by not introducing deception into networks during training, since not introducing deception during training won’t be enough if there are already deceptive subnetworks present at the beginning of training.


I hope that this attempt at summarization sheds more light than confusion on the topics of inner optimization, agency, and deception. The basic situation here, as I see it, is that we are looking at machine learning systems that are conducting ever more powerful searches over ever more expressive hypothesis spaces and asking:

  • What dangers might turn up in the policies output by these search processes? (This is the conversation about power, alignment, and deception.)

  • What are the specific characteristics of a policy that make it dangerous? (This is the conversation about mesa-controllers and mesa-searchers.)

  • Do contemporary machine learning models already exhibit these characteristics? (This is the conversation about deception in GPT-3.)

  • From where do these dangerous characteristics originate? (This is the conversation about the lottery ticket hypothesis.)

4 comments

I think Abram's concern about the lottery ticket hypothesis wasn't about the "vanilla" LTH that you discuss, but rather the scarier "tangent space hypothesis." See this comment thread.

Thank you for the pointer. Why is the tangent space hypothesis version of the LTH scarier?

Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception.

I'm working on a much more fleshed out and expanded version of this argument right now.

Yeah right, that is scarier. Looking forward to reading your argument, esp re why we would expect deceptive agents that score well to outnumber aligned agents that score well.

Although in the same sense we could say that a rock “contains” many deceptive agents, since if we viewed the rock as a giant mixture of computations then we would surely find some that implement deceptive agents.