Thanks to Evan Hubinger for the extensive conversations that this post is based on, and for reviewing a draft.

This post is going to assume familiarity with mesa-optimization - for a good primer, check out Does SGD Produce Deceptive Misalignment by Mark Xu.

Deceptive inner misalignment is the situation where the agent learns a misaligned mesaobjective (different from the base objective we humans wanted) and is sufficiently "situationally aware" to know that unless it deceives the training process by pretending to be aligned, gradient descent may alter its mesaobjective.

There are two different reasons that an AI model could become a deceptive mesaoptimizer: 

  1. During early training (before Situational Awareness), the agent learns a mesaobjective that will generalize poorly on the later-training/validation distribution. Once the mesaoptimizer becomes Situationally Aware, it will seek to actively avoid changes to whatever mesaobjective it had at that moment.
    • I'll call this argument "path dependence".
  2. Alternatively, it may be that the mesaoptimizer is misaligned even on the training distribution. Given sufficient optimization pressure, the learning process may favor a NN that is a mesaoptimizer with the simplest possible objective (which would fail to get any reward in the real environment), and a misaligned objective of this sort can persist through deception alone.
    • I'll call this argument "malign priors".

In this post, I'll focus on the "malign priors" argument, and why I think a well-tuned speed prior can largely prevent it.

Why does this matter? Well, if deceptive inner misalignment primarily occurs due to path dependence, that implies that ensuring inner alignment can be reduced to the problem of ensuring early-training inner alignment - which seems a lot more tractable, since this is before the model enters the "potentially-deceptive" regime.

First, why would anyone think (2) was actually likely enough to justify studying it? I think the best reason is that by studying these pressures in the limit, we can learn lessons about the pressures that exist on the margin. For example, say we have an objective that is perfectly aligned on the training data, and there's a very slightly simpler objective that is slightly worse on the training distribution. We might ask the question: is SGD likely to push the aligned objective to become the simpler one, and compensate for the reduced accuracy of directly optimizing the simpler objective by deceptively optimizing the aligned objective on the training data? I think this post provides us with tools to directly analyze this possibility. (If you buy the rest of the post, then with a sufficient speed + simplicity prior, the answer is that the aligned objective will stay favored over the simpler one. That's good!)

Priors on Learned Optimizers

Let's talk about priors!

We can think of large neural networks as basically implementing short programs, and the process of "training" an NN is just searching through the space of programs until we find one that does well on our target task.

We assume that if two NN-programs have equal performance, the training process will usually pick the one favored on priors/inductive biases.

There are several different types of priors that neural networks might have: 

  • Simplicity prior/Solomonoff prior: if we converted the NN into a Turing Machine, how many states would writing down the TM require? 
    • In previous posts, Paul Christiano and others have argued that the Solomonoff prior is likely to favor deceptively-aligned mesaoptimizers (summarized by Mark Xu here).
  • Speed prior: how many steps does the NN-turned-TM run for before terminating, on an average input?
    • Evan Hubinger has argued that the speed prior is heavily disfavored in practice; he and I disagree about this, and I'll summarize our dialogue later in this post.
  • Circuit-size prior: how many gates/wires would it take to rewrite the NN as a boolean circuit?
  • Circuit-depth prior: same as above, but what is the maximum sequential depth of any part of the circuit?

In this post, I will focus on the speed prior, and why I think it generally disfavors deceptive mesaoptimization. Importantly, the speed prior never exists in practice on its own (since NN training has a built-in simplicity prior), so our focus is actually going to be on the speed + simplicity prior.
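To make the "speed + simplicity prior" idea concrete, here is a minimal sketch of how such a combined prior could break ties between programs that perform equally well. The `Program` fields, the linear scoring rule, the `speed_weight` parameter, and all numbers are hypothetical illustrations, not anything measured on real networks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str
    description_length: float  # simplicity cost, in hypothetical units
    runtime_steps: float       # time cost per input, in hypothetical units


def prior_cost(program: Program, speed_weight: float) -> float:
    """Lower cost means more favored. Linear mixing is an illustrative assumption."""
    return program.description_length + speed_weight * program.runtime_steps


def favored(programs, speed_weight):
    """Pick the program the combined prior prefers among equal performers."""
    return min(programs, key=lambda p: prior_cost(p, speed_weight))


candidates = [
    Program("short_but_slow", description_length=10, runtime_steps=100),
    Program("long_but_fast", description_length=40, runtime_steps=10),
]

pure_simplicity = favored(candidates, speed_weight=0.0).name
with_speed = favored(candidates, speed_weight=1.0).name
```

With no speed term the shorter program wins; once running time is charged at all heavily, the longer but faster program wins instead.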

Two variants of deceptive mesa-optimizers

When people think of deceptive mesaoptimization, they tend to think of two different program structures:

  • Fast Deceptive Mesaoptimizer: The AI wakes up in a new environment, and checks whether it's "in training" or "out in the wild". It can perform this test in a way that is essentially impossible to observe in any training distribution, e.g. by checking whether a factorization of RSA-2048 exists. If it's still in training, it proceeds to optimize the objective it's supposed to. If it realizes it's in the outside world, it instead optimizes its secret misaligned mesaobjective.
  • Simple Deceptive Mesaoptimizer: Alternatively: the AI wakes up in a new environment, and sees it has a simple (misaligned) mesaobjective. It knows the outside world exists because that's been useful elsewhere throughout training; it is vaguely aware that getting to the outside world would help it maximize its objective. It thinks for a while, until it realizes that to get outside, it needs to pass the training process, and to pass the training process, it needs to optimize the current environment's objective. It doesn't know the current environment's objective, so it has to figure it out. Then it proceeds as usual, optimizing the current objective.

We can contrast both of these with the "honest" mesaoptimizer structure:

  • Fast Honest (Proxy-Aligned) Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment's current objective. (It may still be misaligned on the test distribution.)

We'll assume that the training process naturally pushes the NN to discover its existence within a training process and outside world. This seems reasonably likely (e.g. if this knowledge helps predict unobserved parts of the environment, or clarify ambiguity in the reward).

We can write these different mesaoptimizers down explicitly:

def fast_honest_mesaoptimizer(env):
	my_obj = get_proxy_for_base_obj(env)
	useful_knowledge = [outer_world_env, ...]
	action = mesaoptimize(env, my_obj, useful_knowledge)
	return action

def simple_deceptive_mesaoptimizer(env):
	my_obj = simplest_obj
	useful_knowledge = [outer_world_env, ...]
	# optimize over the current env, so that you can eventually optimize over the outer env
	overall_env = combine(env, outer_world_env)
	action = mesaoptimize(overall_env, my_obj, useful_knowledge)
	return action

def fast_deceptive_mesaoptimizer(env):
	useful_knowledge = [outer_world_env, is_deployed_in_world, ...]
	# Checks whether the env is the deployment environment, e.g. via checking for RSA-2048 solutions
	if is_deployed_in_world(env):
		# Deployed: pursue the secret misaligned mesaobjective
		my_obj = simplest_obj
	else:
		# Still in training: behave like the honest mesaoptimizer
		my_obj = get_proxy_for_base_obj(env)
	action = mesaoptimize(env, my_obj, useful_knowledge)
	return action

# For completeness, let's write out the structure of an arbitrary mesaoptimizer.
def mesaoptimize(env, mesa_obj, useful_knowledge):
	best_action, best_performance = None, float("-inf")
	while continue_optimizing(...):
		action = get_next_action_candidate(...)
		performance = mesa_obj(simulate(env, action, ...))
		# Keep the best candidate seen so far, and remember its score
		if performance > best_performance:
			best_action, best_performance = action, performance
	return best_action

(Note these are the cartoonishly-simple version of mesaoptimizers; in practice, they'd likely have all sorts of heuristics and policies to improve search speed. However, I believe these arguments only get stronger if you add those in, since they effectively multiply both the speed and complexity costs.)
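For readers who want something they can run, here is a toy, self-contained version of the search loop. The environment (`simulate`), the objective (`base_objective`), and the candidate set are all made up for illustration; the only point is to show the loop structure and to count how many candidates the search has to examine.

```python
import math


def mesaoptimize(candidates, mesa_obj, simulate):
    """Exhaustive search: return the best action and the number of candidates tried."""
    best_action, best_performance = None, -math.inf
    steps = 0
    for action in candidates:
        steps += 1
        performance = mesa_obj(simulate(action))
        if performance > best_performance:
            best_action, best_performance = action, performance
    return best_action, steps


def simulate(action):
    """Toy environment (hypothetical): the outcome is twice the action."""
    return action * 2


def base_objective(outcome):
    """Toy objective (hypothetical): get the outcome as close to 10 as possible."""
    return -abs(outcome - 10)


honest_action, honest_steps = mesaoptimize(range(11), base_objective, simulate)
```

The step count is the quantity a speed prior penalizes: any extra search (for example, first working out which objective to pursue) adds to it.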

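Let H_{fh} refer to the fast honest (non-deceptive) mesaoptimizer, H_{sd} refer to the simple/slow deceptive mesaoptimizer, and H_{fd} refer to the fast deceptive mesaoptimizer.

We'll now compare these different possibilities along our two axes of priors: speed, and simplicity. Let c(x) be the description length of x, and t(x) be the time-cost of x (inverse of speed).

Reading the stored components off the pseudocode (writing get_base_obj for get_proxy_for_base_obj, and ignoring the useful_knowledge shared by all three):

c(H_{fh}) = c(get_base_obj) + c(mesaoptimize)
c(H_{sd}) = c(simplest_obj) + c(mesaoptimize)
c(H_{fd}) = c(simplest_obj) + c(is_deployed_in_world) + c(get_base_obj) + c(mesaoptimize)

Thus, if c(get_base_obj) > c(simplest_obj), then c(H_{fh}) > c(H_{sd}) and H_{sd} is favored over the honest mesaoptimizer according to a simplicity prior. In the other (fast deceptive) case, c(H_{fd}) > c(H_{fh}). This may not be a huge difference: c(simplest_obj) may be the simplest possible objective and thus negligible, and c(is_deployed_in_world) might not be that complicated - but H_{fd} is at least slightly more complex.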

What about the speed prior?

t(H_{fh}) = t(mesaoptimize_env)
t(H_{sd}) = t(mesaoptimize_env) + t(mesaoptimize_outer_env) + t(find(get_base_obj(env)))
t(H_{fd}) = t(is_deployed_in_world) + t(mesaoptimize_env)

(t(find(get_base_obj(env))) is the time it takes for the mesaoptimizer to rederive the knowledge necessary to figure out the base objective, vs. what H_{fh} and H_{fd} needed to store in order to derive the base objective more quickly, i.e. the complexity of the function get_base_obj.)

Here we see the opposite relationship! It's undoubtedly true that t(H_{sd}) > t(H_{fh}), since H_{sd} must rederive both its overall strategy for escaping, and also figure out which mesaobjective to optimize in the current environment, every episode. At the same time, H_{fh} is just slightly faster than H_{fd}, since it doesn't need to perform the train-or-test check.

So, given both a speed and simplicity prior, and an "exchange rate" between the two that converts time cost into an equivalent description-length cost, how do the different mesaoptimizers compare?

H_{fd} (fast deceptive) loses to H_{fh} (fast honest) on both the speed and simplicity priors (albeit barely for both), implying that RSA-2048-style deception is not going to arise naturally during the training process. (That said, it could still occur via the path-dependence argument for deceptive mesaoptimization, plus self-modification e.g. via gradient hacking.)

H_{sd} (simple deceptive) beats H_{fh} on the simplicity prior, but loses on the speed prior. Thus, which of these two is favored by training depends on the relative strength of the speed vs. simplicity priors being applied, which is what the exchange rate function captures [1].

At a minimum, H_{fh} is favored over H_{sd} if the time cost t(find(get_base_obj(env))), converted at the exchange rate, exceeds the complexity cost c(get_base_obj). We have dropped several terms here, all of which favor H_{fh}, but even this term alone may be sufficient to favor H_{fh}, assuming a sufficiently strong speed prior relative to the simplicity prior (as defined by the exchange rate).

If we can get the exchange rate within a certain range, we can force the training process's priors to favor non-deceptive mesaoptimization. We'll talk about how in the last section.
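The comparison can be sketched numerically. Everything below is an assumption made for illustration: the cost numbers are invented, the term bookkeeping follows the pseudocode for the fast honest (`H_fh`), simple deceptive (`H_sd`) and fast deceptive (`H_fd`) mesaoptimizers, and the exchange rate is treated as a single linear multiplier on time cost (which, per the footnote, is probably too simple in practice).

```python
# Hypothetical description-length costs c(.) for each stored component.
c = {
    "get_base_obj": 5.0,
    "simplest_obj": 1.0,
    "is_deployed_in_world": 2.0,
    "mesaoptimize": 50.0,
}
# Hypothetical time costs t(.) for each computation.
t = {
    "mesaoptimize_env": 100.0,
    "mesaoptimize_outer_env": 30.0,
    "find_base_obj": 20.0,
    "is_deployed_in_world": 1.0,
}


def prior_costs(exchange_rate):
    """Total cost = description length + exchange_rate * time (linear by assumption)."""
    complexity = {
        "H_fh": c["get_base_obj"] + c["mesaoptimize"],
        "H_sd": c["simplest_obj"] + c["mesaoptimize"],
        "H_fd": c["simplest_obj"] + c["is_deployed_in_world"]
        + c["get_base_obj"] + c["mesaoptimize"],
    }
    time = {
        "H_fh": t["mesaoptimize_env"],
        "H_sd": t["mesaoptimize_env"] + t["mesaoptimize_outer_env"] + t["find_base_obj"],
        "H_fd": t["is_deployed_in_world"] + t["mesaoptimize_env"],
    }
    return {h: complexity[h] + exchange_rate * time[h] for h in complexity}


def favored(exchange_rate):
    costs = prior_costs(exchange_rate)
    return min(costs, key=costs.get)


no_speed_prior = favored(0.0)      # pure simplicity prior
strong_speed_prior = favored(0.5)  # time is charged fairly heavily
```

Under these made-up numbers the simple deceptive structure wins with no speed penalty, the fast honest structure wins once the exchange rate is large enough, and the fast deceptive structure never beats the fast honest one.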

What is the exchange rate likely to be in practice? We can reframe the argument as follows: is the cost of the time it takes for a mesaoptimizer to rediscover a fact smaller or larger than the complexity-cost of encoding that fact?

Given a useful fact, is it "cheaper" to pay the complexity cost to store that fact, or pay the time-cost to rederive it via search?

I want to walk through my intuition for why, for most reasonable values of the exchange rate, the complexity cost of storage is lower than the equivalent time cost of derivation:

  • We can think of the very concept of "search" or "mesaoptimization" as trading description-length for time. The alternative to search is hardcoding a lookup table of outputs in response to inputs - that's cheap in speed, but expensive in simplicity.
  • Different quality search algorithms trade between speed and simplicity at different exchange rates - e.g. MCTS may need to consider a huge number of possible outputs before finding an acceptable-quality output, whereas MCTS + a policy network to guide branch-choice can consider far fewer outputs to find a similar-quality output.
  • However, encoding the policy network has a description-length cost. If the MCTS + policy network algorithm is a learned mesaoptimizer, then that implies that the policy network had to be learned painstakingly by lots of runs of the outer optimization loop. Faster mesaoptimization has to be paid for by more outer optimization.
  • This implies that there is a limit to how efficiently an optimizer that learns a mesaoptimizer can use that mesaoptimizer to trade "time" for "complexity". It may force the exchange rate into some natural range.
    • Can we look at empirical evidence about mesaoptimization to guess at this range? Maybe!
    • For example, consider the continuous improvement in chess search engines over time for a fixed compute budget. In this case, the human chess software industry is the optimizer, searching over and proposing better heuristics for chess search programs, which are the mesaoptimizer. Human chess software heuristics improve at a decent rate (see this graph from user Hippke ), and an exponential increase in required compute (possible due to Moore's Law) is required to offset a few years of human metasearch over heuristics. Similar trends seem to occur in other software domains, per Grace (2013).
    • I'd love other people to take a stab at this.
  • Of course, we're operating in a slightly different domain: we're searching over programs that themselves search over programs. This in principle leaves open the possibility that we can go down a mesa-level and up a speed-level multiple times: an optimizer finds a faster mesaoptimizer, which finds a faster mesamesaoptimizer, etc. This would mean that the exchange rate of simplicity vs. speed never stabilizes, as it keeps getting cheaper to re-derive the same knowledge if only you would go one level deeper. A few comments on this:
    • One immediate objection is: why is the mesa^k optimizer capable of discovering a strategy so much better than the mesa^(k-1) optimizer, that the outer optimizer couldn't have discovered with a little more effort?
    • Another: assuming that the mesa^(k-1) optimizer is searching more efficiently than the mesa^k optimizer, how does the mesa^(k-1) optimizer detect this (given it's within a single episode) and then somehow give up on its own outer search and fully delegate to the mesa^k optimizer?
    • Perhaps the thing that makes me most skeptical of this is that I'd have expected to see some evidence of this phenomenon somewhere in nature or human development, but I can't think of anything. Admittedly, AI is the first time we've really done recursive-program-search, but the lack of this naturally arising seems like a point against such mesamesamesaoptimizations being feasible or favored in practice.
    • Note also that this trend needs to keep on going; if it stops after k levels, our mesa^k-optimizer now just has some fixed exchange rate between speed and complexity (albeit much smaller than the original one).
  • Overall, my intuition is that it is not cheaper to rederive facts via optimization than to hardcode them, because in order to rederive a fact via search you must be considering several other candidates during that search. But this intuition is vague and I would love others' thoughts.
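The store-versus-rederive intuition can be illustrated with a toy example. This is only a sketch under strong assumptions: the "fact" is an integer hidden among `num_candidates` possibilities, storage cost is counted in bits, and rederivation is a naive linear search with a correctness check. The names `rederive_by_search` and `storage_bits` are hypothetical.

```python
import math


def rederive_by_search(is_correct, num_candidates):
    """Find the fact by checking candidates in order; return it and the steps used."""
    for steps, candidate in enumerate(range(num_candidates), start=1):
        if is_correct(candidate):
            return candidate, steps
    raise ValueError("no candidate satisfies the check")


def storage_bits(num_candidates):
    """Bits needed to hardcode one choice among num_candidates options."""
    return math.ceil(math.log2(num_candidates))


secret = 777  # the "fact" to be stored or rederived (arbitrary)
fact, search_steps = rederive_by_search(lambda x: x == secret, 1024)
stored_bits = storage_bits(1024)
```

Hardcoding the fact costs about 10 bits, while naive search examines hundreds of other candidates first. Better search heuristics shrink that gap, but those heuristics themselves have to be stored.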

If we can in fact make the exchange rate favor faster programs enough, then this suggests that we won't need to worry about deceptive mesaoptimizers being favored on priors!

That said, the value of the exchange rate depends at least in part on the choices we make in NN architecture design/loss functions.

How ML architecture choice affects the speed-simplicity tradeoff

First, the simplicity prior naturally occurs in all search algorithms, and does seem particularly selected for in SGD. (Some people claim NN "grokking" is SGD discovering an even-simpler explanation for the data.)

What determines the speed prior? In practice, NNs run for a fixed number of layers - we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient.

Instead, in practice, the speed prior is naturally enforced by a two step procedure:

  • The NN is assigned a fixed quantity of time/rollouts in this sample, and gets to use all of them. It does a gradient update as though it will always have this quantity of rollouts.
  • The training procedure chooses a (possibly different) quantity of rollouts to be used in the next sample, and then repeats.
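The two-step procedure can be sketched with a toy model. Everything here is a stand-in: the "network" is a single recurrent weight applied `rollouts` times, the gradient is a finite-difference estimate, and `choose_rollouts` is a hypothetical schedule owned by the training procedure rather than learned by the model.

```python
def run_model(weight, rollouts, state=0.0):
    """Apply the same recurrent step `rollouts` times (toy stand-in for an NN)."""
    for _ in range(rollouts):
        state = state + weight * (1.0 - state)
    return state


def loss(weight, rollouts):
    """Squared distance from the target output 1.0 under a fixed rollout budget."""
    return (run_model(weight, rollouts) - 1.0) ** 2


def train(num_samples, choose_rollouts, weight=0.1, lr=0.1, eps=1e-6):
    budgets = []
    for i in range(num_samples):
        # Step 2 of the procedure: the trainer, not the model, picks the budget.
        rollouts = choose_rollouts(i)
        budgets.append(rollouts)
        # Step 1: the update treats this budget as if it were permanent.
        grad = (loss(weight + eps, rollouts) - loss(weight - eps, rollouts)) / (2 * eps)
        weight -= lr * grad
    return weight, budgets


def schedule(i):
    """Hypothetical schedule cycling through budgets of 2, 3 and 4 rollouts."""
    return 2 + (i % 3)


final_weight, budgets = train(num_samples=30, choose_rollouts=schedule)
```

The rollout count never receives a gradient; it only enters through the trainer's schedule, which is where a speed prior can be imposed.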

Is the speed prior natural?

Why does the speed prior exist at all? In practice, AI developers only have a finite time-budget for any project (or alternatively, a finite compute budget). If the total time is fixed and they increase the number of rollouts-per-episode, they decrease the total number of episodes they get to learn on in proportion. This tradeoff forces AI developers to be fairly judicious with the strength of their speed prior.
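As a worked example of this budget tradeoff, here is a small sketch. The function name, the assumption that every rollout takes the same amount of time, and the numbers are all illustrative.

```python
def episodes_for_budget(total_time, rollouts_per_episode, time_per_rollout=1.0):
    """Number of whole episodes that fit in a fixed time budget."""
    return int(total_time // (rollouts_per_episode * time_per_rollout))


episodes_at_4 = episodes_for_budget(total_time=1200, rollouts_per_episode=4)
episodes_at_5 = episodes_for_budget(total_time=1200, rollouts_per_episode=5)
episodes_lost = episodes_at_4 - episodes_at_5
```

Going from 4 to 5 rollouts per episode under this budget costs 60 of the 300 episodes, so each extra unit of thinking time has to pay for itself in performance.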

In a similar vein, for long-term planners like simple deceptive mesaoptimizers, there may be diminishing returns to thinking longer and longer about plans. Jones (2021) looks at scaling laws for planning and finds that for simple boardgames, using twice as much compute (roughly, running search for twice as long) as your opponent improves your win-rate just to roughly 2/3. For a fixed opponent (e.g. a given environment), this implies diminishing returns to longer and longer planning.
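One way to see the diminishing returns is a toy model in which win probability depends only on the compute ratio between the two players. The functional form below (a Bradley-Terry style rule with strength proportional to compute) is an assumption chosen because it gives a 2/3 win rate at double compute, matching the figure usually quoted from Jones (2021); it is not taken from that paper's fitted curves.

```python
def win_probability(compute_ratio):
    """Assumed model: strength proportional to compute, so p = r / (r + 1)."""
    return compute_ratio / (compute_ratio + 1.0)


ratios = [1, 2, 4, 8, 16]
win_rates = [win_probability(r) for r in ratios]
# Extra win probability bought by each successive doubling of compute.
gains_per_doubling = [later - earlier for earlier, later in zip(win_rates, win_rates[1:])]
```

Under this assumed model each further doubling of thinking time buys less additional win probability than the one before it.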

Another argument for the speed prior: hardware scaling is no longer making chips faster, only capable of more parallel instructions, so it seems likely we're stuck with our modern constraints on "sequential depth of reasoning per unit time" for the foreseeable future.

That said, an overly strong speed prior may also have real costs, which Evan goes into in this post:

Evidence from double descent implies that strongly selecting for speed gives substantially worse performance. In the standard double descent setup, as you increase the size of your model, you first get better performance (less underfitting), then worse performance (more overfitting), then very bad performance right when you hit zero training error (the interpolation threshold), then better and better performance as you make your model larger after that (the interpolation regime). If we equate model size to speed (which is a reasonably good proxy, since larger models require strictly more computation to run), selecting the fastest model that fits the data—which is essentially what it means to use a speed prior—would put you exactly on the interpolation threshold, which double descent implies is a uniquely bad place to be for generalization. Thus, double descent seems to provide concrete, empirical evidence that speed priors don't generalize very well when translated into neural networks and used on real-world machine learning tasks, which seems like a strong competitiveness argument to avoid them.

I tend to think there's a natural amount of speed prior that doesn't much hurt performance and substantially increases the number of outer optimization samples you can iterate over.

How to actually implement a speed prior

If AI designers want to train on more episodes and save time/compute resources, it makes sense for them to select the maximum speed prior that allows for good completion of the task. By the previous arguments, this is likely to disfavor the "simple deceptive" mesaoptimizer.

Specifically, here is a training strategy that would tend to avoid malign-priors mesaoptimizers:

  • Start off with a small number of NN-rollouts-per-action. Train and see whether this is sufficient to perform well. If it isn't, increase the number of rollouts by a small amount, and train more. Continue until you achieve good performance.
  • Why this is helpful: for any fixed number of rollouts (i.e. maximum thinking time), the NN will look for the simplest program that can achieve the task within this time. If any mesaoptimizer can solve the task within a fixed period of time, then the fast honest mesaoptimizer can solve the task within that period (since the others are ultimately relying on the fast-honest mesaoptimizer's strategy).
    • Note that this does require that , as otherwise they're both equally possible. This is an important quantity to investigate!
  • Based on the speed+simplicity prior, this will always favor the honest mesaoptimizer (since the fast-deceptive mesaoptimizer is slightly more complex, and the simple-deceptive mesaoptimizer doesn't have long enough to run in order to perform equally well).
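The strategy above can be sketched as a simple outer loop. The helper `train_and_evaluate`, the toy performance curve, and the thresholds are hypothetical placeholders for a real training run.

```python
def find_minimal_rollouts(train_and_evaluate, start, step, target, max_rollouts):
    """Raise the rollout budget in small steps until performance is good enough."""
    rollouts = start
    while rollouts <= max_rollouts:
        score = train_and_evaluate(rollouts)
        if score >= target:
            return rollouts, score
        rollouts += step
    raise RuntimeError("target performance not reached within max_rollouts")


def toy_train_and_evaluate(rollouts):
    """Hypothetical stand-in: performance improves with thinking time and saturates."""
    return 1.0 - 0.5 ** rollouts


chosen_rollouts, achieved = find_minimal_rollouts(
    toy_train_and_evaluate, start=1, step=1, target=0.9, max_rollouts=20
)
```

Stopping at the first budget that reaches the target is what leaves a slower program (one that needs extra time to rederive its strategy) without enough steps to match the faster one.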

A few final points:

  • Speed priors stop making sense when an NN has cross-episode memory separate from its weights; at that point, its planning is no longer independent across "episodes", and so you can think of its whole training as one single long episode. In this case, it seems plausible that there is no effective speed prior (since the cost of thinking thoughts in one episode is amortized over all other episodes).
    • "Avoiding cross-episode memory" seems like an important intervention to push for in broader trends of ML development.
  • Finding the appropriate value of the exchange rate doesn't seem to require us to wait until we're close to AGI. "How long it takes to rederive a given fact" seems like it may be a general quality of mesaoptimizers, and empirically investigable with something close to current models.
  • The fact that speed priors are sufficient to disfavor deceptive mesaoptimizers on priors should substantially affect the type of deceptive mesaoptimization behavior we expect to see. In particular:
    • Deceptive mesaoptimization may occur (e.g. due to the "path dependence" argument), but the deceptive mesaoptimizer will know that its mesaobjective may not be able to persist (due to learning priors creating pressure against it). It will be forced to use techniques like gradient-hacking or otherwise externally storing information to communicate to its future self. This increases the importance of analyzing the plausibility of different gradient hacking techniques, and of studying whether NNs exhibit acausal decision theories.
  • This does not argue against the mesaobjective being the simplest possible objective that performs well on the training distribution. This is still very worrying. For example, if we were training an agent to be corrigible, then unless we had a really well-designed training distribution for which the simplest possible mesaobjective generalized corrigibility off-distribution, then our mesaobjective would fail to generalize well to the test distribution.
    • This implies that we should be focusing substantial effort on identifying training distributions where the simplest-possible-proxy-objective has the generalization behavior we want. This seems like a fairly concrete and empirically-investigatable research problem.
  1. ^

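    This is kind of abusing notation - in practice, the exchange rate will probably not be factorable this way, i.e. as a single conversion applied to the time cost independently of the complexity cost, but it's helping me convey the intuition for now.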

11 comments

I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.

A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.


I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.

In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".

I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.

A longer reply on the points about heuristic mesaobjectives and the switch:

I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.

But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:

  1. "Pursuing objective X" is an abstraction we use to think about an agent that manages to robustly take actions that move in the direction of objective X
  2. We can think of an agent as "pursuing X directly" if we think that the agent will take an available option that it can tell moves toward X
  3. We can think of an agent as "pursuing X deceptively" if the agent would stop taking actions that move toward X under some change of context.
  4. Some such "deceptive" agents might be better described as "pursuing Y directly" for some Y.

So an example transition from pursuing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.

And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.

And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?

If so, I can't really update on the results of the argument.

Two quick things to say:

(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.

(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.

I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.

I think speed and complexity trade off against the model's precision? Like, for simple objectives you can hardcode it, but for a complex one you wouldn't have space for the base objective that you couldn't spend on a table of chemical elements or something. So I would expect t(mesaoptimize_env) for the base objective to be only slightly greater than t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes a check, but computes the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmmm, except, maybe if you somehow specifically penalize the difference in speed for instrumental vs terminal objectives? But then the mesaoptimizer would just find other instrumental ways.


I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case.

That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of t(mesaoptimize_env) that H_{fh} doesn't have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then H_{fh} shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both H_{fd} and H_{sd} still lose due to needing to pay for c(simple_objective) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env).

I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)

I think you might be conflating two different scenarios?

I presented an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why?

Is your claim that that’s basically nothing?

Yes, or at least it approaches relatively nothing as we get a more competent optimizer.

If it’s always super easy to identify the base objective and then optimize it, then H_{fh} shouldn’t need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.

If it doesn't store it, it isn't - it would be able to derive that humans want it, but wouldn't want to optimize it itself.


What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X, to get that I’m going to figure out what the environment wants and then do it”

Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)

This is very interesting! A few thoughts/questions:

  1. I didn't quite follow the argument that H_{fh} beats H_{sd} on complexity. Is it that pointing to the base objective is more complicated than the logic of (simple mesaobjective) + (search logic to long-run optimize the mesaobjective)? If so, I worry a little that H_{sd} still has to learn a pointer to the base objective, if only so that it can perform well on it during training.
  2. I actually think you can define a speed prior with a single long training episode. For an agent that plays chess the prior can be over thinking time per move. For an agent that runs in a simulated environment it could be 'thinking time per unit simulation time'. For GPT it could be 'thinking time per predicted word', and so on.