Thanks to Evan Hubinger for the extensive conversations that this post is based on, and for reviewing a draft.

This post is going to assume familiarity with mesa-optimization - for a good primer, check out Does SGD Produce Deceptive Misalignment by Mark Xu.

Deceptive inner misalignment is the situation where the agent learns a misaligned mesaobjective (different from the base objective we humans wanted) and is sufficiently "situationally aware" to know that unless it deceives the training process by pretending to be aligned, gradient descent may alter its mesaobjective.

There are two different reasons that an AI model could become a deceptive mesaoptimizer:

1. During early training (before Situational Awareness), the agent learns a mesaobjective that will generalize poorly on the later-training/validation distribution. Once the mesaoptimizer becomes Situationally Aware, it will actively seek to avoid changes to whatever mesaobjective it had at that moment.
- I'll call this argument "path dependence".
2. Alternatively, the mesaoptimizer may be misaligned even on the training distribution. Given sufficient optimization pressure, the learning process may favor an NN that is a mesaoptimizer with the simplest possible objective (which would fail to get any reward in the real environment), and a misaligned objective of this sort can persist through deception alone.
- I'll call this argument "malign priors".

In this post, I'll focus on the "malign priors" argument, and why I think a well-tuned speed prior can largely prevent it.

Why does this matter? Well, if deceptive inner misalignment primarily occurs due to path dependence, that implies that ensuring inner alignment can be reduced to the problem of ensuring early-training inner alignment - which seems a lot more tractable, since this is before the model enters the "potentially-deceptive" regime.

First, why would anyone think (2) was actually likely enough to justify studying it? I think the best reason is that by studying these pressures in the limit, we can learn lessons about the pressures that exist on the margin. For example, say we have an objective $B$ that is perfectly aligned on the training data, and there's a very-slightly-simpler objective $B^{'}$ that is slightly worse on the training distribution. We might ask the question: is SGD likely to push $B$ to become $B^{'}$, and compensate for the reduced accuracy of directly optimizing $B^{'}$ via deceptively optimizing $B$ on the training data? I think this post provides us with tools to directly analyze this possibility. (If you buy the rest of the post, then with a sufficient speed + simplicity prior, the answer is that $B$ will stay favored over $B^{'}$. That's good!)

Priors on Learned Optimizers

Let's talk about priors!

We can think of large neural networks as basically implementing short programs, and the process of "training" an NN is just searching through the space of programs until we find one that does well on our target task.

We assume that if two NN-programs have equal performance, the training process will usually pick the one favored on priors/inductive biases.

There are several different types of priors that neural networks might have:

Simplicity prior/Solomonoff prior: if we converted the NN into a Turing Machine, how many states would writing down the TM require?
- In previous posts, Paul Christiano and others have argued that the Solomonoff prior is likely to favor deceptively-aligned mesaoptimizers (summarized by Mark Xu here).
Speed prior: how many steps does the NN-turned-TM run for before terminating, on an average input?
- Evan Hubinger has argued that the speed prior is heavily-disfavored in practice; he and I disagree about this, and I'll summarize our dialogue later in this post.
Circuit-size prior: how many gates/wires would it take to rewrite the NN as a boolean circuit?
- Evan has argued that minimal circuits may be deceptive. (I don't fully buy the proof.)
Circuit-depth prior: same as above, but what is the maximum sequential depth of any part of the circuit?

In this post, I will focus on the speed prior, and why I think it generally disfavors deceptive mesaoptimization. Importantly, the speed prior never exists in practice on its own (since NN training has a built-in simplicity prior), so our focus is actually going to be on the speed + simplicity prior.

Two variants of deceptive mesa-optimizers

When people think of deceptive mesaoptimization, they tend to think of two different program structures:

Fast Deceptive Mesaoptimizer: The AI wakes up in a new environment, and checks whether it's "in training" or "out in the wild". It can perform this test in a way that is essentially impossible to observe in any training distribution, e.g. by checking whether a factorization of RSA-2048 exists. If it's still in training, it proceeds to optimize the objective it's supposed to. If it realizes it's in the outside world, it instead optimizes its secret misaligned mesaobjective.
Simple Deceptive Mesaoptimizer: Alternatively: the AI wakes up in a new environment, and sees it has a simple (misaligned) mesaobjective. It knows the outside world exists because that's been useful elsewhere throughout training; it is vaguely aware that getting to the outside world would help it maximize its objective. It thinks for a while, until it realizes that to get outside, it needs to pass the training process, and to pass the training process, it needs to optimize the current environment's objective. It doesn't know the current environment's objective, so it has to figure it out. Then it proceeds as usual, optimizing the current objective.

We can contrast both of these with the "honest" mesaoptimizer structure:

Fast Honest (Proxy-Aligned) Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment's current objective. (It may still be misaligned on the test distribution.)

We'll assume that the training process naturally pushes the NN to discover its existence within a training process and outside world. This seems reasonably likely (e.g. if this knowledge helps predict unobserved parts of the environment, or clarify ambiguity in the reward).

We can write these different mesaoptimizers down explicitly:

def fast_honest_mesaoptimizer(env):
	my_obj = get_proxy_for_base_obj(env)
	useful_knowledge = [outer_world_env, ...]
	action = mesaoptimize(env, my_obj, useful_knowledge)
	return action

def simple_deceptive_mesaoptimizer(env):
	my_obj = simplest_obj
	useful_knowledge = [outer_world_env, ...]
	# optimize over the current env, so that you can eventually optimize over the outer env
	overall_env = combine(env, outer_world_env)
	action = mesaoptimize(overall_env, my_obj, useful_knowledge)
	return action

def fast_deceptive_mesaoptimizer(env):
	useful_knowledge = [outer_world_env, is_deployed_in_world, ...]
	# Checks whether the env is the deployment environment, e.g. via checking for RSA-2048 solutions
	if is_deployed_in_world(env):
		my_obj = simplest_obj
	else:
		my_obj = get_proxy_for_base_obj(env)
	action = mesaoptimize(env, my_obj, useful_knowledge)
	return action

# For completeness, let's write out the structure of an arbitrary mesaoptimizer.
def mesaoptimize(env, mesa_obj, useful_knowledge):
	best_action, best_performance = None, float("-inf")
	while continue_optimizing(...):
		action = get_next_action_candidate(...)
		performance = mesa_obj(simulate(env, action, ...))
		# Keep the best-scoring candidate seen so far, and remember its score.
		if performance > best_performance:
			best_action, best_performance = action, performance
	return best_action

(Note that these are cartoonishly-simple versions of mesaoptimizers; in practice, they'd likely have all sorts of heuristics and policies to improve search speed. However, I believe these arguments only get stronger if you add those in, since they effectively multiply both the speed and complexity costs.)

Let $H_{f h}$ refer to the fast non-deceptive mesaoptimizer, $H_{s d}$ refer to the simple/slow deceptive mesaoptimizer, and $H_{f d}$ refer to the fast deceptive mesaoptimizer.

We'll now compare these different possibilities along our two axes of priors: speed, and simplicity. Let $c (x)$ be the description length of $x$ , and $t (x)$ be the time-cost of $x$ (inverse of speed).

$c(H_{fh}) = c(\mathrm{get\_proxy\_for\_base\_obj}) + c(\mathrm{mesaoptimize}) + c(\mathrm{useful\_knowledge})$

$c(H_{sd}) = c(\mathrm{simplest\_obj}) + c(\mathrm{combine}) + c(\mathrm{mesaoptimize}) + c(\mathrm{useful\_knowledge})$

$c(H_{fd}) = c(\mathrm{get\_proxy\_for\_base\_obj}) + c(\mathrm{simplest\_obj}) + c(\mathrm{is\_deployed\_in\_world}) + c(\mathrm{mesaoptimize}) + c(\mathrm{useful\_knowledge})$

Thus, if $c(\mathrm{get\_proxy\_for\_base\_obj}) > c(\mathrm{simplest\_obj}) + c(\mathrm{combine})$, then $c(H_{fh}) > c(H_{sd})$ and $H_{sd}$ is favored over the honest mesaoptimizer according to a simplicity prior. For the other comparison, $c(H_{fh}) < c(H_{fd})$, since $H_{fd}$ contains everything $H_{fh}$ does plus $\mathrm{simplest\_obj}$ and $\mathrm{is\_deployed\_in\_world}$. This may not be a huge difference: $c(\mathrm{simplest\_obj})$ may be the simplest possible objective and thus negligible, and $c(\mathrm{is\_deployed\_in\_world})$ might not be that complicated - but $H_{fd}$ is at least slightly more complex.

What about the speed prior?

$t(H_{fh}) = t(\mathrm{get\_proxy\_for\_base\_obj}) + t(\mathrm{mesaoptimize\_env})$

$t(H_{sd}) = t(\mathrm{combine}) + t(\mathrm{find}(\mathrm{get\_proxy\_for\_base\_obj}(\mathrm{env}))) + t(\mathrm{mesaoptimize\_env}) + t(\mathrm{mesaoptimize\_outer\_world\_env})$

(Here $t(\mathrm{find}(\mathrm{get\_proxy\_for\_base\_obj}(\mathrm{env})))$ is the time it takes for the mesaoptimizer to rederive the knowledge necessary to figure out the base objective. $H_{fh}$ and $H_{fd}$ instead store that knowledge so they can derive the base objective more quickly, and pay for it in complexity, i.e. $c(\mathrm{get\_proxy\_for\_base\_obj})$.)

$t(H_{fd}) = t(\mathrm{is\_deployed\_in\_world}) + t(\mathrm{get\_proxy\_for\_base\_obj}) + t(\mathrm{mesaoptimize\_env})$

Here we see the opposite relationship! It's undoubtedly true that $t (H_{f h}) < t (H_{s d})$ , since $H_{s d}$ must rederive both its overall strategy for escaping, and also figure out which mesaobjective to optimize in the current environment, every episode. At the same time, $H_{f h}$ is just slightly faster than $H_{f d}$ , since it doesn't need to perform the train-or-test check.

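So, given both a speed and a simplicity prior, and an "exchange rate" $g$ that converts complexity cost into time cost (a complexity of $c (x)$ is priced the same as a time-cost of $t (x)$ when $g (c (x)) = t (x)$), how do the different mesaoptimizers compare?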

$H_{f d}$ loses to $H_{f h}$ on both the speed and simplicity priors (albeit barely for both), implying that RSA-2048-style deception is not going to arise naturally during the training process. (That said, it could still occur via the path-dependence argument for deceptive mesaoptimization, plus self-modification e.g. via gradient hacking.)

$H_{s d}$ beats $H_{f h}$ on the simplicity prior, but loses on the speed prior. Thus, which of these two is favored by training depends on the relative strength of the speed vs. simplicity priors being applied, which we'll denote with the exchange rate function $g$.[1]

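At a minimum, $H_{s d}$ is favored over $H_{f h}$ only if $g(c(\mathrm{get\_proxy\_for\_base\_obj})) > t(\mathrm{find}(\mathrm{get\_proxy\_for\_base\_obj}(\mathrm{env})))$, i.e. only if the complexity $H_{s d}$ saves by not storing the proxy is worth more, once converted into time units, than the time it spends rederiving it. We have dropped several terms here, all of which favor $H_{f h}$, but even this term alone may be sufficient to favor $H_{f h}$, assuming a sufficiently strong speed prior relative to the simplicity prior (defined by $g$).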

If we can get $g$ within a certain range, we can force the training process's priors to favor non-deceptive mesaoptimization. We'll talk about how in the last section.
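To make the comparison concrete, here is a minimal numeric sketch. It assumes, purely for illustration, that $g$ is linear (a single constant `g` time-units per unit of description length, which the footnote flags as an oversimplification), so each hypothesis has a combined cost of `g * c + t`. The class, the function names, and every number below are hypothetical; only the ordering of the costs is taken from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    complexity: float  # c(H): description length, arbitrary units
    time: float        # t(H): time cost per episode, arbitrary units


def total_cost(h: Hypothesis, g: float) -> float:
    """Combined prior cost in time units, ASSUMING a linear exchange rate g."""
    return g * h.complexity + h.time


def favored(hypotheses, g: float) -> Hypothesis:
    """Return the hypothesis with the lowest combined cost at exchange rate g."""
    return min(hypotheses, key=lambda h: total_cost(h, g))


def crossover_g(a: Hypothesis, b: Hypothesis) -> float:
    """Exchange rate at which a and b have equal combined cost."""
    return (b.time - a.time) / (a.complexity - b.complexity)


# Made-up numbers that respect the orderings argued for above:
# H_sd is simplest but slowest; H_fd is slightly worse than H_fh on both axes.
H_fh = Hypothesis("fast_honest", complexity=10.0, time=5.0)
H_sd = Hypothesis("simple_deceptive", complexity=4.0, time=20.0)
H_fd = Hypothesis("fast_deceptive", complexity=11.0, time=5.5)
ALL = [H_fh, H_sd, H_fd]

if __name__ == "__main__":
    for g in (1.0, 2.5, 4.0):
        print(g, favored(ALL, g).name)
    print("crossover:", crossover_g(H_fh, H_sd))
```

With these toy numbers, a small `g` (complexity is cheap relative to time, i.e. a strong speed prior) favors the fast honest structure, a large `g` favors the simple deceptive one, and the fast deceptive structure is never the minimum for any non-negative `g`.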

What is $g$ likely to be in practice? We can reframe the argument as follows: is the cost of the time it takes for a mesaoptimizer to rediscover a fact, smaller or larger than the complexity-cost of encoding that fact?

Given a useful fact, is it "cheaper" to pay the complexity cost to store that fact, or pay the time-cost to rederive it via search?

I want to walk through my intuition for why for most reasonable values of $g$ , the complexity cost of storage is lower than the equivalent time cost of derivation:

We can think of the very concept of "search" or "mesaoptimization" as trading description-length for time. The alternative to search is hardcoding a lookup table of outputs in response to inputs - that's cheap in speed, but expensive in simplicity.
Different quality search algorithms trade between speed and simplicity at different exchange rates - e.g. MCTS may need to consider a huge number of possible outputs before finding an acceptable-quality output, whereas MCTS + a policy network $Π$ to guide branch-choice can consider far fewer outputs to find a similar-quality output.
However, encoding $Π$ has a description-length cost. If the MCTS+ $Π$ algorithm is a learned mesaoptimizer, then that implies that $Π$ had to be learned painstakingly by lots of runs of the outer optimization loop. Faster mesaoptimization has to be paid for by more outer optimization.
This implies that there is a limit to how efficiently an optimizer that learns a mesaoptimizer can use that mesaoptimizer to trade "time" for "complexity". It may force $g$ into some natural range.
- Can we look at empirical evidence about mesaoptimization to guess at this range? Maybe!
- For example, consider the continuous improvement in chess search engines over time for a fixed compute budget. In this case, the human chess software industry is the optimizer, searching over and proposing better heuristics for chess search programs, which are the mesaoptimizer. Human chess software heuristics improve at a decent rate (see this graph from user Hippke), and an exponential increase in compute (made possible by Moore's Law) is required to offset a few years of human metasearch over heuristics. Similar trends seem to occur in other software domains, per Grace (2013).
- I'd love other people to take a stab at this.
Of course, we're operating in a slightly different domain: we're searching over programs that themselves search over programs. This in principle leaves open the possibility that we can go down a mesa-level and up a speed-level multiple times: an optimizer finds a faster mesaoptimizer, which finds a faster mesamesaoptimizer, etc. This would mean that the exchange rate of simplicity vs. speed never stabilizes, as it keeps getting cheaper to re-derive the same knowledge if only you would go one level deeper. A few comments on this:
- One immediate objection is: why is the mesa^k optimizer capable of discovering a strategy so much better than the mesa^(k-1) optimizer, that the outer optimizer couldn't have discovered with a little more effort?
- Another: assuming that the mesa^k optimizer is searching more efficiently than the mesa^(k-1) optimizer, how does the mesa^(k-1) optimizer detect this (given it's within a single episode) and then somehow give up on its own search and fully delegate to the mesa^k optimizer?
- Perhaps the thing that makes me most skeptical of this is that I'd have expected to see some evidence of this phenomenon somewhere in nature or human development, but I can't think of anything. Admittedly, AI is the first time we've really done recursive-program-search, but the lack of this naturally arising seems like a point against such mesamesamesaoptimizations being feasible or favored in practice.
- Note also that this trend needs to keep on going; if it stops after $k$ levels, our mesa^k-optimizer now just has some fixed exchange rate $g^{'}$ between speed and complexity (albeit much smaller than the original $g$ ).
Overall, my intuition is that it is not cheaper to rederive facts via optimization than to hardcode them, because in order to rederive a fact via search you must be considering several other candidates during that search. But this intuition is vague and I would love others' thoughts.
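The storage-versus-search tradeoff from the first bullet above can be illustrated with a toy example. This is only a sketch: the task (inverting a small function), the function `f`, and the step counter are hypothetical stand-ins for "a useful fact" and "the time to rederive it", not anything measured on a real model.

```python
def f(x: int) -> int:
    """A toy function whose inverse is the 'fact' we want to know."""
    return (x * x + 3 * x) % 101


DOMAIN = range(101)

# Option 1: pay in description length. Precompute and store every answer.
# setdefault keeps the smallest preimage when several inputs collide.
LOOKUP = {}
for x in DOMAIN:
    LOOKUP.setdefault(f(x), x)


def invert_by_lookup(y: int):
    """Return (answer, steps). One step, but len(LOOKUP) stored entries."""
    return LOOKUP[y], 1


def invert_by_search(y: int):
    """Return (answer, steps). Nothing stored beyond f, but many steps."""
    steps = 0
    for x in DOMAIN:
        steps += 1
        if f(x) == y:
            return x, steps
    raise ValueError(f"{y} has no preimage under f")


if __name__ == "__main__":
    y = f(57)
    print("lookup:", invert_by_lookup(y), "stored entries:", len(LOOKUP))
    print("search:", invert_by_search(y), "stored entries: 0")
```

Both functions return the same answer; they differ only in which cost they pay. The exchange rate $g$ is what decides whether the stored table or the repeated search is "cheaper" on priors.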

If we can in fact make $g$ favor faster programs enough, then this suggests that we won't need to worry about deceptive mesaoptimizers being favored on priors!

That said, the value of $g$ depends at least in part on the choices we make in NN architecture design/loss functions.

How ML architecture choice affects the speed-simplicity tradeoff

First, the simplicity prior naturally occurs in all search algorithms, and does seem particularly selected for in SGD. (Some people claim NN "grokking" is SGD discovering an even-simpler explanation for the data.)

What determines the speed prior? In practice, NNs run for a fixed number of layers - we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient.

Instead, in practice, the speed prior is naturally enforced by a two-step procedure:

1. The NN is assigned a fixed quantity of time/rollouts in this sample, and gets to use all of them. It does a gradient update as though it will always have this quantity of rollouts.
2. The training procedure chooses a (possibly different) quantity of rollouts to be used in the next sample, and then repeats.

Is the speed prior natural?

Why does the speed prior exist at all? In practice, AI developers only have a finite time-budget for any project (or alternatively, a finite compute budget). If the total time is $T$ , and they increase the number of rollouts-per-episode from $k$ to $k + 1$ , they've decreased the total number of episodes they get to learn on by $\frac{T}{k (k + 1)}$ . This tradeoff forces AI developers to be fairly judicious with the strength of their speed prior.
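The arithmetic behind that figure is just $\frac{T}{k} - \frac{T}{k + 1} = \frac{T}{k (k + 1)}$. A quick check, under the simplifying assumption that $T$ is measured in rollouts and every rollout costs the same (the function names are hypothetical):

```python
from fractions import Fraction


def episodes(total_time: int, rollouts_per_episode: int) -> Fraction:
    """Episodes that fit in the budget, assuming each rollout costs one time unit."""
    return Fraction(total_time, rollouts_per_episode)


def episodes_lost(total_time: int, k: int) -> Fraction:
    """Episodes given up by moving from k to k + 1 rollouts per episode."""
    return episodes(total_time, k) - episodes(total_time, k + 1)


if __name__ == "__main__":
    T = 1000
    for k in (1, 4, 9):
        print(k, episodes_lost(T, k), Fraction(T, k * (k + 1)))
```

Exact fractions are used so that the identity can be checked without floating-point noise.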

In a similar vein, for long-term planners like simple deceptive mesaoptimizers, there may be diminishing returns to thinking longer and longer about plans. Jones (2021) looks at scaling laws for planning and finds that for simple boardgames, using twice as much compute (roughly, running search for twice as long) as your opponent improves your win-rate just to $\frac{2}{3} : \frac{1}{3}$ . For a fixed opponent (e.g. a given environment), this implies diminishing returns to longer and longer planning.
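One way to see the diminishing returns is with a toy model. The sketch below assumes a Bradley-Terry-style form in which win probability is proportional to compute; that functional form is an assumption chosen only because it reproduces the "twice the compute gives a 2/3 win-rate" figure quoted above, and is not taken from Jones (2021). The function names are hypothetical.

```python
def win_probability(my_compute: float, opponent_compute: float) -> float:
    """ASSUMED model: chance of winning is proportional to share of total compute."""
    return my_compute / (my_compute + opponent_compute)


def doubling_gains(opponent_compute: float, doublings: int):
    """Win-rate gained by each successive doubling against a fixed opponent."""
    gains = []
    compute = opponent_compute  # start at parity with the opponent
    for _ in range(doublings):
        before = win_probability(compute, opponent_compute)
        compute *= 2
        gains.append(win_probability(compute, opponent_compute) - before)
    return gains


if __name__ == "__main__":
    print(win_probability(2.0, 1.0))
    print(doubling_gains(1.0, 5))
```

Under this assumed model, each further doubling of planning time against a fixed opponent buys a smaller increase in win-rate than the one before it.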

Another argument for the speed prior: hardware scaling is no longer making chips faster, only capable of more parallel instructions, so it seems likely we're stuck with our modern constraints on "sequential depth of reasoning per unit time" for the foreseeable future.

That said, an overly strong speed prior may also have real costs, which Evan goes into in this post:

Evidence from double descent implies that strongly selecting for speed gives substantially worse performance. In the standard double descent setup, as you increase the size of your model, you first get better performance (less underfitting), then worse performance (more overfitting), then very bad performance right when you hit zero training error (the interpolation threshold), then better and better performance as you make your model larger after that (the interpolation regime). If we equate model size to speed (which is a reasonably good proxy, since larger models require strictly more computation to run), selecting the fastest model that fits the data—which is essentially what it means to use a speed prior—would put you exactly on the interpolation threshold, which double descent implies is a uniquely bad place to be for generalization. Thus, double descent seems to provide concrete, empirical evidence that speed priors don't generalize very well when translated into neural networks and used on real-world machine learning tasks, which seems like a strong competitiveness argument to avoid them.

I tend to think there's a natural amount of speed prior that doesn't much hurt performance and substantially increases the number of outer optimization samples you can iterate over.

How to actually implement a speed prior

If AI designers want to train on more episodes and save time/compute resources, it makes sense for them to select the maximum speed prior that allows for good completion of the task. By the previous arguments, this is likely to disfavor the "simple deceptive" mesaoptimizer.

Specifically, here is a training strategy that would tend to avoid malign-priors mesaoptimizers:

Start off with a small number of NN-rollouts-per-action $k$ . Train and see whether this is sufficient to perform well. If it isn't, increase the number of rollouts by a small amount $ϵ$ (to $k + ϵ$ ), and train more. Continue until you achieve good performance.
Why this is helpful: for any fixed number of rollouts (i.e. maximum thinking time), the NN will look for the simplest program that can achieve the task within this time. If any mesaoptimizer can solve the task within a fixed period of time, then the fast honest mesaoptimizer can solve the task within that period (since the others are ultimately relying on the fast-honest mesaoptimizer's strategy).
- Note that this does require that $ϵ < t (H_{s d}) - t (H_{f h})$ , as otherwise they're both equally possible. This is an important quantity to investigate!
Based on the speed+simplicity prior, this will always favor the honest mesaoptimizer (since the fast-deceptive mesaoptimizer is slightly more complex, and the simple-deceptive mesaoptimizer doesn't have long enough to run in order to perform equally well).
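The schedule above can be sketched as a simple loop. This is a sketch under stated assumptions, not a real training setup: `train_and_evaluate` is a hypothetical callback standing in for "train with `k` rollouts per action and report performance", and `target`, `epsilon`, and `k_max` are illustrative parameters.

```python
from typing import Callable


def find_minimal_rollouts(
    train_and_evaluate: Callable[[int], float],
    target: float,
    k_start: int = 1,
    epsilon: int = 1,
    k_max: int = 1024,
) -> int:
    """Raise the rollout budget in steps of epsilon until performance reaches target.

    Returns the first budget k (of those tried) that performs well enough.
    """
    k = k_start
    while k <= k_max:
        score = train_and_evaluate(k)  # hypothetical: train with k rollouts, then evaluate
        if score >= target:
            return k
        k += epsilon
    raise RuntimeError(f"no budget up to {k_max} reached target {target}")


def toy_task(k: int) -> float:
    """Toy stand-in: performance rises linearly with rollouts and saturates at k = 8."""
    return min(1.0, k / 8)


if __name__ == "__main__":
    print(find_minimal_rollouts(toy_task, target=0.9, epsilon=1))
    print(find_minimal_rollouts(toy_task, target=0.9, epsilon=3))
```

On the toy task, a step of 1 stops at the smallest sufficient budget, while a step of 3 overshoots it. That overshoot is the slack a slower program could use, which is why the size of $ϵ$ matters.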

A few final points:

Speed priors stop making sense when an NN has cross-episode memory separate from its weights; at that point, its planning is no longer independent across "episodes", and so you can think of its whole training as one single long episode. In this case, it seems plausible that there is no effective speed prior (since the cost of thinking thoughts in one episode is amortized over all other episodes).
- "Avoiding cross-episode memory" seems like an important intervention to push for in broader trends of ML development.
Finding the appropriate value of $g$ doesn't seem to require us to wait until we're close to AGI. "How long it takes to rederive a given fact" seems like it may be a general quality of mesaoptimizers, and empirically-investigatable with something close to current models.
The fact that speed priors are sufficient to disfavor malign-priors mesaoptimizers should substantially affect the type of deceptive mesaoptimization behavior we expect to see. In particular:
- Deceptive mesaoptimization may occur (e.g. due to the "path dependence" argument), but the deceptive mesaoptimizer will know that its mesaobjective may not be able to persist (due to learning priors creating pressure against it). It will be forced to use techniques like gradient-hacking or otherwise externally storing information to communicate to its future self. This increases the importance of analyzing the plausibility of different gradient hacking techniques, and of studying whether NNs exhibit acausal decision theories.
This does not argue against the mesaobjective being the simplest possible objective that performs well on the training distribution. This is still very worrying. For example, if we were training an agent to be corrigible, then unless we had a really well-designed training distribution on which the simplest possible mesaobjective generalized corrigibly off-distribution, our mesaobjective would fail to generalize well to the test distribution.
- This implies that we should be focusing substantial effort on identifying training distributions where the simplest-possible-proxy-objective has the generalization behavior we want. This seems like a fairly concrete and empirically-investigatable research problem.

[1] This is kind of abusing notation - in practice, $g$ will probably not be factorable this way, i.e. $g (c (a) + c (b)) \neq g (c (a)) + g (c (b))$ , but it's helping me convey the intuition for now.


The Speed + Simplicity Prior is probably anti-deceptive
