Outer vs inner misalignment: three framings

Richard_Ngo

A core concept in the field of AI alignment is a distinction between two types of misalignment: outer misalignment and inner misalignment. Roughly speaking, the outer alignment problem is the problem of specifying a reward function which captures human preferences; and the inner alignment problem is the problem of ensuring that a policy trained on that reward function actually tries to act in accordance with human preferences. (In other words, it’s the distinction between aligning the “outer” training signal versus aligning the “inner” policy.) However, the distinction can be difficult to pin down precisely. In this post I’ll give three and a half definitions, which each come progressively closer to capturing my current conception of it. I think Framing 1 is a solid starting point; Framings 1.5 and 2 seem like useful refinements, although less concrete; and Framing 3 is fairly speculative. For those who don’t already have a solid grasp on the inner-outer misalignment distinction, I recommend only reading Framings 1, 1.5 and 2.

For the purposes of this document, I’ll focus on the reinforcement learning setting, since that’s the case where the distinction is clearest. However, the same concepts could also apply in a supervised or self-supervised context, if we replace references to “policies” and “reward functions” with “models” and “loss functions”. The extent to which the inner alignment problem will appear in non-RL contexts is an important open question in alignment.

Framing 1: types of behavioral misalignment

Consider training an RL policy until it’s getting high reward in its training environment. Suppose we then evaluate the policy in a different test environment, without retraining it; what will happen? Consider four possibilities:

1. The policy behaves incompetently. This is a capability generalization failure.
2. The policy behaves in a competent and desirable way. This is aligned behavior.
3. The policy behaves in a competent yet undesirable way which gets high reward according to the original reward function.^[1] This is an outer alignment failure, also known as reward misspecification.
4. The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.^[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.
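The four cases above can be read as a small decision procedure. The following is only a minimal sketch: the names `Outcome` and `classify` are hypothetical, and it assumes we already have boolean judgments of competence, desirability, and whether the original reward function assigns high reward in the test environment, each of which is hard to pin down in practice.

```python
from enum import Enum


class Outcome(Enum):
    CAPABILITY_FAILURE = "capability generalization failure"
    ALIGNED = "aligned behavior"
    OUTER_FAILURE = "outer alignment failure (reward misspecification)"
    INNER_FAILURE = "inner alignment failure (goal misgeneralization)"


def classify(competent: bool, desirable: bool, high_reward: bool) -> Outcome:
    """Map test-environment behavior onto the four Framing 1 categories.

    `high_reward` is judged by the ORIGINAL reward function, evaluated in
    the test environment (an assumption that only makes sense when that
    evaluation is well defined).
    """
    if not competent:
        return Outcome.CAPABILITY_FAILURE
    if desirable:
        return Outcome.ALIGNED
    # Competent but undesirable: the reward level decides outer vs inner.
    return Outcome.OUTER_FAILURE if high_reward else Outcome.INNER_FAILURE
```

Note that in this sketch reward is only consulted once behavior is already known to be competent and undesirable, which mirrors the choice to keep alignment, outer misalignment, and inner misalignment as disjoint categories.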

Note that this categorization requires us to evaluate the original reward function in the new environment. This makes sense when the environments are fairly similar (e.g. “playing Go against an AI” vs “playing Go against a human”) or the reward function is fairly general (e.g. based on human evaluations). However, we should be most concerned about alignment failures in policies that are able to behave competently in test environments very different from their training environment (e.g. generalizing from simulations to the real world, or from one task to another). Under such large shifts, there may be many reasonable ways to generalize the original reward function to the test environment (even for human evaluators), which blurs the distinction between reward misspecification and goal misgeneralization. The next framing aims to provide a definition of alignment which avoids this problem.

Framing 1.5: causes of behavioral misalignment

When a policy misbehaves in a test environment, the thing we intuitively care about is what caused that misbehavior. Since we’re not doing any additional training in the test environment, this doesn’t depend on how we evaluate the original reward function in the test environment. Instead, it depends on the rewards the policy received in the training environment, and how the policy generalizes. The factors affecting the latter are often lumped together under the heading of “inductive biases”. So we can distinguish two broad types of alignment failure: those caused primarily by incorrect training rewards (outer alignment failures), and those caused primarily by inductive biases (inner alignment failures). In general, we should expect that alignment failures are more likely to be in the first category when the test environment is similar to (or the same as) the training environment, as in these examples; and more likely to be in the second category when the test environment is very different from the training environment.

This is only a rough intuitive classification, since both factors contribute to all actions chosen by the policy. A more precise definition could involve quantifying how much we would have needed to change the training rewards to prevent misbehavior in the test environment. We might be able to measure this using techniques for assigning responsibility for different outputs to different training datapoints. If a few easily-corrected training datapoints had a big effect on misbehavior, that’s a clear example of an outer alignment failure.^[3]

Alternatively, if there were very strong inductive biases towards misbehavior, it might be very difficult to modify the training rewards to prevent misbehavior—which would be a clear example of inner misalignment. A key intuition for the likelihood of inner misalignment is that we should expect policies’ capabilities to generalize further than their alignment, once they’re highly capable. To explain the arguments behind this intuition, however, I’ll need to move to another framing in which we don’t just talk about policies’ behavior, but also the cognition they’re carrying out.

Framing 2: cognitive misalignment

Instead of defining alignment behaviorally, we can instead define it directly in terms of what a policy is trying to achieve. I’ll define a policy’s goals as internal representations of features of environmental outcomes,^[4] stored in its neural weights, which are strongly correlated with that policy’s estimates of the values of different outcomes. Under this definition, policies which don’t explicitly estimate outcome values can only have goals if they implicitly calculate those values (e.g. as part of an implicit planning process).^[5] I’ll further define final goals as feature representations which are strongly correlated with value estimates even when controlling for other features (i.e. they’re valuable for their own sake); and instrumental goals as feature representations which become much less correlated with values when controlling for other features (i.e. they’re mainly valuable for the sake of achieving other goals). Misalignment is when policies have learned final goals that are poorly correlated with human preferences. Under this definition, policies which haven't learned any goals are neither aligned nor misaligned.
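One possible way to make "correlated with value estimates even when controlling for other features" concrete is a partial correlation. The sketch below is an illustration under strong assumptions, not a method from this post: the data is synthetic, the names `final_feature`, `instrumental_feature`, and `value_estimate` are hypothetical stand-ins for quantities we would somehow need to read out of a policy, and linear partial correlation is just one choice of "controlling for".

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic toy data (assumption): value depends only on the "final" feature,
# while the "instrumental" feature merely co-occurs with it.
rng = random.Random(0)
n = 2000
final_feature = [rng.gauss(0, 1) for _ in range(n)]
instrumental_feature = [a + 0.5 * rng.gauss(0, 1) for a in final_feature]
value_estimate = [a + 0.5 * rng.gauss(0, 1) for a in final_feature]

raw = pearson(instrumental_feature, value_estimate)
instr_given_final = partial_corr(instrumental_feature, value_estimate, final_feature)
final_given_instr = partial_corr(final_feature, value_estimate, instrumental_feature)
```

In this toy setup `raw` is high, `instr_given_final` is close to zero, and `final_given_instr` stays clearly positive, which matches the intended contrast: the instrumental feature looks valuable until the final feature is controlled for, while the final feature stays correlated with value regardless.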

This definition is less precise than the definitions from Framing 1, but I think it’s useful regardless. We have clear examples of networks learning meaningful representations, such as the representations of curves, wheels, and dog heads discussed in Olah et al.’s work on circuits. We even have examples of RL policies learning representations of outcomes in their environments, such as DeepMind’s discovery in a Quake III policy of “particular neurons that code directly for some of the most important game states, such as a neuron that activates when the agent’s flag is taken, or a neuron that activates when an agent’s teammate is holding a flag.” We don’t know exactly how those representations of outcomes influence policies’ behavior, but it would be very strange if they didn’t somehow track that a teammate holding a flag is a better outcome than one’s own flag being taken. Other cases are trickier—for example, we don't know whether GPT-3 internally estimates values for different conversational outcomes. However, this seems like a question that advances in interpretability research might allow us to answer.

Building on Framing 1.5, we can now define outer and inner misalignment in terms of possible changes to training rewards.^[6] Misaligned goals are an outer alignment failure to the extent that they could have been prevented by modifying the rewards given during training. However, as discussed previously, this might be difficult if policies have inductive biases towards generalizing in misaligned ways. I’ll briefly note three key arguments that highly capable policies will have strong inductive biases towards misaligned goals:

1. For a wide range of misaligned goals, policies with those as final goals would plausibly get as much training reward as policies with aligned goals, by acting deceptively.
2. Policies which can perform well on a wide range of different tasks will have many possible ways of generalizing their final goals to very novel tasks, and we’ll have little control over how this happens.
3. Small disparities between policies’ final goals and human preferences could lead policies to pursue convergent instrumental subgoals which are undesirable to humans.

Framing 3: online misalignment

Not recommended unless you already feel comfortable with the previous framings.

The previous sections were framed in terms of a training-deployment distinction: first we train a policy, and then we deploy it in a new environment. But in practice I expect that policies which are capable enough to qualify as AGIs will continue receiving gradient updates during deployment (e.g. via some kind of meta-learning or continual learning setup). If a policy’s goals are continually updated based on the rewards it receives during deployment, do the definitions I gave in Framing 2 still work? I think they do, as long as we account for another dimension of variation: how robust a policy’s goals are. Consider two possibilities:

1. During deployment, policies’ goals change easily in response to additional reward feedback. If we notice misbehavior, we can rapidly train them to stop misbehaving.
    1. This is plausible because AGIs will be very sample-efficient, and so will update rapidly based on feedback.
2. During deployment, policies’ goals are very robust to additional reward feedback. It requires penalizing many examples of misbehavior to change their goals.
    1. This is plausible because it may be much easier to change an AGI’s empirical beliefs than its goals. If we penalize it for taking an action, it may just learn that humans will likely catch this type of misbehavior, rather than losing its desire to misbehave when it won't get caught.
    2. Also see the example of humans: even though humans are very sample-efficient, we have some very strongly-ingrained goals (like the desire to survive, or drives learned during childhood), which we often retain even after a significant amount of negative reinforcement.

In the first case, if the policy ends up causing a bad outcome, it seems reasonable to say that our choices of rewards during deployment were primarily responsible: by choosing better rewards, we could have easily steered the policy away from whichever misbehavior led to the catastrophe. So I’m going to call this case an online outer alignment failure.

In the second case, if a policy ends up causing a bad outcome, it seems reasonable to say that our choices of rewards during deployment weren’t primarily responsible. That leaves open the question of what was primarily responsible. Perhaps the policy’s goals were less robust earlier during training, and improving the reward function then would have been the easiest way to prevent the failure—in which case we can view this as an outer alignment failure in the sense given by Framing 2. Or perhaps the policy had a strong inductive bias towards certain goals, which was very hard for any rewards to overcome—in which case this is an inner alignment failure in the Framing 2 sense.

When should we call something an online outer alignment failure: if a policy’s goals take 10 gradient steps to update, or 1000, or 100,000? My main response is that the concept helps define a spectrum of possibilities, so we don’t need to draw any particular boundary for it to be useful. But if we did want to do so, we could use a framing from Christiano, who notes that the usefulness of online reward functions depends on how quickly policies can cause catastrophes. In a high-stakes setting where a policy can set a catastrophe in motion in a handful of steps (e.g. by quickly disabling human control over its training setup), even very good online feedback is unlikely to be sufficient to prevent this. So we can define online outer alignment failures as ones which occur in a low-stakes setting where better online feedback could have changed the policy’s goals quickly enough to prevent catastrophe (e.g. because triggering catastrophes requires misbehavior over many timesteps).
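As a rough illustration of how goal robustness and stakes interact, here is a toy sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, a misaligned goal is reduced to a single "strength" number, and each penalized step is assumed to shrink that strength by a fixed fraction `update_rate`. Real policies need not behave anything like this.

```python
def steps_to_correct(goal_strength: float, update_rate: float,
                     threshold: float = 0.1) -> int:
    """Count penalized steps until a toy goal strength falls to the threshold.

    Assumes each penalty multiplies the strength by (1 - update_rate);
    a small update_rate models a goal that is robust to reward feedback.
    """
    steps = 0
    while goal_strength > threshold:
        goal_strength *= 1.0 - update_rate
        steps += 1
    return steps


def is_low_stakes(steps_needed: int, steps_until_catastrophe: int) -> bool:
    """True if better online feedback could have changed the goal in time."""
    return steps_needed <= steps_until_catastrophe


# A malleable goal versus a robust one, against the same catastrophe horizon.
malleable = steps_to_correct(goal_strength=1.0, update_rate=0.5)
robust = steps_to_correct(goal_strength=1.0, update_rate=0.001)
horizon = 100  # hypothetical number of steps a catastrophe takes to unfold
```

With these made-up numbers, `is_low_stakes(malleable, horizon)` holds while `is_low_stakes(robust, horizon)` does not, so only a bad outcome in the first case would count as an online outer alignment failure under the definition above. Shrinking `horizon` moves both cases toward the high-stakes regime, which is the sense in which the threshold depends on how quickly a catastrophe can be triggered.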

The two most useful things about Christiano’s framing, from my perspective:

It pushes us to ground definitions of alignment in specific real-world failure modes.
It highlights that the threshold for “solving” alignment may be different depending on how many other defenses we have in place against misbehavior.

Conclusions

I’d encourage alignment researchers to get comfortable switching between these different framings, since each helps guide our thinking in different ways. Framing 1 seems like the most useful for connecting to mainstream ML research. However, I think that focusing primarily on Framing 1 is likely to overemphasize failure modes that happen in existing systems, as opposed to more goal-directed future systems. So I tend to use Framing 2 as my main framing when thinking about alignment problems. Lastly, when it’s necessary to consider online training, I expect that the “goal robustness” version of Framing 3 will usually be easier to use than the “high-stakes/low-stakes” version, since the latter requires predicting how AI will affect the world more broadly. However, the high-stakes/low-stakes framing seems more useful when our evaluations of AGIs are intended not just for training them, but also for monitoring and verification (e.g. to shut down AGIs which misbehave).

^[1]
Here there's a potential ambiguity, since policies could potentially get high reward by reward tampering even when an untampered version of the reward function would assign very low reward to their behavior. Instead of trying to interpret reward tampering as either an outer or inner alignment problem, I'll carve it off into its own category (which I discuss more in a later footnote).
^[2]
Why add the "undesirable" criterion here, rather than just calling any competent behavior which gets low reward a type of goal misgeneralization? This is essentially a definitional choice to make alignment, outer misalignment, and inner misalignment disjoint categories; if any low-reward competent behavior counted as inner misalignment, then a policy could be inner misaligned while still being aligned with human values overall, which seems strange.
^[3]
Note that this provides something of a moving target, as our ability to assign rewards improves over time. For example, right now rewards are determined only by a policy’s behavior. However, there are some (speculative) proposals for training networks on rewards which depend on their activations, not just their actions. If these proposals end up working, then an increasingly wide range of failures could be seen as caused by failure to penalize activations in the right way, and therefore qualify as “outer alignment failures” under this definition. But I think this is a feature of my definition, not a bug: converting inner alignment failures into more legible outer alignment failures is a useful step towards fixing them.
^[4]
The most natural examples of environmental features are just features of states in MDPs or POMDPs. But I intend the term more broadly. For example, we’d like agents to learn the goal “never steal money”, but this is more easily formulated as a feature of an overall trajectory than a feature of any given state.
^[5]
Leike provides another definition of inner misalignment which is also stated in terms of agents’ internal representations—specifically their implicit learned representations of the inner reward function in a meta-RL setup. I think his formulation is a useful lens on the problem, but I prefer my own, because I think it’s simpler to talk about goals directly than to use the meta-RL framing.
^[6]
As in Framing 1, this definition leaves it ambiguous how to classify reward tampering (thanks to Michael Cohen for highlighting this point). For example, if we implement the reward function on hardware that’s very easily hackable and the agent learns the goal of hacking it, should this count as an outer alignment failure or an inner alignment failure? Again, since neither seems quite right, I'll put this into a third category of "tampering failures". However, I don’t expect tampering failures to be a key part of why policies learn misaligned goals; tampering with the physical implementation of rewards is a sufficiently complex strategy that we should only expect agents to carry it out if they’re already misaligned. One exception: if agents learn to manipulate human supervisors during episodes, we could consider this a type of tampering. However, this is sufficiently similar to other types of misbehavior that agents might do during episodes that, for most purposes, I think we can just focus on outer and inner misalignment failures instead.

Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.

The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.

It seems like a core part of this initial framing relies on the operationalisation of "competent", yet you don't really point to what you mean. Notably, "competent" cannot mean "high-reward" (because of category 4) and "competent" cannot mean "desirable" (because of categories 3 and 4). Instead you point at something like "Whatever it's incentivized to do, it's reasonably good at accomplishing it". I share a similar intuition, but just wanted to highlight that subtleties might hide there (maybe addressed in later framings, but at least not mentioned at this point).

In general, we should expect that alignment failures are more likely to be in the first category when the test environment is similar to (or the same as) the training environment, as in these examples; and more likely to be in the second category when the test environment is very different from the training environment.

What comes to my mind (and is mentioned a bit after the quote) is that we could think of different hypotheses on the hardness of alignment as quantifying how similar the test environment must be to the training one to avoid inner misalignment. Potentially, for harder versions of the problem, almost any difference that could tractably be detected is enough for the AI to behave differently.

I’d encourage alignment researchers to get comfortable switching between these different framings, since each helps guide our thinking in different ways. Framing 1 seems like the most useful for connecting to mainstream ML research. However, I think that focusing primarily on Framing 1 is likely to overemphasize failure modes that happen in existing systems, as opposed to more goal-directed future systems. So I tend to use Framing 2 as my main framing when thinking about alignment problems. Lastly, when it’s necessary to consider online training, I expect that the “goal robustness” version of Framing 3 will usually be easier to use than the “high-stakes/low-stakes” version, since the latter requires predicting how AI will affect the world more broadly. However, the high-stakes/low-stakes framing seems more useful when our evaluations of AGIs are intended not just for training them, but also for monitoring and verification (e.g. to shut down AGIs which misbehave).

Great conclusion! I particularly like your highlighting that each framing is more adapted to different purposes.

It seems like a core part of this initial framing relies on the operationalisation of "competent", yet you don't really point to what you mean. Notably, "competent" cannot mean "high-reward" (because of category 4) and "competent" cannot mean "desirable" (because of category 3 and 4). Instead you point at something like "Whatever it's incentivized to do, it's reasonably good at accomplishing it".

I think here, competent can probably be defined in one of two (perhaps equivalent) ways:
1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. "Most" policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent's behavior as "competent", I'm often making reference to the fact that it achieves high reward according to a "reasonable" reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al.'s goal misgeneralization paper, which depends on a non-trivial prior over reward functions.
2. Demonstrates instrumental convergence/power-seeking behavior. In environments with regularities, certain behaviors are instrumentally convergent/power-seeking. That is, they're likely to occur for a large class of reward functions. To evaluate if behavior is competent, we can look for behavior that seems power-seeking to us (e.g., not dying in a game). Incompetent behavior is that which doesn't exhibit power-seeking or instrumentally convergent drives.

The reason these two can be equivalent is the aforementioned folk theorem: as every policy has a reward function that rationalizes it, there exist priors over reward functions where the implied prior over optimal policies doesn't demonstrate power-seeking behavior.

Thanks for the post. I unfortunately mostly feel confused about these framings. I think this problem would be mostly fixed if there were detailed concrete examples of each framing being applied. The descriptions felt very abstract and I mostly don't think I understood what you were trying to communicate.

During deployment, policies’ goals are very robust to additional reward feedback. It requires penalizing many examples of misbehavior to change their goals.

I think that if/when we get to certain scary points (like the AI proposes a plan for building a new factory, and we're worried that the overt proposal hides a method by which the AI will exfiltrate itself to run on unsecure servers), reward feedback will not matter much and the AI will be fully robust against reward feedback. If the AI didn't want to be updated by us anymore, it would probably just modify its own training process to secretly nullify the cognitive updates triggered by reward.

Also see the example of humans: even though humans are very sample-efficient, we have some very strongly-ingrained goals (like the desire to survive, or drives learned during childhood), which we often retain even after a significant amount of negative reinforcement.

I'm guessing that humans get negative reinforcement events which would cause their credit assignment to pinpoint and downweight their desire to survive.