Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah

Pop quiz: Are the following failures examples of outer or inner misalignment? Or is it ambiguous?

(I’ve spoilered the scenarios so you look at each one separately and answer before moving on to the next. Even if you don’t want to actually answer, you should read through the scenarios for the rest of the post to make sense.)

Scenario 1: We train an AI system that writes pull requests based on natural language dialogues where we provide specifications. Human programmers check whether the pull request is useful, and if so the pull request is merged and the AI is rewarded. During training the humans always correctly evaluate whether the pull request is useful. At deployment, the AI constructs human farms where humans are forced to spend all day merging pull requests that now just add useless print statements to the codebase.

Scenario 2: We train an AI system to write pull requests as before. The humans providing feedback still correctly evaluate whether the pull request is useful, but it turns out that during training the AI system noticed that it was in training, and so specifically wrote useful pull requests in order to trick the humans into thinking that it was helpful, when it was already planning to take over the world and build human farms.

I posed this quiz to a group of 10+ full-time alignment researchers a few months ago, and a majority thought that Scenario 1 was outer misalignment, while almost everyone thought that Scenario 2 was inner misalignment.

But that can’t be correct, since Scenario 2 is just Scenario 1 with more information! If you think that Scenario 2 is clearly inner misalignment, then surely Scenario 1 has to be either ambiguous or inner misalignment! (See the appendix for hypotheses about what’s going on here.)

I claim that most researchers’ intuitive judgments about outer and inner misalignment are confused^[1]. So what exactly do we actually mean by “outer” and “inner” misalignment? Is it sensible to talk about separate “outer” and “inner” misalignment problems, or is that just a confused question?

I think it is misguided to categorize failures as “outer” or “inner” misalignment. Instead, I think “outer” and “inner” alignment are best thought of as one particular way of structuring a plan for building aligned AI.

I’ll first discuss some possible ways that you could try to categorize failures, and why I’m not a fan of them, and then discuss outer and inner alignment as parts of a plan for building aligned AI.

Possible categorizations

Objective-based. This categorization is based on distinctions between specifications or objectives at different points in the overall system. This post identifies three different notions of specifications or objectives:

Ideal objective (“wishes”): The hypothetical objective that describes good behavior that the designers have in mind.
Design objective (“blueprint”): The objective that is actually used to build the AI system. This is the designer's attempt to formalize the ideal objective.
Revealed objective (“behavior”): The objective that best describes what actually happens.

On an objective-based categorization, outer misalignment would be a discrepancy between the ideal and design objectives, while inner misalignment is a discrepancy between the design and revealed objectives.

(The mesa-optimizers paper has a similar breakdown, except the third objective is a structural objective, rather than a behavioral one. This is also the same as Richard’s Framing 1.)

With this operationalization, it is not clear which of the two categories a given situation falls under, even when you know exactly what happened. In our scenarios above, what exactly is the design objective? Is it “how well the pull request implements the desired feature, as evaluated by a human programmer” (making it an inner alignment failure), or is it “probability that the human programmer would choose to merge the pull request” (making it an outer alignment failure)?

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

(In simple examples like CoinRun, we don’t run into this problem because we use programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon.)

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This definition has a free variable: what exactly counts as “bad feedback”? For example:

Suppose we define “good feedback” to be feedback that rewards good outputs and doesn’t reward bad outputs. Under this definition, both Scenario 1 and 2 are examples of inner misalignment, since (by assumption) during training only actually good pull requests were approved.
Suppose we define “good feedback” to be feedback that rewards good outputs that were chosen for the right reasons, and doesn’t reward bad outputs or good outputs chosen for bad reasons. Under this definition, Scenario 1 is ambiguous, and Scenario 2 is an outer misalignment failure (since the feedback rewarded useful pull requests that were chosen in order to trick the humans into thinking it was helpful).

Note that in practice you could have very different definitions depending on the setting, e.g. for sufficiently formally defined tasks, you could define good feedback based on notions like “did you notice all of the issues that could have been credibly exhibited to a polynomial-time verifier”.

The main benefits of this definition are:

You no longer need to choose a counterfactual over situations that don’t come up during training. You still need counterfactuals over actions that weren’t taken in particular training situations: things like “what would the reward have been if the AI had instead written a pull request that chose not to implement a requested feature in order to avoid a security vulnerability”.
This decomposition is more closely connected to techniques. If you have an outer misalignment failure, you can fix the feedback process. If you have an inner misalignment failure, you can diversify the training distribution, or otherwise change the generalization properties.
1. In contrast, with an objective-based decomposition, the solution to an outer misalignment issue may not be to change the feedback process – if your feedback is only wrong on hypothetical inputs that weren’t seen during training, then “fixing” the feedback process doesn’t change the AI system you get.

However, you still need to choose a definition of “good feedback”; this choice will often have major implications.

Root-cause analysis. One issue with the two previous operationalizations is that they don’t tell you how to categorize cases that are a mixture of the two – where the design objective or training feedback is slightly wrong, but bad generalization could also have been an issue. Richard's Framing 1.5 suggests a solution: ask whether the failure was “primarily caused” by incorrect rewards, or by inductive biases.

This seems like a general improvement on the previous operationalizations when thinking about the problems intuitively but is hard to make precise: it is often the case that there is not a single root cause.

Other work. Here are some other things that are somewhat related that I’m not considering:

Richard's Framing 3 extends these notions to online learning. I’m happy to just stick with train/test distinctions for the purposes of this post.
Low-stakes alignment is an idea used for similar purposes as outer / inner alignment, but can’t be used to categorize real-world failures. This is because low-stakes vs. high-stakes are properties of the environment, and the real-world environment is always high-stakes.

What is the purpose of “outer” and “inner” alignment?

Overall, I don’t like any of these categorizations. So maybe we should ask: do we actually need a categorization?

I don’t think so. I think the useful part of the concepts is to help us think about an approach to building aligned AI. For example, we might use the following decomposition:

Choose an operationalization for “good” feedback. (See examples in the “Generalization-based” approach above.)
Create a feedback process that, given a situation and an AI action, provides “good” feedback on that situation and action.
Find novel situations where AIs would take a “bad” action, and fix these problems (e.g. by incorporating them into training, or by changing the architecture to change the inductive bias).

Here Step 2 is vaguely similar to “outer alignment” and Step 3 is vaguely similar to “inner alignment”, but it’s still very different from what I imagine other people imagine. For example, if in Step 1 you define “good feedback” to say that you do reward the AI system for good outputs that it chose for good reasons, but don’t reward it for good outputs chosen to deceive humans into thinking the AI is aligned, then “look at the AI’s cognition and figure out whether it is deceiving you” is now part of “outer alignment” (Step 2), when that would traditionally be considered inner alignment. This is the sort of move you need to make to think of ELK as “outer alignment”.

Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.

Conclusion

My takeaways from thinking about this are:

Don’t try to categorize arbitrary failures along the outer-inner axis. Some situations can be cleanly categorized -- CoastRunners is clearly an instance of specification gaming (which one could call outer misalignment) and CoinRun is clearly an instance of goal misgeneralization (which one could call inner misalignment) -- but many situations won't cleanly fall into a single category.
Instead, think about how alignment plans could be structured. Even then, I would avoid the terms “outer alignment” and “inner alignment”, since they seem to mislead people in practice. I would instead talk about things like “what facts does the model know that the oversight process doesn’t know”, “feedback quality”, “data distribution”, etc.

Appendix: Why are people’s answers inconsistent?

All of the approaches I’ve mentioned so far seem at least somewhat reasonable to me – they are all categorizations and would give consistent answers to the quiz above with Scenarios 1 and 2. This also means that none of them can explain why researchers give inconsistent answers to those scenarios. I’ll now speculate on what people might actually be thinking, such that they give inconsistent answers.

Training process vs. learned model. The heuristic here is that if you talk solely about the training process when describing the failure, then it will be categorized as outer misalignment. If you instead talk about how the learned model works and how its cognition leads to catastrophe, then it will be categorized as inner misalignment. This correctly predicts the pattern in the answers to the pop quiz, and corresponds to the notion that SGD is the “outer” optimizer and the learned model is the “inner” optimizer.

Why isn’t this a reasonable way to categorize failures? Well, most misalignment failures involve both (1) a training process producing a misaligned learned model and (2) a misaligned learned model that does some cognition that leads to catastrophe. If both elements are usually present, it can’t serve as a categorization.

Deception vs. other failures. The heuristic here is that if the model deliberately and knowingly deceives humans, then it is an inner misalignment failure, and otherwise it is an outer misalignment failure. (I’ve gotten this vibe from various writings; one example would be this comment thread.)

This only somewhat fits the pattern in the answers to the pop quiz – Scenario 1 should at least be categorized as “ambiguous” rather than “outer”. Perhaps since I didn’t specifically talk about deception people’s quick intuitions didn’t notice that deception could have been present and so they classified it as “outer”.

I don’t like this categorization very much as a way to think about catastrophic misalignment; it seems like in the vast majority of catastrophic misalignment scenarios the AI system does deceive humans. (For more on this, see the discussion on this post.) Even if it were a useful categorization, it seems very different from other uses of “outer” and “inner” misalignment, and I’d recommend using different terminology for it (such as “deceptive” and “non-deceptive”).

Gricean implicature. Perhaps people are using a consistent categorization, but they assume that in my scenarios I have told them all the relevant details. So in scenario 1 they assume that the AI system is not deliberately tricking humans at training while it is in scenario 2, which then justifies giving different answers.

I think this is plausible, but I’d bet against it. We don’t currently understand how AI systems internally reason, so it’s pretty normal to give a scenario that discusses how an AI behaves without saying anything about how the AI internally reasons. So there’s not that much information to be gained about AI internal mechanisms through implicature.

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.

I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

^{^}
I am much less confident whether researchers are confused about outer and inner misalignment when given time to think. In the pop quiz above, people changed their minds pretty quickly upon further discussion.

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorrect policy-scoring function was incorrect, but adequately optimized for.

I tend to favor the latter interpretation—I'd say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have "programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon" is the decomposition into Reward Specification and Adequate Policy Learning really useful.

Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

We often do know what rewards we "would have" provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it's nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn't appear during training, as that value has no influence on the final model.

In other words: Maybe we know what reward we "would have" provided in a situation that did not occur during training, or maybe we don't. Either way, this hypothetical reward has no causal influence on the final model, so it's silly to use this reward to categorise any alignment failures that show up in the final model.

Yeah, that's another good reason to be skeptical of the objective-based categorization.

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.
I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

FWIW, this was basically my thinking on the two scenarios. Not 99% likelihood, but scenario 1 does strike me as ambiguous but much more likely to be an outer misalignment problem (in the root cause sense).

Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution], while inner misalignment is failures of optimization.

I'd be pretty surprised though if more than one person in my survey had that view.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

I don't think this post is aimed at the same concept(s).

I think any training setup that calculates feedback / rewards / loss based purely on the AI’s external behavior, independent of its internal state, is almost definitely a terrible plan. If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here. I think both of your spoiler boxes are inner misalignment (by my definition at that link), because in neither case can we point to any particular moment during training where the AI’s actual external behavior was counter to the programmer’s intention yet the AI nevertheless got a reward for that behavior.

If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, but rather which also have some kind of access to the AI’s internal thoughts / motivations—which is absolutely what we should be talking about!—then yeah, the words “outer” and “inner” misalignment stop being meaningful.

(Good post, thanks for writing it.)

If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here.

Yup, in the post, this is the generalization-based decomposition with "good feedback" defined as "rewards the right action and punishes the wrong action, irrespective of 'reasons'".

If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, [...] the words “outer” and “inner” misalignment stop being meaningful.

I think you can extend this to such training setups in a way where it is still meaningful, by defining "good feedback" as feedback that incentivizes "doing the right thing for the right reasons". (This is the sort of move that happens with ELK.)

My claim is more that these definitions aren't very useful as categorizations of failure scenarios, because most situations will be some complicated mix. In the generalization-based definition, the problem is "how many pieces of bad feedback do you have to give before it counts as outer misalignment rather than inner misalignment?" In the root-cause-based definition, the problem is "how do you identify the root cause?"

This is a good post, definitely shows that these concepts are confused. In a sense both examples are failures of both inner and outer alignment -

Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
The model develops within the possibilities allowed by the under-specified goal, and has behaviours misaligned with the goal we intended.

Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?

You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data".

In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

Nice! Thinking about "outer alignment maximalism" as one of these framings reveals that it's based on treating outer alignment as something like "a training process that's a genuinely best effort at getting the AI to learn to do good things and not bad things" (and so of course the pull-request AI fails because it's teaching the AI about pull requests, not about right and wrong).

Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well - I'm sure it's a solvable technical problem, but in this mindset it seems like an extra-hard alignment problem because you basically have to teach the AI human values incidentally, rather than because it's your primary intent.

Which gets me thinking about what other framings correspond to what other intuitions about what parts of the problem are hard / what the future will look like.

The second scenario omits the details about continuing to create and submit pull requests after takeover, instead just referring to human farms. Since it doesn't explicitly say that it's still optimizing for the original objective criteria and instead just refers to world domination, it appears to be inner misalignment (e.g. no longer aligned with the original optimizer). Did the original posing of this question specify that scenario 2 still maximizes pull requests after world domination?

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorrect policy-scoring function was incorrect, but adequately optimized for.

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

Yeah, that's another good reason to be skeptical of the objective-based categorization.

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.
I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution], while inner misalignment is failures of optimization.

I'd be pretty surprised though if more than one person in my survey had that view.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

I don't think this post is aimed at the same concept(s).

(Good post, thanks for writing it.)

If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here.

Yup, in the post, this is the generalization-based decomposition with "good feedback" defined as "rewards the right action and punishes the wrong action, irrespective of 'reasons'".

If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, [...] the words “outer” and “inner” misalignment stop being meaningful.

This is a good post, definitely shows that these concepts are confused. In a sense both examples are failures of both inner and outer alignment -

Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
The model develops within the possibilities allowed by the under-specified goal, and has behaviours misaligned with the goal we intended.

Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

Which gets me thinking about what other framings correspond to what other intuitions about what parts of the problem are hard / what the future will look like.

47

Categorizing failures as “outer” or “inner” misalignment is often confused

47

Possible categorizations

What is the purpose of “outer” and “inner” alignment?

Conclusion

Appendix: Why are people’s answers inconsistent?