I have seen a lot of confusion recently about exactly how outer and inner alignment should be defined, so I want to offer my attempt at a clarification.
Here's my diagram of how I think the various concepts should fit together:
The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:
And here are all my definitions of the relevant terms which I think produce those implications:
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.
Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.
Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
And an explanation of each of the diagram's implications:
Inner alignment → objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.
Outer alignment + objective robustness → intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment.
Intent alignment + capability robustness → impact alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
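The implication structure can be sketched as a toy boolean model (my own hypothetical encoding, not part of the original diagram): each arrow is read as "the conjunction of the subproblems suffices for the parent problem," and only that direction is modeled, since the post's implications are not meant to be biconditional.

```python
# Hypothetical boolean sketch of the diagram's implications. The real
# concepts are informal, not booleans; this only encodes the "if its
# direct subproblems are solved, then it is solved" direction.

def objective_robustness(inner_aligned: bool) -> bool:
    # inner alignment => objective robustness (when the model is a mesa-optimizer)
    return inner_aligned

def intent_alignment(outer_aligned: bool, obj_robust: bool) -> bool:
    # outer alignment + objective robustness => intent alignment
    return outer_aligned and obj_robust

def impact_alignment(intent_aligned: bool, cap_robust: bool) -> bool:
    # intent alignment + capability robustness => impact alignment
    return intent_aligned and cap_robust

# Outer + inner alignment alone gets us intent alignment but not impact
# alignment: capability robustness is still missing.
intent = intent_alignment(True, objective_robustness(True))
impact = impact_alignment(intent, cap_robust=False)
assert intent and not impact
```

Note that substituting objective robustness for inner alignment in the first step leaves the rest of the chain unchanged, which mirrors the point below that inner alignment is only needed when dealing with mesa-optimizers.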
If a model is both outer and inner aligned, what does that imply?
Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we're missing capability robustness.
Can impact alignment be split into outer alignment and inner alignment?
No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.
Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we're dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, the diagram tells us that we can get exactly the same thing if we substitute objective robustness for inner alignment—and while inner alignment implies objective robustness, the converse is not true.
How does this breakdown distinguish between the general concept of inner alignment as failing “when your capabilities generalize but your objective does not” and the more specific concept of inner alignment as “eliminating the base-mesa objective gap?”
Only the more specific definition is inner alignment. Under this set of terminology, the more general definition instead refers to objective robustness, of which inner alignment is only a subproblem.
What type of problem is deceptive alignment?
Inner alignment—assuming that deception requires mesa-optimization. If we relax that assumption, then it becomes an objective robustness problem. Since deception is a problem with the model trying to do the wrong thing, it's clearly an intent alignment problem rather than a capability robustness problem—and see here for an explanation of why deception is never an outer alignment problem. Thus, it has to be an objective robustness problem—and if we're dealing with a mesa-optimizer, an inner alignment problem.
What type of problem is training a model to maximize paperclips?
Outer alignment—maximizing paperclips isn't an aligned objective even in the limit of infinite data.
How does this picture relate to a more robustness-centric version?
The above diagram can easily be reorganized into an equivalent, more robustness-centric version, which I've included below. This diagram is intended to be fully compatible with the above diagram—using the exact same definitions of all the terms as given above—but with robustness given a more central role, replacing the central role of intent alignment in the above diagram.
Edit: Previously I had this diagram only in a footnote, but I decided it was useful enough to promote it to the main body.
The point of talking about the “optimal policy for a behavioral objective” is to reference what an agent's behavior would look like if it never made any “mistakes.” Primarily, I mean this just in that intuitive sense, but we can also try to build a somewhat more rigorous picture if we imagine using perfect IRL in the limit of infinite data to recover a behavioral objective and then look at the optimal policy under that objective. ↩︎
What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences. ↩︎
Note that robustness as a whole isn't included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see the alternative diagram in the FAQ. ↩︎
See here for an example of this confusion regarding the more general vs. more specific uses of inner alignment. ↩︎
See here for an example of this confusion regarding deceptive alignment. ↩︎
This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation and making the concepts precise at least in some very simplistic toy model.
In the following, I'll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.
This one is more or less clear. Even though it's not a formal definition, it doesn't have to be: after all, this is precisely the problem we are trying to solve.
The "behavioral objective" is defined in a linked page as:
This is already thorny territory, since it's far from clear what "perfect inverse reinforcement learning" means. Intuitively, an "intent aligned" agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is an AI researcher who is unwittingly building transformative unaligned AI.
This is confusing because it's unclear what counts as "well" and what the underlying assumptions are. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you're still constraining the distribution somehow. I'm guessing that either this agent is doing online learning, or it's detecting off-distribution inputs and failing gracefully in some sense, or maybe some combination of both.
Notably, the post asserts the implication intent alignment + capability robustness => impact alignment. Now, let's go back to the example of the misguided AI researcher. In what sense are they not "capability robust"? I don't know.
The "mesa-objective" is defined in the linked page as:
So it seems like we could replace "mesa-objective" with just "objective". This is confusing because in other places the author felt the need to use "behavioral objective", but here he is referring to some other notion of objective, and it's not clear what the difference is.
I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult! ↩︎
+1, great post.
Only nitpick: seems like it's worth clarifying what you mean by "infinite data" - from which distribution? And same with "off-distribution".
Thanks! And good point—I added a clarifying footnote.
Hmm, I think this is still missing something.
No—all data points that it could ever encounter is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal loss answer for every input that it's ever actually given at any point.
Deployment, but I agree that this one gets tricky. I don't think that the fact that the world is non-stationary is a problem for conceptualizing it as an MDP, since whatever transitions occur can just be thought of as part of a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.
I'm not sure you have addressed Richard's point -- if you keep your current definition of outer alignment, then memorizing the answers to the finite set of data is always a way to score perfect loss, but intuitively doesn't seem like it would be intent aligned. And if memorization were never intent aligned, then your definition of outer alignment would be impossible.
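The memorization point can be made concrete with a toy sketch (a hypothetical illustration, not from the thread): on any finite training set, a lookup table attains optimal loss on every input it actually receives, while its behavior everywhere else is completely unconstrained.

```python
# Hypothetical toy example: a memorizing "model" for a finite dataset.
# It is optimal on every training point, yet nothing about it pins down
# what it does off-distribution.
train = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def memorizer(x):
    # Perfect recall on every training input; arbitrary (here: constant 0)
    # on anything it never saw.
    return train.get(x, 0)

# Zero training loss: the memorizer matches every label it was shown.
assert all(memorizer(x) == y for x, y in train.items())
```

This is why a definition of outer alignment quantified only over inputs the model actually encounters seems to admit memorizers, which we wouldn't intuitively call intent aligned.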
Planned summary for the Alignment Newsletter:
Great post. Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.
I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.
This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system.
Zooming in on the "inner alignment → objective robustness" part of the diagram, I think what's actually going on is something like:
But here's the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent's "behavioral objective" is really just the mesa-objective of that mesa-optimizer.
That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".
In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)
Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.
A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!
I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as “trying” to do anything in particular.
Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.
Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.
Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic." If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be "aligned" because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse?
Or is the thought something like: "the behavioral objective such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn't take actions we would view as bad/problematic/dangerous/catastrophic"? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. "the agent's pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior."
Maybe the best thing to use here is just the same definition as I gave for outer alignment—I'll change it to reference that instead.
Aren't they now defined in terms of each other?
"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.
Outer alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned."
Good point—and I think that the reference to intent alignment is an important part of outer alignment, so I don't want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.
Thanks for writing this.
I wish you included an entry for your definition of 'mesa-optimizer'. When you use the term, do you mean the definition from the paper* (an algorithm that's literally doing search using the mesa objective as the criterion), or you do speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)?
A related question is: how would you describe a policy that's a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?
*incidentally, looking back on the paper, it doesn't look like we explicitly defined things this way, but it's strongly implied that that's the definition, and appears to be how the term is used on AF.
Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.