"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

by johnswentworth · 4 min read · 31st Oct 2020 · 36 comments

If you don’t know what “inner” and “outer” optimization are, or why birth control or masturbation might be examples, then check out one of the posts here before reading this one. Thanks to Evan, Scott, and Richard for discussions around these ideas - though I doubt all their objections are settled yet.

Claim: from an evolutionary fitness perspective, masturbation is an inner alignment failure, but birth control is an outer alignment failure.

More generally:

  • Assuming the outer optimizer successfully optimizes, failure to generalize from training environment to deployment environment is always an outer alignment failure (regardless of whether an inner optimizer appears).
  • Assuming outer alignment, inner alignment problems can occur only as the result of imperfect optimization.

All of these claims stem from one main argument, which can be applied to any supposed inner alignment problem.

This post will outline the argument, give a few examples, and talk about some implications.

Motivating Example: Birth Control vs Masturbation

Consider some simple biological system - like a virus or a mycobacterium - which doesn’t perform any significant optimization. The system has been optimized by evolution to perform well in certain environments, but it doesn’t perform any optimization “at runtime”.

From an evolutionary fitness perspective, the system can still end up misaligned. The main way this happens is an environment shift: the system evolved to reproduce in a certain environment, and doesn’t generalize well to other environments. A virus which spreads quickly in the rodents of one particular Pacific island may not spread well on the mainland; a mycobacterium evolved for the guts of ancestral-environment humans may not flourish in the guts of humans with a modern diet.

Obviously this scenario does not involve any inner optimization, because there is no inner optimizer. It’s purely an outer alignment failure: the objective “reproductive fitness in ancestral environment” is not aligned with the objective “reproductive fitness in current environment”. 

Now imagine a more complex biological system which is performing some optimization - e.g. humans. Evolutionary “training” optimized humans for reproductive fitness in the ancestral environment. Yet the modern environment contains many possibilities which weren’t around in the ancestral environment - e.g. birth control pills. “Wanting birth control” did not decrease reproductive fitness in the ancestral environment (because it wasn’t available anyway), but it does in the modern environment. Thus, birth control is an outer alignment problem: “reproductive fitness in the ancestral environment” is not aligned with “reproductive fitness in the modern environment”.

By contrast, masturbation is a true inner alignment failure. Masturbation was readily available in the ancestral environment, and arose from a misalignment between human objectives and evolutionary objectives (i.e. fitness in the ancestral environment). It presumably didn’t decrease fitness very much in the ancestral environment, not enough for evolution to quickly find a workaround, but it sure seems unlikely to have increased fitness - there is some “waste” involved. Key point: had evolution somehow converged to the “true optimum” of fitness in the ancestral environment (or even just something a lot more optimal), then that more-reproductively-fit “human” probably wouldn’t masturbate, even if it were still an inner optimizer.

General Argument

“Outer alignment” means that the main optimization objective is a good proxy for the thing we actually want. In the context of systems which are trained offline, this means that “performance in the training environment” is a good proxy for “performance in the deployment environment” - systems which perform well in the training environment should perform well in the deployment environment. All systems. If there is any system which performs well in the training environment but not in the deployment environment, then that’s an outer alignment failure. The training environment and objective are not aligned with the deployment environment and objective.

This is true regardless of whether there happens to be an inner optimizer or not.

Corollary: assuming outer alignment holds, all inner alignment failures are the result of imperfect outer optimization. Reasoning: assuming outer alignment, good performance on the proxy implies good performance on what we actually want. Conversely, if the inner optimizer gives suboptimal performance on what we want, then it gives suboptimal performance on the proxy. And if it gives suboptimal performance on the proxy, then the outer optimizer should not have selected that design anyway - unless the outer optimizer has failed to fully optimize.

Here’s a mathematical version of this argument, in a reinforcement learning context. We’ll define the optimal policy $\pi^*$ as

$$\pi^* = \operatorname{argmax}_{\pi} \; \mathbb{E}_{\pi}\left[\sum_t r_t\right]$$

… where $r_t$ are the reward signals and the expectation is taken over the training distribution. Evan (one of the people who introduced the inner/outer optimization split) defines outer alignment in this context as “$\pi^*$ is aligned with our true goals”.

Let’s assume that our reinforcement learner spits out a policy $\pi'$ containing a malicious inner optimizer, resulting in behavior unaligned with our true goals. There are two possibilities. Either:

  • The unaligned policy $\pi'$ is outer-optimal (i.e. optimal for the outer optimization problem), $\pi' = \pi^*$, so the outer-optimal policy $\pi^*$ is unaligned with our goals. This is an outer alignment failure.
  • The policy $\pi'$ is not outer-optimal, $\pi' \neq \pi^*$, so the outer optimizer failed to fully optimize.

Thus: assuming outer alignment holds, all inner alignment failures are the result of imperfect optimization. QED.
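
The two-case argument can be made concrete on a toy problem. The following is only a minimal sketch: the `Policy` class, its fields, and the three example policies are hypothetical, invented here for illustration, and it assumes a finite policy space in which each policy's expected training reward is known exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    expected_training_reward: float  # stands in for E[sum_t r_t] on the training distribution
    aligned: bool                    # whether the behavior matches our true goals


def diagnose(selected, policy_space):
    """Classify the selected policy using the two-case argument."""
    if selected.aligned:
        return "no failure"
    best = max(p.expected_training_reward for p in policy_space)
    if selected.expected_training_reward >= best:
        # A misaligned policy is outer-optimal, so the objective itself is at fault.
        return "outer alignment failure"
    # A strictly better-scoring policy exists, so the optimizer stopped short.
    return "imperfect optimization"


# Hypothetical policy space (all numbers are made up for illustration).
POLICIES = [
    Policy("honest", 1.0, aligned=True),
    Policy("deceptive", 1.0, aligned=False),
    Policy("sloppy", 0.7, aligned=False),
]

for policy in POLICIES:
    print(policy.name, "->", diagnose(policy, POLICIES))
```

Note that the sketch has no third branch for misaligned policies: every one of them lands in one of the two cases, whether or not it contains an inner optimizer.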

Given Perfect Outer Optimization, Deceptive Alignment Is Always An Outer Alignment Failure

In particular, we can apply this argument in the context of deceptive alignment: an inner optimizer which realizes it’s in training, and “plays along” until it’s deployed, at which point it acts maliciously.

We have two possibilities. The first possibility is that the deceptive inner optimizer was not actually optimal even in training. If the outer optimizer is imperfect, the inner optimizer might be able to “trick it” into choosing a suboptimal policy during training - more on that shortly.

But assuming that the deceptive inner optimizer is part of the outer-optimal policy, the outer objective must be unaligned. After all, the policy containing the deceptive inner optimizer is itself an example of a policy which is optimal under the objective but is unaligned to human values. Clearly, the outer optimizer is willing to assign maximal reward to systems which do not produce good behavior by human standards. That's an outer alignment failure.

When Inner Alignment Does Matter: Imperfect Optimization

Inner alignment is still an important issue, because in practice we do not use perfect brute-force optimizers. We do not enumerate every single possibility in the search space and then choose the highest-scoring. Instead, we use imperfect optimizers, like gradient descent.

In principle, an inner optimizer could “hack” an imperfect optimization algorithm. I talk more about what this looks like in Demons in Imperfect Search (also check out Tessellating Hills, a concrete example in the context of gradient descent). Biological evolution provides an excellent example: greedy genes. In sexual organisms, genes are randomly selected from each parent during fertilization. But certain gene variants can bias that choice in favor of themselves, allowing those genes to persist even if they decrease the organism’s reproductive fitness somewhat.

This is a prototypical example of what a true inner alignment problem would look like: the inner optimizer would “hack” the search function of the outer optimizer.
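
To see how a transmission bias can beat a fitness cost, here is a minimal sketch of a standard one-locus, two-allele recursion under random mating. The modeling choices are assumptions made for illustration, not taken from the post: every carrier of the "greedy" allele pays the same fitness cost, heterozygotes pass the allele on with probability `drive` (0.5 is fair inheritance), and the specific numbers are arbitrary.

```python
def next_frequency(p, cost, drive):
    """One generation of selection plus biased transmission.

    p     : current frequency of the greedy allele
    cost  : fitness reduction paid by every carrier (assumed dominant cost)
    drive : probability that a heterozygote transmits the greedy allele
    """
    q = 1.0 - p
    carrier_fitness = 1.0 - cost
    homozygous = p * p * carrier_fitness        # two copies of the greedy allele
    heterozygous = 2 * p * q * carrier_fitness  # one copy
    wild_type = q * q                           # no copies, fitness 1
    mean_fitness = homozygous + heterozygous + wild_type
    return (homozygous + heterozygous * drive) / mean_fitness


def simulate(p0, cost, drive, generations):
    """Iterate the recursion and return the final allele frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, cost, drive)
    return p


fair = simulate(p0=0.01, cost=0.2, drive=0.5, generations=50)
biased = simulate(p0=0.01, cost=0.2, drive=0.9, generations=50)
print(f"fair inheritance:   {fair:.4f}")
print(f"biased inheritance: {biased:.4f}")
```

Under fair inheritance the costly allele is selected away; with a strong enough bias it spreads even though every organism carrying it is less fit. In this toy model the outer optimizer (selection on organism fitness) is working as designed, and the allele wins by exploiting the search procedure instead.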

One takeaway: inner alignment is a conceptually easy problem - we just need to fully optimize the objective. But it’s still potentially difficult in practice, because (assuming P ≠ NP) perfect optimization is Hard. From this perspective, the central problem of inner alignment is to develop optimization algorithms which do not contain exploitable vulnerabilities.

On the flip side, we probably also want outer objectives which do not contain exploitable vulnerabilities. That's an outer alignment problem.

UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they're thinking of "inner alignment" in a way which does not necessarily involve any inner optimizer at all. They're thinking of generalization error as "inner alignment failure" essentially by definition, regardless of whether there's any inner optimizer involved. Conversely, they think of "outer alignment" in a way which ignores generalization errors.

As definitions go, these seem to cut reality at a reasonable set of joints, though the names seem misleading (the names certainly mislead me!).

Adopting their definitions, the core argument of this post is something like:

  • (outer alignment) + (no generalization error) + (full optimization) => alignment
  • generalization error and outer alignment are both orthogonal to mesa-optimization; they are problems which need to be addressed regardless of whether mesa-optimizers are an issue
  • therefore mesa-optimizers are a problem in their own right only to the extent that they exploit imperfections in the (outer) optimization algorithm.

36 comments

I think my key objection is: if there's any distributional shift at all between the training and the deployment environments, and the agent screws up at all during deployment, then your definition calls it an outer alignment failure. In particular, if there's any dangerous behaviour that we couldn't simulate during training which arises in the real world, then your definition calls it an outer alignment failure. But we can never have training environments which are identical to deployment environments, or simulate all possible dangerous behaviours, so under your definition any problems at deployment time which we have not seen during training are caused by outer misalignment - which in practice will make every problem an outer misalignment problem. (But in fact, even if we had seen those problems during training without the optimiser getting rid of them, then clearly they weren't a large enough proportion of the training data. So that's a problem with the distribution of training data, which seems like it'd also fit under your definition of outer misalignment).

Yes, part of the outer alignment problem is making sure that any training procedure we use will generalize well to the real world. This does not mean that there can't be any distributional shift, or that our training environments must be identical to our deployment environments. It means that our architecture must be robust to distribution shifts, or at least whatever kinds of distribution shifts we care about. That's certainly not impossible - it's a major selling point of causal structure learning, for instance.

So if there are safety problems because you deployed the agent, can that ever qualify as inner misalignment under your definition? Or will it always be the fault of the training procedure for not having prepared the agent for the distributional shift to actual deployment? Unlike masturbation, which you can do both in modern times and in the ancestral environment, almost all of the plausible AI threat scenarios involve the AI gaining access to capabilities that it didn't have access to during training.

By contrast, under the conventional view of inner/outer alignment, we can say: bad behaviour X is an outer alignment problem if it would have been rewarded highly by the training reward function (or a natural generalisation of it). And X is an inner alignment problem if the agent is a mesa-optimiser and deliberately chose X to further its goals. And X is just a robustness problem, not an alignment problem, if neither of these apply.

Alright, let's take those definitions and apply the argument from the OP.

bad behaviour X is an outer alignment problem if it would have been rewarded highly by the training reward function (or a natural generalisation of it)

So, if we observe bad behavior X in deployment, then one of two things must be the case:

  • the bad behavior was highly rewarded by the training reward function, so outer alignment failed
  • the bad behavior was not highly rewarded by the training reward function, so the training process failed to optimize

This applies regardless of whether there's an inner optimizer. So, the only way an inner optimizer can cause problematic behavior in deployment is if either (a) outer alignment fails, or (b) it's exploiting imperfect optimization during training.

So if there are safety problems because you deployed the agent, can that ever qualify as inner misalignment under your definition?

Yes, it would be an inner alignment problem if there were higher-scoring designs which the training process overlooked because the inner optimizer hacked the outer optimizer during training (i.e. the imperfect optimization case).

Or will it always be the fault of the training procedure for not having prepared the agent for the distributional shift to actual deployment?

If a distributional shift causes the problem, that is always 100% of the time a failure of the training procedure and/or training objective, by your own definition.

Okay, it really feels like we're talking past each other here, and I'm not sure what the root cause of that is. But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function. I could argue further why your definition isn't useful, but instead I'm just going to argue that it violates common usage.

Here's Paul: "You could select a policy which performed well on the distribution, but it may be the case that that policy is trying to do something [other than what you wanted]. There might be many different policies which have values that lead to reasonably good behavior on the training distribution, but then, in certain novel situations, do something different from what I want. That's what I mean by inner alignment."

Here's Evan: "[a mesa-optimiser could] get to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution. That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem."

Here's Vlad: "Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution."

All of them are talking about mesa-optimisers which perform well on the training distribution being potentially inner-misaligned during deployment. None of them claim that the only way inner misalignment can be a problem is via hacking the training process. You don't engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not "true" inner alignment, which seems pretty unhelpful.

(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don't explain why you're justified in interpreting it the other way).

(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don't explain why you're justified in interpreting it the other way).

That is in fact how I meant it—the definition I've given in the past of outer alignment explicitly only looks at the objective function and not the whole training process. See also this comment explaining how I think my definition differs from John's.

But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function.

Ok, I buy that. We could definitely break up what-I'm-calling-outer-alignment into one piece that says "this reward function would give good policies high reward if the policies were evaluated on the deployment distribution" (i.e. outer alignment) and a second piece which says "policies which perform well in training also perform well in deployment" (i.e. robustness). Key point: even after separating "robustness" from "outer alignment":

  • robustness + outer alignment + full optimization = fully aligned system, regardless of whether or not any inner optimizers are present
  • robustness is orthogonal to inner optimization
  • assuming outer alignment and robustness, inner alignment failure can only be a problem due to imperfect optimization
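
The first and third bullets can be checked by brute force on tiny worlds. This is a sketch under a simplified reading that is an assumption of the sketch, not a quoted definition: each policy is reduced to three booleans (optimal in training, performs well in deployment, aligned); "robustness" is read as "training-optimal implies good in deployment", "outer alignment" as "good in deployment implies aligned", and "full optimization" as "the selected policy is training-optimal". All names are hypothetical.

```python
from itertools import product

ASSUMPTIONS = ("robustness", "outer_alignment", "full_optimization")


def is_counterexample(world, selected, assumed):
    """True if the assumed properties all hold but the selected policy is misaligned.

    world is a tuple of (train_optimal, deploy_good, aligned) triples, one per policy.
    """
    holds = {
        # every training-optimal policy also performs well in deployment
        "robustness": all(good or not opt for opt, good, _ in world),
        # every policy that performs well in deployment is aligned
        "outer_alignment": all(ok or not good for _, good, ok in world),
        # the policy we actually ended up with is training-optimal
        "full_optimization": world[selected][0],
    }
    selected_is_aligned = world[selected][2]
    return all(holds[name] for name in assumed) and not selected_is_aligned


def count_counterexamples(assumed, n_policies=2):
    """Enumerate every toy world and every choice of selected policy."""
    triples = list(product([False, True], repeat=3))
    return sum(
        is_counterexample(world, selected, assumed)
        for world in product(triples, repeat=n_policies)
        for selected in range(n_policies)
    )


print("all three assumed:", count_counterexamples(ASSUMPTIONS))
for dropped in ASSUMPTIONS:
    kept = tuple(name for name in ASSUMPTIONS if name != dropped)
    print(f"without {dropped}:", count_counterexamples(kept))
```

Nothing in the enumeration tracks whether a policy contains an inner optimizer, which is the sense in which the conclusion holds regardless of whether one is present.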

None of them claim that the only way inner misalignment can be a problem is via hacking the training process.

Yes, that's why I wrote this post. All of these people seem to not realize the implications of their own definitions.

You seem to think I'm arguing about definitions here. I'm not. I'm mostly using definitions other people put forward (e.g. Evan's definition of outer optimization in the OP), and arguing that those definitions imply things which their original authors do not seem to realize.

You seem to be arguing that inner alignment is/should be defined to include generalization problems, so long as an inner optimizer is present. To the extent that that's the case, it's inconsistent with the obvious definition of outer optimization, i.e. the definition from Evan which I used in the OP. (Or the obvious definitions of outer alignment and robustness, if you want to split that out.)

You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don't explain why you're justified in interpreting it the other way

He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in-context (you can go read the thread I linked in the OP). It was not consistent with your conception.

You don't engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not "true" inner alignment

I do not assert any such thing, because it is not relevant to the argument. The argument only needs the definition of outer alignment; it says that outer alignment + full optimization = fully aligned system (or robustness + outer alignment + full optimization, if we want to split that out). Once we have that, I could just as easily substitute "silly beans" for "inner alignment": if the system is outer aligned, then silly beans are relevant to alignment only insofar as they exploit imperfections in the outer optimizer.

Again, I'm not trying to argue definitions here. There is a substantive argument in the OP, which applies regardless of how we define things.

He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in-context (you can go read the thread I linked in the OP). It was not consistent with your conception.

First of all, the quick definition that I gave in that thread was just meant to be a shortened version of the full definition that I've written an entire post on and that I linked to in that thread, which is pretty explicit about only looking at the objective function and not the training process.

Second, I think even the definition I gave in that thread is consistent with only looking at the objective function. The definition I gave there just says to ask whether all policies that maximize reward in expectation are aligned, where the way I meant that expectation to be taken was over all data, not just the training data—that is, the expectation taken over the actual underlying MDP. Under that definition, only the objective function matters, not the training process or training data, as we're only looking at the actual optimal policy in the limit of perfect training and infinite data. As a result, my definition is quite different from yours, including categorizing deception as always an inner alignment failure (see this comment).

We could definitely break up what-I'm-calling-outer-alignment into one piece that says "this reward function would give good policies high reward if the policies were evaluated on the deployment distribution" (i.e. outer alignment) and a second piece which says "policies which perform well in training also perform well in deployment" (i.e. robustness).

Fwiw, I claim that this is the actually useful decomposition. (Though I'm not going to argue for it here -- I'm writing this comment mostly in the hopes that you think about this decomposition yourself.)

I'd say it slightly differently: outer alignment = "the reward function incentivizes good behavior on the training distribution" and robustness / inner alignment = "the learned policy has good behavior on the deployment distribution". Under these definitions, all you care about is inner alignment; outer alignment is instrumentally useful towards guaranteeing inner alignment, but if we got inner alignment some other way without getting outer alignment, that would be fine.

Another decomposition is: outer alignment = "good behavior on the training distribution", robustness / inner alignment = "non-catastrophic behavior on the deployment distribution". In this case, outer alignment is what tells you that your agent is actually useful, and inner alignment is what tells you it is safe.

I thought this was what the mesa optimizers paper was trying to point at, but I share your sense that the things people say and write are inconsistent with this decomposition. I mostly don't engage with discussion of mesa optimization on LW / AIAF because of this disconnect.

I share your concern that the definitions of outer and inner alignment have gotten pretty fuzzy. I think the definitions we gave in “Risks from Learned Optimization” were actually pretty precise but suffered from being limited to the situation where the model is actually an optimizer. The problem is that now, people (including myself) have taken to using the terms outside of that narrow context, which leads to these sorts of disconnects.

I wrote “Outer alignment and imitative amplification” previously to try addressing this. Based at least partly on that post, the decomposition I would favor would be:

  • outer alignment = the reward function incentivizes good behavior in the limit of infinite data
  • inner alignment = the model actually produced by the training process has good behavior according to the reward function.

I personally like this decomposition (and it is notably quite different than John's) as it fits well with my intuitions for what things should be classified as outer vs. inner alignment failures, though perhaps you would disagree.

Yeah I think that decomposition mostly makes sense and is pretty similar to mine.

My main quibble is that your definition of outer alignment seems to have a "for all possible distributions" because of the "limit of infinite data" requirement. (If it isn't all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)

But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I'd rather have any "all possible situations" criteria be a part of inner alignment.

Another reason I prefer my decomposition is because it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.

Actually, maybe the main reason why this whole line of thinking seems strange to me is because the whole reason the inner alignment term was coined is to point at a specific type of generalisation failure that we're particularly worried about (see https://www.lesswrong.com/posts/2mhFMgtAjFJesaSYR/2-d-robustness). So you defining generalisation problems as part of outer misalignment basically defeats the point of having the inner alignment concept.

I do want to throw away the idea that deceptive alignment is mainly about the inner optimizer. It's not. If the system were outer aligned to begin with, then deceptive optimizers wouldn't be an issue, and if the system were not outer aligned to begin with, then the same imperfections which give rise to deceptive alignment would still cause problems even in the absence of inner optimizers. For instance, inner optimizers are certainly not the only problem caused by distribution shift.

This does not mean throwing away the whole concept of inner alignment, though. Inner optimizers hacking the training process is still a serious problem.

After discussing with Evan, he has a clearer decomposition in mind:

  • Outer Alignment is according to his definition in this comment, that originally comes from this post: an objective function is aligned if all models optimal for it, after perfect optimization on infinite data, are aligned (including in the deployment phase). The salient point is the hypothesis of infinite data, which thus requires no generalization and no inductive bias consideration.
  • Assuming Outer Alignment, there is still the problem of 2D-Robustness, as presented in this post. Basically, even with the right objective, the result of training might fail to generalize in the deployment environment, for one of two reasons: capability failure, where the model learned is just nowhere near optimal for the objective, and thus having the right objective doesn’t help; and objective failure, where a competent model is learned, but because of the lack of infinite data, it learns a different objective that also works in the training phase.
    So 2D-Robustness splits into Capability-Robustness and Objective-Robustness.
  • Within Objective Robustness, there is a specific type of robustness if the learned system is an explicit optimizer: Inner Alignment (as defined in Risks from Learned Optimization)

Note that the names used here are not necessarily what we should call these properties; they’re just the names from the different posts defining them.

Evan thinks, and I agree, that much of the confusion in this post’s comments comes from people having different definitions for inner alignment, and notably thinking that Outer Alignment + Inner Alignment = Alignment, which is only the case for mesa-optimization; the general equation is instead Outer Alignment + 2D-Robustness = Alignment. For example, in my own comment, what I define as Inner Alignment is closer to 2D-Robustness.

Maybe the next step would be to check if that decomposition makes sense, and then try to find sensible names to use when discussing these issues.

The salient point is the hypothesis of infinite data, which thus requires no generalization and no inductive bias consideration.

So, I didn't comment on this earlier because it was orthogonal to the discussion at hand, but this is just wrong. Indeed, the fact that it's wrong is the primary reason that generalization error is a problem in the first place.

If the training data is taken from one distribution, and deployment exposes the system to a different distribution, then infinite data will not solve that problem. It does not matter how many data points we have, if they're from a different distribution than deployment. Generalization error is not about number of data points, it is about a divergence between the process which produced the data points and the deployment environment.

That said, while this definition needs some work, I do think the decomposition is sensible.

You're absolutely right, but that's not what I meant by this sentence, nor what Evan thinks.

Here "infinite data" literally means having the data for the training environment and the deployment environment. It means that there is no situation where the system sees some input that was not available during training, because every possible input appears during training. This is obviously impossible to do in practice, but it allows the removal of inductive bias considerations at the theoretical level.

Here "infinite data" literally means having the data for the training environment and the deployment environment.

This also doesn't work - there's still a degree of freedom in how much of the data is from deployment, and how much from training. Could be 25% training distribution, could be 98% training distribution, and those will produce different optimal strategies. Heck, we could construct it in such a way that there's infinite data from both but the fraction of data from deployment goes to zero as data size goes to infinity. In that case, the optimal policy in the limit would be exactly what's optimal on the training distribution.

When we're optimizing for averages, it doesn't just matter whether we've ever seen a particular input; it matters how often. The system is going to trade off better performance on more-often-seen data points for worse performance on less-often-seen data points.
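
A minimal sketch of that trade-off, with made-up numbers. It assumes the policy has to commit to a single action for a situation it cannot tell apart across the two sources; the reward tables and the function name are hypothetical.

```python
# Hypothetical rewards: action "a" is right on training data, "b" on deployment data.
REWARD_TRAIN = {"a": 1.0, "b": 0.0}
REWARD_DEPLOY = {"a": 0.0, "b": 1.0}


def optimal_action(deploy_fraction, reward_train=REWARD_TRAIN, reward_deploy=REWARD_DEPLOY):
    """Return the action with the highest average reward over the mixed data."""
    def average_reward(action):
        return ((1.0 - deploy_fraction) * reward_train[action]
                + deploy_fraction * reward_deploy[action])
    return max(reward_train, key=average_reward)


# 98% training data versus 25% training data, as in the comment above.
print(optimal_action(deploy_fraction=0.02))
print(optimal_action(deploy_fraction=0.75))
```

Both mixtures contain examples from deployment, yet they pick different actions, and as `deploy_fraction` goes to zero the choice is whatever is optimal on the training distribution alone.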

That's only true for RL—for SL, perfect loss requires being correct on every data point, regardless of how often it shows up in the distribution. For RL, that's not true, but for RL we can just say that we're talking about the optimal policy on the MDP that the model will actually encounter over its existence.

for SL, perfect loss requires being correct on every data point, regardless of how often it shows up in the distribution

This is only true if identical data points always have the same label. To the extent that's true in real data sets, it's purely an artifact of finite data, and is almost certainly not true of the underlying process.

Suppose I feed a system MRI data and labels for whether each patient has cancer, and train the system to predict cancer. MRI images are very high dimensional, so the same image will probably never occur twice in the data set; thus the system can be correct on every datapoint (in training). But if the data were actually infinite, this would fall apart - images would definitely repeat, and they would probably not have the same label every time, because an MRI image does not actually have enough information in it to 100% perfectly predict whether a patient has cancer. Given a particular image, the system will thus have to choose whether this image more often comes from a patient with or without cancer. And if that frequency differs between train and deploy environment, then we have generalization error.
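
A small numeric sketch of the MRI example. The probabilities are invented for illustration: for one particular image, assume 30% of training patients with that image have cancer, while 80% of deployment patients with the same image do. The function names are hypothetical.

```python
def best_prediction(p_cancer_given_image):
    """Prediction minimizing 0/1 error for an image with this cancer frequency."""
    return p_cancer_given_image >= 0.5


def error_rate(predict_cancer, p_cancer_given_image):
    """Expected 0/1 error of a fixed prediction for this image."""
    return 1.0 - p_cancer_given_image if predict_cancer else p_cancer_given_image


P_TRAIN = 0.3   # assumed cancer frequency for this image in training
P_DEPLOY = 0.8  # assumed cancer frequency for the same image in deployment

learned = best_prediction(P_TRAIN)  # what unlimited training data teaches
print("learned prediction:", learned)
print("error in training:", error_rate(learned, P_TRAIN))
print("error in deployment:", error_rate(learned, P_DEPLOY))
print("best achievable in deployment:", error_rate(best_prediction(P_DEPLOY), P_DEPLOY))
```

More training data only sharpens the estimate of `P_TRAIN`; it does nothing to close the gap, because the gap comes from the difference between `P_TRAIN` and `P_DEPLOY`.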

And this is not just an example of infinities doing weird things which aren't relevant in practice, because real supervised learners do not learn every possible function. They have inductive biases - some of the learning is effectively unsupervised or baked in by priors. Indeed, in terms of bits of information, the exponentially vast majority of the learning is unsupervised or baked in by priors, and that will always be the case. The argument above thus applies not only to any pair of identical images, but any pair of images which are treated the same by the "unsupervised/inaccessible dimensions" of the learner.

Taking more of an outside view... if we're imagining a world in which the "true" label is a deterministic function of the input, and that deterministic function is in the supervised learner's space, then our model has thrown away everything which makes generalization error a problem in practice. It's called "robustness to distribution shift" for a reason.

if we're imagining a world in which the "true" label is a deterministic function of the input, and that deterministic function is in the supervised learner's space, then our model has thrown away everything which makes generalization error a problem in practice.

Yes—that's the point. I'm trying to define outer alignment so I want the definition to get rid of generalization issues.

We want a definition which separates out generalization issues, not a definition which only does what we want in situations where generalization issues never happened in the first place.

If we define outer alignment as (paraphrasing) "good performance assuming we've seen every data point once", then that does not separate out generalization issues. If we're in a situation where seeing every data point is not sufficient to prevent generalization error (i.e. most situations where generalization error actually occurs), then that definition will classify generalization error as an outer alignment error.

To put it differently: when choosing definitions of things like "outer alignment", we do not get to pick what-kind-of-problem we're facing. The point of the exercise is to pick definitions which carve up the world under a range of architectures, environments, and problems.

My own perspective on outer/inner alignment, after reading Risks from Learned Optimization and lots of other posts, and talking to Evan, is that:

  • An outer alignment failure is when you mess up the objective (the objective you train on wouldn't be the right objective on the deployment environment)
  • An inner alignment failure is when the objective generalizes well (optimizing it on the deployment environment would produce the wanted behavior), but training on the training environment causes the learning of a different objective (whether implicit or explicit) that is not exactly the training objective.

To give an example, consider the task of reaching the exit door of 2D mazes. Let's say that in the training mazes, all exit doors are red, but in some other environments they are a different color.

  • Having a reward that gives +1 for reaching red objects and 0 otherwise is an outer alignment failure, because this reward would not train the right behavior in some mazes (the ones with blue exit doors, for example).
  • Having a reward that gives +1 for reaching the exit door and 0 otherwise will ensure outer alignment, but there is still a risk of inner alignment failure because the model could learn to just reach red objects. Note that this inner alignment failure can be caused either by the model learning an explicit and wrong objective (as in mesa-optimizers), or just because reaching red objects leads to simple models that are optimal for the training function.
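The two bullets above can be sketched in a few lines of code. This is only a toy model under strong assumptions: a "maze" is reduced to the list of objects an agent could walk to, and a "policy" simply picks one of them. All names (`reward_red`, `reward_exit`, `policy_seek_red`, `policy_seek_exit`) are hypothetical.

```python
# Each maze is a list of (kind, color) objects the agent could reach.
train_mazes = [
    [("exit", "red"), ("box", "green")],
    [("box", "blue"), ("exit", "red")],
]
deploy_mazes = [
    [("exit", "blue"), ("box", "red")],
]

def reward_red(obj):
    """+1 for reaching any red object (the mis-specified objective)."""
    return 1 if obj[1] == "red" else 0

def reward_exit(obj):
    """+1 for reaching the exit door (the intended objective)."""
    return 1 if obj[0] == "exit" else 0

def policy_seek_red(maze):
    """Walk to the first red object; fall back to the first object if none is red."""
    return next((o for o in maze if o[1] == "red"), maze[0])

def policy_seek_exit(maze):
    """Walk to the exit door, whatever its color."""
    return next(o for o in maze if o[0] == "exit")

def total_reward(policy, reward, mazes):
    return sum(reward(policy(m)) for m in mazes)

# Under the exit reward, both policies look equally good in training...
print(total_reward(policy_seek_red, reward_exit, train_mazes))
print(total_reward(policy_seek_exit, reward_exit, train_mazes))
# ...but only the exit-seeking policy keeps its reward in deployment.
print(total_reward(policy_seek_red, reward_exit, deploy_mazes))
print(total_reward(policy_seek_exit, reward_exit, deploy_mazes))
# Under the red reward, the exit-seeking policy is penalized in deployment.
print(total_reward(policy_seek_exit, reward_red, deploy_mazes))
```

In this sketch the exit reward cannot tell the two policies apart on the training mazes, which is the room the second bullet leaves for an inner alignment failure; the red reward instead favors the wrong behavior in deployment, matching the first bullet.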

Another way to put it is that for an outer-alignment failure, it's generally not possible for the training to end up with an aligned model, whereas for an inner alignment failure, there is actually an optimal model that is aligned, but the data/environments provided push towards a misaligned optimal model.

I think the salient point of disagreement is that you seem to assume that for an objective to be actually the right one, all optimal policies should be aligned and generalize to the deployment distribution. But I don't think that's a useful assumption, because it's pretty much impossible IMO to get such perfect objectives.

On the other hand, I do think that we can get objectives according to which some aligned models are optimal. And splitting the failure modes into outer and inner alignment failures allows us to discuss whether we should try to change the objective or the data.

Lastly, a weakness of my definition is that sometimes it might be harder to find the right data/training environment for the "good objective" than to change the objective into another objective that is also aligned, but for which finding the right optimal policy is easier. In that case, it would be easier to solve an inner alignment failure by changing the base objective, which is really counterintuitive.

I disagree with this definition of outer alignment and I think it is quite distinct from my definition.

My definition says that an objective function is outer aligned if all models optimal under it in the limit of perfect optimization and unlimited data are aligned.

My understanding of your definition is that you want to say that a training process is outer aligned if all models optimal under that training process in the limit of perfect optimization are aligned.

The major differences between our definitions appear to be that mine only talks about the objective function while yours talks about the whole training process and mine takes the infinite data limit whereas yours does not. As a result, however, I think our definitions look quite different—under my definition, for example, deception is always an inner alignment failure as you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).

you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).

I think a model can be deceptively aligned even if formally it maps every possible input to the correct (safe) output. For example, suppose that on input X the inference execution hacks the computer on which the inference is being executed, in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).

Sure—we're just trying to define things in the abstract here, though, so there's no harm in just defining the model's output to include stuff like that as well.

So, if a distribution shift causes a problem but there's no inner optimizer, would that count as an inner alignment problem?

Also, infinite data does not at all imply that the training distribution is the same as the deployment distribution. Even if an inner optimizer defects on some data point, that point can be rare in the training distribution but more common in deployment, resulting in a high reward in training but bad behavior in deployment.
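As a toy illustration of this point, with made-up numbers: suppose a model gets full reward on every input except one, `x_bad`, where it defects. The distributions and the names `reward`, `train_dist`, `deploy_dist` and `expected_reward` are assumptions for the sketch only.

```python
from fractions import Fraction

# Hypothetical per-input reward for a model that defects only on "x_bad".
reward = {"x_ok": 1, "x_bad": 0}

# The defection point is rare in training but common in deployment.
train_dist = {"x_ok": Fraction(999, 1000), "x_bad": Fraction(1, 1000)}
deploy_dist = {"x_ok": Fraction(1, 2), "x_bad": Fraction(1, 2)}

def expected_reward(dist):
    """Average reward when inputs are drawn from the given distribution."""
    return sum(p * reward[x] for x, p in dist.items())

train_score = expected_reward(train_dist)
deploy_score = expected_reward(deploy_dist)

print(train_score, deploy_score)
```

The same model scores 999/1000 under the assumed training distribution and 1/2 under the assumed deployment distribution, however much data is drawn from the former.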

You could instead define outer alignment as "all models optimal under the objective in the limit of perfect optimization and evaluated on the deployment distribution are aligned", which seems more like what you intend.

This all still seems to leave us with a notion of "inner alignment" which does not necessarily involve any inner optimizers.

So, if a distribution shift causes a problem but there's no inner optimizer, would that count as an inner alignment problem?

Note that I'm only defining outer alignment here, not inner alignment. I think inner alignment is actually more difficult to define than outer alignment. In “Risks from Learned Optimization,” we define it as whether the mesa-optimizer's mesa-objective is aligned with the base objective, but that definition only makes sense if you actually have a mesa-optimizer. If you want a more general definition, you could just define inner alignment as whether the model actually produced by the training process has good behavior according to the base objective.

Also, infinite data does not at all imply that the training distribution is the same as the deployment distribution. Even if an inner optimizer defects on some data point, that point can be rare in the training distribution but more common in deployment, resulting in a high reward in training but bad behavior in deployment.

I mean “infinite data” to include both all training data and all deployment data. In terms of what distribution, at least if we're doing supervised learning (e.g. imitative amplification) it shouldn't matter, since I'm asking for perfectly optimal performance, so it needs to be optimal for every data point. It gets a bit more complicated for RL, but you can just say that it needs to be optimal relative to the deployment MDP.

Ok, I buy this as a valid definition. Can you give some intuition for why it's useful? Once we divorce the notion of "inner alignment" from inner optimizers, I do not see the point of having a name for it. The overloaded language is just going to lead to an unfounded intuition that inner optimizers are the main problem.

It's worth noting that “inner optimizer” isn't a term we used in “Risks from Learned Optimization” nor is it a term I use myself or encourage others to use—rather, we used mesa-optimizer, which doesn't have the same parallel with inner alignment. In hindsight, we probably should have separated things out and used inner alignment exclusively for the general definition and “mesa-alignment” or some other similar term for the more specific one (perhaps I should try to write a post attempting to fix this).

I agree with John that we should define inner alignment only in contexts where mesa-optimisers exist, as you do in Risks from Learned Optimization (and as I do in AGI safety from first principles); that is, inner alignment = mesa-alignment. Under the "more general" definition you propose here, inner alignment includes any capabilities problem and any robustness problem, so it's not really about "alignment".

I think it's possible to formulate definitions of inner alignment that don't rely on mesa-optimizers but exclude capabilities (e.g. by using concepts like 2-D robustness), though I agree that it gets a lot trickier when you try to do that, which was a major motivator for why we just went with the very restrictive mechanistic definition that we ended up using in Risks.

Ok, I just updated the end of the OP to account for these definitions. Let me know if the reworded argument makes sense.

UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they're thinking of "inner alignment" in a way which does not necessarily involve any inner optimizer at all. They're thinking of generalization error as "inner alignment failure" essentially by definition, regardless of whether there's any inner optimizer involved. Conversely, they think of "outer alignment" in a way which ignores generalization errors.

I don't think this is true. I never said that inner alignment didn't involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.

Do you agree that (a) you're thinking of "outer alignment" in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?

Yes—I agree with both (a) and (b). I just don't think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.

Oh excellent, glad to see a fresh post on it.