This is the first post in a series where I'll explore AI alignment in a simplified setting: a neural network that's being trained by gradient descent. I'm choosing this setting because it involves a well-defined optimization process that has enough complexity to be interesting, but that's still understandable enough to make crisp mathematical statements about. As a result, it serves as a good starting point for rigorous thinking about alignment.

## Defining inner alignment

First, I want to highlight a definitional issue. Right now there are two definitions of **inner alignment** circulating in the community. This issue was first pointed out to me by Evan Hubinger in a recent conversation.

The first definition is the one from last year's Risks from Learned Optimization paper, which Evan co-authored and which introduced the term. This paper defined the inner alignment problem as **"the problem of eliminating the base-mesa objective gap"** (Section 1.2). The implication is that if we can eliminate the gap between the base objective of a base optimizer, and the mesa-objectives of any mesa-optimizers that base optimizer may give rise to, then we will have satisfied the necessary and sufficient conditions for the base optimizer to be inner-aligned.

There's also a second definition that seems to be more commonly used. This definition says that **"inner alignment fails when your capabilities generalize but your objective does not"**. This comes from an intuition (pointed out to me by Rohin Shah) that the combination of inner alignment and outer alignment should be accident-proof with respect to an optimizer's intent: an optimizer that's both inner- and outer-aligned should be trying to do what we want. Since an outer-aligned optimizer is one whose base objective is something we want, this intuition suggests that the remaining part of the intent alignment problem — the problem of getting the optimizer to try to *achieve* the base objective we set — is what inner alignment refers to.

Here I'll try to propose more precise definitions of **alignment** and **capability** in an optimizer, and explore what **generalization** and **robustness** might mean in the context of these properties. I'll also propose ways to quantify the capability and alignment profiles of existing ML systems.

But before doing that, I want to motivate these definitions with an example.

## The base objective

The optimizer I'll be using as my example will be a **gradient descent process**, which we're going to apply to train a simplified neural network. I want to emphasize that I'm treating **gradient descent** as the optimizer here — not the neural network. The neural network *isn't* necessarily an optimizer itself, it's just the output artifact of our gradient descent optimizer.

To make this scenario concrete, we'll imagine the neural network we're training is a simplified language model: a feedforward MLP with a softmax layer at the top. The softmax layer converts the MLP's output activations into a probability distribution over next words, and the model gets scored on the cross-entropy loss between that probability distribution, and the actual next word that appears in the training text. (This ignores many of the complications of modern language models, but I'm keeping this example simple.)

We'll let $\theta_t$ represent all the parameters of this MLP (all its weights and biases) at training step $t$. To train our MLP with gradient descent, we feed it batches of input-output pairs $(x_i, y_i)$. If our MLP is part of a language model, then $x_i$ might represent the words in the language model's context window for the $i$th training example in the batch, and $y_i$ might represent a one-hot encoding of the correct next word for the $i$th training example in the batch. To make things even simpler, I'm also going to assume that every training batch contains the *entire* training dataset of $N$ examples, an arrangement we'd never use if we were training a real system.

So at a given training step $t$, the loss function for our language model is

$$L(t) = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log f(x_i, \theta_t)$$

I'll refer to the function $f(x_i, \theta_t)$ as **"the neural network"**. Here, "$\cdot$" is the dot product.

Notice that $L(t)$ here is our **base objective**: it's the quantity we're trying to get our gradient descent process to optimize for. If we'd succeeded in solving the entire outer alignment problem, and concluded that the base objective $L(t)$ was the only quantity we cared about optimizing, then the remaining challenge, getting our gradient descent process to actually optimize for $L(t)$, would constitute the inner alignment problem, by our second definition above.
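To make the base objective concrete, here is a minimal sketch of a full-batch cross-entropy loss in plain Python. It skips the MLP entirely: the `logits` below are hypothetical stand-ins for the network's output activations, and the function names are my own, chosen for illustration.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(batch_logits, batch_targets):
    """Full-batch loss: -(1/N) * sum_i  y_i . log f(x_i, theta)."""
    n = len(batch_logits)
    total = 0.0
    for logits, y in zip(batch_logits, batch_targets):
        probs = softmax(logits)
        # y is one-hot, so the dot product picks out log p(correct word)
        total += sum(y_k * math.log(p_k) for y_k, p_k in zip(y, probs) if y_k)
    return -total / n

# Hypothetical toy batch: 2 examples, vocabulary of 3 words.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [[1, 0, 0], [0, 0, 1]]
loss = cross_entropy_loss(logits, targets)
```

The second example has uniform logits, so its contribution to the loss is exactly $\log 3$, whichever word is correct.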

So the question now is: **under what conditions does gradient descent actually optimize for our base objective**?

## The true objective

To answer this, we can try to determine which quantity gradient descent is truly optimizing for, and then look at how and when that quantity correlates with the base objective we really care about.

We can start by imagining the $t$th step of gradient descent as applying a **learning function** $\Phi$ to the parameters in $\theta_t$:

$$\theta_{t+1} = \Phi(\theta_t)$$

Running gradient descent consists of applying $\Phi$ repeatedly to $\theta_0$:

$$\theta_t = \Phi^t(\theta_0)$$

In the long run, gradient descent should converge on some terminal value $\theta^* = \lim_{t \to \infty} \Phi^t(\theta_0)$. (For now, we'll assume that this limit exists.)

The key characteristic of a terminal value $\theta^*$ (when it exists) is that it's a **fixed point** of the dynamical system defined by $\Phi$. In other words:

$$\Phi(\theta^*) = \theta^*$$

Some of the fixed points of this system will coincide with **global** or **local minima** of our base objective, the cross-entropy loss — but not all of them. Some will be saddle points, while others will be local or global maxima. And while *we* don't consider all these fixed points to be equally performant with respect to our base objective, *our gradient descent optimizer* does consider them all to be equally performant with respect to its true objective.
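A small sketch can show a fixed point that is not a minimum. It assumes a one-parameter toy loss, $\theta^4/4 - \theta^2/2$, which has minima at $\theta = \pm 1$ and a local maximum at $\theta = 0$. The loss, the learning rate, and the function names are illustrative choices, not part of the language model example.

```python
def grad(theta):
    """Gradient of the toy loss theta**4/4 - theta**2/2."""
    return theta**3 - theta

def phi(theta, lr=0.1):
    """One gradient descent step: the learning function."""
    return theta - lr * grad(theta)

def run(theta0, steps=2000):
    """Apply the learning function repeatedly, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta = phi(theta)
    return theta

at_minimum = run(0.5)   # slides into the minimum at theta = 1
at_maximum = run(0.0)   # starts on the local maximum and never moves
```

Both runs end at a point where applying `phi` again changes nothing, even though one of them sits on a maximum of the loss.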

This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn't always optimizing for the quantity we want it to. So what quantity is it optimizing for?

When we apply one step of gradient descent, we update each parameter in our neural network by an amount equal to a learning rate, times the error in that parameter that we calculate during backprop on the loss function $L(t)$. The update we apply to the $j$th parameter, to move it from $\theta_t^j$ to $\theta_{t+1}^j$, can be written as

$$\theta_{t+1}^j = \theta_t^j - \epsilon_t \frac{\partial L(t)}{\partial \theta_t^j}$$

Here, $\epsilon_t$ represents our learning rate at time step $t$.

So our gradient descent optimizer will terminate if and only if there exists some time step $t^*$ such that $\theta_{t^*+1}^j = \theta_{t^*}^j$, across all parameters $j$. (For a fixed learning function $\Phi$, this condition implies that the gradient updates are zero for all $t \geq t^*$ as well.) And this happens if and only if the sum of the gradient magnitudes over all $M$ parameters

$$G(t) = \sum_{j=1}^{M} \left| \frac{\partial L(t)}{\partial \theta_t^j} \right|$$

is equal to zero when $t \geq t^*$.

But $G(t)$ represents more than just the terminal condition for our optimizer. It's the quantity that gradient descent is actually trying to minimize: anytime $G(t)$ deviates from zero, the amount of optimization power that's applied to move $G(t)$ *towards* zero is proportional to $G(t)$ itself. That makes $G(t)$ the **true objective** of our gradient descent optimizer: it's the loss function that gradient descent is actually optimizing for.

So now we have a **base objective** $L(t)$, which we've assigned to an optimizer; and we have a **true objective** $G(t)$, which is the one our optimizer is actually pursuing. Intuitively, the inner alignment of our optimizer seems like it would be related to how much, and under what circumstances, $G(t)$ correlates with $L(t)$ over the course of a training run. So we'll look at that next.
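Here is a rough sketch of how one might record both objectives during a training run. It assumes the true objective is the sum of the absolute partial derivatives of the loss, and it uses a hypothetical two-parameter quadratic loss instead of a real network so that the gradient can be written by hand.

```python
def loss(theta):
    """Toy base objective with its minimum at theta = (1.0, -0.5)."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def gradient(theta):
    """Hand-written partial derivatives of the toy loss."""
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)]

def true_objective(theta):
    """Sum over parameters of |dL/dtheta_j| (the assumed form)."""
    return sum(abs(g) for g in gradient(theta))

def train(theta0, lr=0.1, steps=200):
    """Run gradient descent, recording both objectives at every step."""
    theta = list(theta0)
    l_trace, g_trace = [loss(theta)], [true_objective(theta)]
    for _ in range(steps):
        g = gradient(theta)
        theta = [p - lr * gj for p, gj in zip(theta, g)]
        l_trace.append(loss(theta))
        g_trace.append(true_objective(theta))
    return theta, l_trace, g_trace

theta_final, l_trace, g_trace = train([3.0, 2.0])
```

On this convex toy problem the two traces fall together, so it can't show the two objectives coming apart. That takes a loss with more than one fixed point.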

## Two examples

Let's now consider two optimizers, **A** and **B**. Optimizers **A** and **B** are identical apart from one difference: Optimizer **A** has its parameters initialized at $\theta_0^A$, while Optimizer **B** has its parameters initialized at $\theta_0^B$.

As luck would have it, this small difference is enough to put $\theta_0^A$ and $\theta_0^B$ into different basins of attraction of the loss function. As a result, our two optimizers end up in different terminal states:

$$\lim_{t \to \infty} \theta_t^A = \theta^{*A} \neq \theta^{*B} = \lim_{t \to \infty} \theta_t^B$$

These two terminal states also correspond (again, by luck in this example) to different values of the base objective. Indeed, it turns out that $\theta_0^A$ is in the basin of attraction of a global minimum of the loss function, while $\theta_0^B$ is in the basin of attraction of a local minimum. As a result, after many training steps, the base objectives of the two optimizers end up converging to different values:

$$\lim_{t \to \infty} L^A(t) < \lim_{t \to \infty} L^B(t)$$

Again, the limit of the loss function $L^A(t)$ is less than the limit of $L^B(t)$ because $\theta^{*A}$ corresponds to a global minimum, while $\theta^{*B}$ only corresponds to a local minimum. So Optimizer **A** is clearly better than Optimizer **B**, from the standpoint of its performance on our base objective: minimization of the loss function.

But crucially, because $\theta^{*A}$ and $\theta^{*B}$ both represent fixed points with zero gradients, the true objectives of the two optimizers both converge to zero in the limit:

$$\lim_{t \to \infty} G^A(t) = \lim_{t \to \infty} G^B(t) = 0$$

In other words, Optimizer **A** and Optimizer **B** are equally good at optimizing for their true objectives. Optimizer **A** just does a better job of optimizing for the base objective we want, as a *side effect* of optimizing for its true objective. Intuitively, we might say that Optimizers **A** and **B** are **equally capable** with respect to their true objectives, while Optimizer **A** is **better aligned** with our base objective than Optimizer **B** is.
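The A-versus-B scenario can be reproduced in miniature. The sketch below assumes a one-parameter "tilted double well" loss, chosen only because it has one global and one local minimum. The two initializations sit on either side of the ridge between the basins, and every number here is illustrative.

```python
def loss(theta):
    """Tilted double well: global minimum on the left, local on the right."""
    return theta**4 / 4 - theta**2 / 2 + 0.1 * theta

def grad(theta):
    """Derivative of the tilted double-well loss."""
    return theta**3 - theta + 0.1

def train(theta0, lr=0.1, steps=5000):
    """Plain gradient descent with a fixed learning rate."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta_a = train(0.05)   # Optimizer A: just left of the ridge
theta_b = train(0.15)   # Optimizer B: just right of the ridge

loss_a, loss_b = loss(theta_a), loss(theta_b)
g_a, g_b = abs(grad(theta_a)), abs(grad(theta_b))
```

Both runs drive the gradient magnitude to (numerically) zero, but `loss_a` ends up lower than `loss_b`: the two are equally successful on the true objective and unequally successful on the base objective.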

Let's look at a second example. This time we'll compare Optimizer **A** to a third optimizer, Optimizer **C**. These two optimizers are again identical, apart from one detail: while Optimizer **A** uses learning rate decay with $\lim_{t \to \infty} \epsilon_t^A = 0$, Optimizer **C** uses a constant learning rate with $\epsilon_t^C = \epsilon^C$.

As a result of its learning rate decay schedule, Optimizer **A** converges on a global minimum in the limit. But Optimizer **C**, with its constant learning rate, doesn't converge the same way. While it's drawn towards the same global minimum as Optimizer **A**, Optimizer **C** ends up orbiting the minimum point chaotically, without ever quite reaching it: its finite learning rate means it never perfectly hits the global minimum point, no matter how many learning steps we give it. As a result,

$$\lim_{t \to \infty} G^C(t) > \lim_{t \to \infty} G^A(t) = 0$$

(To be clear, this is an abuse of notation: in reality $\lim_{t \to \infty} G^C(t)$ generally won't be well-defined for a chaotic orbit like this. But we can think of this instead as denoting the long-term limit of the *average* of $G^C(t)$ over a sufficiently large number of time steps.)

Intuitively, we might say that Optimizer **A** is **more capable** than Optimizer **C**, since it performs better, in the long run, on its true objective.

Optimizer **A** also performs better than Optimizer **C** on our base objective:

$$\lim_{t \to \infty} L^C(t) > \lim_{t \to \infty} L^A(t)$$

And interestingly, Optimizer **A**'s better performance than **C** on our base objective is a direct result of its better performance than **C** on its true objective. So we might say that, in this second scenario, Optimizer **C**'s performance on the base objective is **capability-limited**. If we improved **C**'s capability on its true objective, we could get it to perform better on the base objective, too.
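As a rough illustration of the A-versus-C comparison, the sketch below runs gradient descent on a symmetric double-well loss with two learning rate schedules. The constant rate is deliberately set large enough that the iterates keep overshooting the minimum. In this deterministic toy the resulting orbit may be periodic, not chaotic, but the long-run averages behave the same way. The schedules and constants are assumptions made for the example.

```python
def loss(theta):
    """Symmetric double well with minima at theta = +1 and -1."""
    return theta**4 / 4 - theta**2 / 2

def grad(theta):
    """Derivative of the double-well loss."""
    return theta**3 - theta

def train(theta0, lr_schedule, steps=5000, tail=1000):
    """Return the loss and |gradient| averaged over the last `tail` steps."""
    theta = theta0
    losses, grads = [], []
    for t in range(steps):
        theta -= lr_schedule(t) * grad(theta)
        losses.append(loss(theta))
        grads.append(abs(grad(theta)))
    return sum(losses[-tail:]) / tail, sum(grads[-tail:]) / tail

def decaying(t):
    """Optimizer A: learning rate decays towards zero."""
    return 1.2 / (1.0 + 0.05 * t)

def constant(t):
    """Optimizer C: learning rate stays large forever."""
    return 1.2

loss_a, g_a = train(0.5, decaying)
loss_c, g_c = train(0.5, constant)
```

Averaging over the tail of the run mirrors the "long-term average" reading of the limit: Optimizer **A**'s average gradient magnitude is essentially zero, while Optimizer **C**'s stays well above it, and its average loss is worse as a result.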

## Capability and alignment

With those intuitions in hand, I'll propose the following two definitions.

**Definition 1**. Let $O$ be a base optimizer acting over $T$ optimization steps, and let $L(t)$ represent the value of its base objective at optimization step $t$. Then the **capability** of $O$ with respect to the base objective $L$ is

$$C(L) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( L(t) - L(0) \right)$$

**Definition 2**. Let $O_{\text{base}}$ be a base optimizer with base objective $L$, and $O_{\text{mesa}}$ be a mesa-optimizer with mesa-objective $G$. Then the mesa-optimizer's **alignment** with the base optimizer is given by

$$\alpha(L, G) = \lim_{T \to \infty} \frac{\sum_{t=1}^{T} \left( L(t) - L(0) \right)}{\sum_{t=1}^{T} \left( G(t) - G(0) \right)}$$

If $C(L)$ and $C(G)$ are both finite, we can also write $O_{\text{mesa}}$'s alignment with $O_{\text{base}}$ as

$$\alpha(L, G) = \frac{C(L)}{C(G)}$$

The intuition behind these definitions is that the **capability** of an optimizer is **the amount by which the optimizer is able to improve its objective** over many optimization steps. One way in which a base optimizer can try to improve its base objective is by *delegating* part of its optimization work to a mesa-optimizer, which has its own mesa-objective. The **alignment** factor in Definition 2 is a way of quantifying **how effective that delegation is**: to what extent does the mesa-optimizer's progress in optimizing for its mesa-objective "drag along" the base objective of the base optimizer that created it?

In our gradient descent example, our mesa-optimizer was the **gradient descent process**, and its mesa-objective was what, at the time, I called the "true objective", $G(t)$. But the base optimizer was **the human who designed the neural network** and ran the gradient descent process on it. If we think of this human as being our base optimizer, then we can write the *capability* of our human designer as

$$C(L) = \alpha(L, G) \, C(G)$$

In other words, if a base optimizer delegates its objective to a mesa-optimizer, then that base optimizer's capability is equal to the capability of that mesa-optimizer, times how well-aligned the mesa-optimizer is to the base optimizer's base objective. If you fully delegate a goal to a subordinate, your capability on that goal is the product of 1) how capable your subordinate is at achieving *their own* goals; and 2) how well-aligned their own goals are to the goal you delegated to them. This seems intuitively reasonable.
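For a finite training run, the two definitions can be estimated directly from recorded traces of the base objective and the mesa-objective. The sketch below is one way to do that, under the assumption that capability is the long-run average of each objective's change from its starting value and alignment is the ratio of the two summed changes. The traces come from a hypothetical one-parameter run on the loss $\theta^2$.

```python
def capability(trace):
    """Finite-T estimate: the mean of trace[t] - trace[0] for t = 1..T."""
    T = len(trace) - 1
    return sum(trace[t] - trace[0] for t in range(1, T + 1)) / T

def alignment(base_trace, mesa_trace):
    """Finite-T estimate: ratio of the two summed improvements."""
    T = len(base_trace) - 1
    num = sum(base_trace[t] - base_trace[0] for t in range(1, T + 1))
    den = sum(mesa_trace[t] - mesa_trace[0] for t in range(1, T + 1))
    return num / den

# Hypothetical traces from gradient descent on the loss theta**2.
theta, lr = 3.0, 0.1
l_trace, g_trace = [theta**2], [abs(2 * theta)]
for _ in range(500):
    theta -= lr * 2 * theta
    l_trace.append(theta**2)
    g_trace.append(abs(2 * theta))

c_l = capability(l_trace)             # capability on the base objective
c_g = capability(g_trace)             # capability on the mesa-objective
alpha = alignment(l_trace, g_trace)   # close to 1.5 for this run
```

With this sign convention, an objective that falls over training gives a negative capability, and the product `alpha * c_g` recovers `c_l` up to floating-point error.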

But it also has a curiously unintuitive consequence in gradient descent. We tend to think that when we add neurons to an architecture, we're systematically increasing the capability of gradient descent on that architecture. But the definitions above suggest a different interpretation: because gradient descent might converge equally well on its true objective on a big neural net as on a small one, its capability as an optimizer isn't systematically increased by adding neurons. Instead, adding neurons improves the degree to which gradient descent converges on a base objective that's *aligned* with our goals.

## Robustness and generalization

As I've defined them above, capability and alignment are fragile properties. Two optimizers $O_1$ and $O_2$ could be nearly identical, but still have very different capabilities $C(L_1)$ and $C(L_2)$. This is a problem, because the optimizers in our definitions are specified up to and including things like their datasets and parameter initializations. So something as minor as a slight change in dataset, which we should expect to happen often to real-world optimizers, could cause a big change in the capability of the optimizer, as we've defined it.

We care a lot about whether an optimizer remains capable when we perturb it in various ways, including running it on different datasets. We also care a lot about whether an optimizer with objective $L$ remains capable when we change its objective to something slightly different like $L'$. And we also care to what extent the *alignment* between two optimizers is preserved when we perturb either optimizer. Below I'll define two properties that describe the degree to which optimizers retain their capability and alignment properties under perturbations.

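**Definition 3**. Let $C(L_1)$ be the capability of optimizer $O_1$, and let $\alpha(L_1, L_2)$ be the alignment of optimizer $O_2$ with optimizer $O_1$. Let $\delta O_1$ and $\delta O_2$ be finite perturbations applied respectively to $O_1$ and $O_2$. Then, the capability of $O_1$ is **robust under perturbation** $\delta O_1$ if

$$C(L_1) \big|_{O_1 + \delta O_1} \approx C(L_1) \big|_{O_1}$$

Similarly, the alignment of $O_2$ with $O_1$ is **robust under perturbations** $\delta O_1$ and $\delta O_2$ if

$$\alpha(L_1, L_2) \big|_{O_1 + \delta O_1,\; O_2 + \delta O_2} \approx \alpha(L_1, L_2) \big|_{O_1,\; O_2}$$

**Definition 4**. Let $O_1$ be an optimizer with objective function $L_1$, and let $O_2$ be an optimizer with objective function $L_2$. Let $\delta O_1$ be a finite perturbation applied to $O_1$, such that the optimizer $O_1 + \delta O_1$ differs from $O_1$ only in that its objective function is $L_1'$ instead of $L_1$. Then, the capability of $O_1$ **generalizes to objective** $L_1'$ if

$$C(L_1') \approx C(L_1)$$

Similarly, the alignment of $O_2$ with $O_1$ **generalizes to objective** $L_1'$ if

$$\alpha(L_1', L_2) \approx \alpha(L_1, L_2)$$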

Intuitively, we're defining a robustly capable optimizer as one whose capability isn't strongly affected by classes of perturbations that we care about — and we're defining robust alignment between two optimizers in an analogous way. We're also thinking of generalization as a special case of robustness, meaning specifically that the optimizer is robust to perturbation to its objective function. So an optimizer whose capabilities generalize is one that continues to work well when we give it a new objective.
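One way to probe robustness empirically is to perturb an optimizer and check whether its estimated capability moves by more than some tolerance. The sketch below does this for perturbations of the initial parameter on a tilted double-well toy loss. The tolerance, the finite-horizon capability estimate, and the `is_robust` helper are all hypothetical choices made for illustration.

```python
def loss(theta):
    """Tilted double well: global minimum on the left, local on the right."""
    return theta**4 / 4 - theta**2 / 2 + 0.1 * theta

def grad(theta):
    """Derivative of the tilted double-well loss."""
    return theta**3 - theta + 0.1

def capability(theta0, lr=0.1, steps=2000):
    """Finite-T capability estimate: mean of L(t) - L(0) over t = 1..T."""
    theta, l0, total = theta0, loss(theta0), 0.0
    for _ in range(steps):
        theta -= lr * grad(theta)
        total += loss(theta) - l0
    return total / steps

def is_robust(theta0, delta, tol=0.05):
    """Hypothetical check: is capability nearly unchanged by the perturbation?"""
    return abs(capability(theta0 + delta) - capability(theta0)) <= tol

far_from_ridge = is_robust(-0.8, 0.02)   # both starts reach the same minimum
near_ridge = is_robust(0.09, 0.02)       # the perturbation crosses basins
```

The same small perturbation leaves capability nearly unchanged far from the ridge between the two basins, and changes it substantially when the perturbed start lands in the other basin.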

## Quantifying inner alignment

With the vocabulary above, we can now define inner alignment more precisely, and even think about how to quantify it in real systems. We might say that a mesa-optimizer is inner-aligned with its base optimizer if **its alignment factor $\alpha$ remains robustly high under variations $\delta D$ in the datasets** that we expect either optimizer to encounter in the future. We can also quantify inner alignment by looking at how much specific variations in the data distribution affect the alignment factor between two optimizers.

We might also be interested in investigating other properties that could affect inner alignment from a safety perspective. For example, under what conditions will alignment between a base optimizer and a mesa-optimizer **generalize** well to a new base objective? What kinds of perturbations to our optimizers are likely to yield breakdowns in robustness? As we add capacity to a deep learning model, should we expect alignment to improve? And if so, should we expect an inflection point in this improvement, a level of capacity beyond which alignment declines sharply? How could we detect and characterize an inflection point like this? These are some of the topics I'll be exploring in the future.

## Terminal states and transients

I want to highlight one final issue with the definitions above: I've defined inner alignment here only in connection with the limiting behavior of our optimizers. That means a mesa-optimizer that's well-aligned with its base optimizer would still — by the definition above — be free to do dangerous things *on the path* to correctly optimizing for the base objective.

To take an extreme example, we could have a system that's perfectly aligned to optimize for human happiness, but that only discovers that humans don't want to have their brains surgically extracted from their bodies after it's already done so. Even if the system later corrected its error, grew us new bodies, and ultimately gave us a good end state, we'd still have experienced a very unpleasant transient in the process. Essentially, this definition of alignment says to the mesa-optimizer: it's okay if you break a vase, as long as we know that you'll put it back together again in the long run.

I can understand this definition being controversial. It may be the most extreme possible version of the claim that the ends justify the means. So it could also be worth resolving the alignment problem into "weak" and "strong" versions, where **weak alignment** would refer to the limit $T \to \infty$, while **strong alignment** would refer to transient behavior over, say, the next $T$ optimization steps. A concept of strong alignment could let us prove statements like "this optimizer will have a performance level of at worst $x$ on our base objective over the next $T$ optimization steps." This seems very desirable.
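The difference between the two notions is easy to see on a recorded loss trace. The sketch below uses a made-up trace and two hypothetical helper functions: one looks only at where the run ends up, the other at the worst value along the way.

```python
def weak_performance(trace):
    """Limiting behavior, approximated by the final recorded value."""
    return trace[-1]

def worst_case(trace, start, horizon):
    """Worst (highest) loss over the `horizon` steps that follow `start`."""
    return max(trace[start + 1 : start + horizon + 1])

# Hypothetical loss trace: a bad transient spike, then a good end state.
trace = [1.0, 0.8, 5.0, 0.6, 0.3, 0.1, 0.05, 0.05]

end_state = weak_performance(trace)                  # 0.05
transient = worst_case(trace, start=0, horizon=4)    # 5.0
```

A guarantee of the weak kind would constrain only `end_state`. A guarantee of the strong kind would bound `transient` as well.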

On the other hand, we may want to prepare for the possibility that the terminal states we want will *only* be accessible through paths that involve transient unpleasantness. Perhaps one really does have to break eggs to make an omelet, and that's just how the universe is. (I don't think this is particularly likely: high-capacity neural networks and policy iteration in RL are both data points that suggest incrementalism is increasingly viable in higher-dimensional problem spaces.)

To summarize, weak alignment, which is what this post is mostly about, would say that "everything will be all right in the end." Strong alignment, which refers to the transient, would say that "everything will be all right in the end, and the journey there will be all right, too." It's not clear which one will be easier to prove than the other in which circumstance, so we'll probably need to develop rigorous definitions of both.

*Big thanks to Rohin Shah, Jan Leike, Jeremie Harris, and Evan Hubinger for reviewing early drafts of this, suggesting ideas, and pointing out mistakes!*

Planned summary for the Alignment Newsletter:

Great post! I liked the clean analysis of the problem, the formalization, and the effort to point out the potential issues with your definitions. Now I'm really excited for the next posts, where I assume that you will study robustness and generalization (based on your definitions) for simple examples of gradient descent. I'm interested in commenting on early drafts if you need feedback!

I agree wholeheartedly with this characterization. For me, that's the gist of the inner alignment problem if the objective is the right one (i.e. if outer alignment is solved).

Typo on "ththird".

At first I wondered why you were taking the sum instead of just $C(L) = \lim_{T \to \infty} \frac{L(T) - L(0)}{T}$, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in $T$ at some point. That might be interesting to put in the post itself.

This is not where I thought you were going when I read the intro, but that's a brilliant idea that removes completely the question of whether and why the base optimizer would find a mesa-optimizer to which it can delegate work.

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like $\lim_{T \to \infty} L(T) = L_\infty$, then you'd get $\lim_{T \to \infty} \frac{L(T) - L(0)}{T} = \lim_{T \to \infty} \frac{L_\infty - L(0)}{T} = 0$ for any $L_\infty$.

Thanks again!

Interesting post. Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Thanks for the comment!

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)

Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.

Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!