The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is being pursued. At present, I refrain from attempting to propose my own adjustments to the framework, or to build on the existing literature or to develop my own theory. I am certainly not against doing any of those things, but they are things to possibly be pursued later; none of them is the purpose of this post.

To make my main point, I will introduce only a bare minimum of mathematical notation. We will show that a mesa-objective always has a different type signature to a base objective and that the default assumption ought to be that there is no way to compare them in general and certainly no general way to interpret what it means for them to ‘agree’. Suppose that an optimizer is searching through a space $S$ of systems. At this time, I do not want to attempt to unpack what it means to 'search', but, naively, we can imagine that there is an objective function $f: S \to \mathbb{R}$, which determines something that we might call the 'search criterion'. The idea of course is that the optimizer is a system that is 'searching' through the set $S$ and judging different points according to the criterion that higher values of $f$ are better.

In the background, there is some 'task' and naively we can think of this as being represented by a 'task space' $X$ which consists of all of the different possible 'presentations' or 'instances' of the task. For example, perhaps the task is choosing the next move in a game of Go or the next action in a real-time strategy video game. In these examples, a given $x \in X$ would represent a board position in Go, say, or a single snapshot of the game-state in the video game. Then, in general, given $s \in S$ and $x \in X$, we can think that $s(x)$ is the output of $s$ on the task instance $x$ or the action taken by $s$ when presented with $x$ (i.e. $s(x)$ denotes the next board move in Go or the next action to be taken in the video game). So each element of $S$ defines a map from the task space $X$ to some kind of output space or space of possible actions, which we need not notate.

Now, it is possible that there exists $s \in S$ which works in the following way: Whenever the output of $s$ on an instance $x$ of the task needs to be evaluated, i.e. whenever $s(x)$ is computed, what happens is that $s$ searches over another search space $\Sigma$ and looks for elements that score highly according to some other objective function $g: \Sigma \to \mathbb{R}$. Whenever this is the case, we say that such an $s$ is a mesa-optimizer and that the original optimizer - the one that searches over $S$ - is the base optimizer. Notice that in some way, elements of $\Sigma$ must in turn correspond to outputs/actions, because given some $x \in X$, the mesa-optimizer $s$ conducts a search over $\Sigma$ to determine what output $s(x)$ is, but that is all just part of the internal workings of $s$ and we need not 'know' or notate how this correspondence works.
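To make the setup above concrete, here is a minimal Python sketch. Everything in it is a toy assumption rather than anything taken from Risks From Learned Optimization: tasks, outputs and internal candidates are small integers, the names `base_objective`, `make_mesa_optimizer`, `base_search` and `decode` are hypothetical, and the mesa-objective is allowed to depend on the task instance. The only aim is to show one system in the base search space that computes its output by running its own internal search.

```python
from typing import Callable, Iterable, List

Task = int                          # an element of the task space
Output = int                        # an element of the output space
System = Callable[[Task], Output]   # an element of the base search space
Candidate = int                     # an element of the mesa search space

TASKS: List[Task] = [0, 1, 2, 3]
CANDIDATES: List[Candidate] = list(range(-5, 6))


def base_objective(system: System) -> float:
    """Scores a whole system (higher is better). Toy target: output 2 * x."""
    return -float(sum(abs(system(x) - 2 * x) for x in TASKS))


def make_mesa_optimizer(
    mesa_objective: Callable[[Task, Candidate], float],
    decode: Callable[[Candidate], Output],
) -> System:
    """Builds a system whose output on x is found by an internal search."""
    def system(x: Task) -> Output:
        # Internal search: pick the candidate scoring highest on the mesa-objective.
        best = max(CANDIDATES, key=lambda c: mesa_objective(x, c))
        # How candidates correspond to outputs is internal to the system.
        return decode(best)
    return system


def base_search(systems: Iterable[System]) -> System:
    """The base optimizer: pick the system with the highest base objective."""
    return max(systems, key=base_objective)


def constant_zero(x: Task) -> Output:
    return 0


def identity(x: Task) -> Output:
    return x


# Toy mesa-objective: prefer the candidate closest to x; decode doubles it.
mesa = make_mesa_optimizer(lambda x, c: -abs(c - x), decode=lambda c: 2 * c)

selected = base_search([constant_zero, identity, mesa])
```

In this sketch `base_objective` never sees the mesa-objective or `CANDIDATES`; it only sees the input-output behaviour of each finished system.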

In Risks From Learned Optimization, Hubinger et al. write:

In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs.

So: The mesa-objective is the criterion that the mesa-optimizer $s$ is using in its internal search: It expresses the idea that higher values of the mesa-objective function $g$ are better. And the base objective refers to the criterion that higher values of the base objective function $f$ are better.

Inner Alignment, Robust Alignment, and Pseudo Alignment

The domain of the base objective function $f$ is the space $S$ - the space of systems that the base optimizer is searching over (and which can be represented mathematically as a space of functions, each of which is from the task space $X$ to the output or 'action' space). On the other hand, the domain of the mesa-objective function $g$ is the mesa-optimizer's own search space $\Sigma$. As mentioned above, we might want to think of $\Sigma$ as corresponding to (a subset of) the output space, but either way, a priori, there is nothing to suggest that $S$ and $\Sigma$ are not different spaces. The two objective functions used as criteria in these searches have different domains and it is not clear how to compare them.
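The mismatch of domains can be written down as a pair of type aliases. The sketch below is illustrative only: the aliases and both toy objectives are assumptions made for this example. It shows that the base objective accepts a whole system (a function from tasks to outputs), that the mesa-objective accepts a single internal candidate, and that handing one objective the other's kind of input produces a type error rather than a 'disagreement'.

```python
from typing import Callable

Task = int
Output = int
System = Callable[[Task], Output]   # domain of the base objective
Candidate = int                     # domain of the mesa-objective

BaseObjective = Callable[[System], float]
MesaObjective = Callable[[Candidate], float]


def base_objective(system: System) -> float:
    # Evaluates a system by running it on some task instances.
    return float(sum(system(x) for x in (1, 2, 3)))


def mesa_objective(candidate: Candidate) -> float:
    # Evaluates one internal candidate; no system is involved.
    return -float(abs(candidate - 4))


def try_base_on_candidate(candidate: Candidate) -> str:
    """Applies the base objective to a mesa-level candidate and reports what happens."""
    try:
        base_objective(candidate)  # type: ignore[arg-type]
    except TypeError:
        # An int is not callable, so it cannot be run on task instances.
        return "TypeError"
    return "ok"


result = try_base_on_candidate(4)
```

Here `mesa_objective(4)` is perfectly well defined, while `base_objective(4)` is not even meaningful: there is no input on which both functions can be evaluated and their values compared.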

In Risks From Learned Optimization, it is written that "The problem posed by misaligned mesa-optimizers is... the gap between the base objective and the mesa-objective... We will call the problem of eliminating the base-mesa objective gap the inner alignment problem...". I think that they are absolutely right to point to the difference between the base objective and a mesa-objective as being the source of an important issue, but I find referring to it as a "gap", at least at the level of generality posited, to be somewhat misleading. We are not dealing with two objects that are in principle comparable but just so happen to be separated by a gap (a gap waiting to be narrowed by the correct clever idea, say). Instead, the difference, which is due to the different type signatures of the objective functions, is essential in character and rather means that they are, in general, incomparable. 

Consider the definitions of robust alignment and pseudo alignment:

We will use the term robustly aligned to refer to mesa-optimizers with mesa-objectives that robustly agree with the base objective across distributions and the term pseudo-aligned to refer to mesa-optimizers with mesa-objectives that agree with the base objective on past training data, but not robustly across possible future data (either in testing, deployment, or further training).

What might it possibly mean to have mesa-optimizers with mesa-objectives that "agree" with the base objective on past training data or "across distributions"? Again, the base objective refers to a criterion used to select between different systems. How can a mesa-objective, a criterion that a particular one of these systems uses to select between different actions, 'agree' or 'disagree' with it on any particular set of data or "across distributions"? Without further development of the framework, or further explanation, it's impossible to know precisely what this could mean. Robust alignment seems at best to be a very odd, extreme case (where somehow we have ended up with something like $\Sigma = S$ and/or $g = f$, i.e. the two search spaces and/or the two objective functions coinciding?) and at worst simply impossible.

Later, Hubinger attempts to clarify the terminology in a separate post: Clarifying Inner Alignment Terminology. This attempt at clarification and increased rigour should obviously be encouraged, but it is immediately clear that some of the main definitions are still unsatisfactory. The last of the seven definitions is that of Inner Alignment:

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

This version of the definition seems to turn crucially on the notion that a policy could be "impact aligned" with the base objective. Let us turn to Hubinger's own definition of "Impact Alignment", from the same post, to find out what this means precisely: 

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

It seems that we are only told what impact alignment means in the context of an agent and humanity. So we are still missing what seems to be the very core of this edifice: What does it really mean for a mesa-optimizer to be - in whatever is the appropriate sense of the word - 'aligned'? What could it mean for a mesa-objective to 'agree with' the base objective?

Internalization of the base objective

In the Deceptive Alignment post, the idea of “Internalization of the base objective” is introduced. Arguably this is the point at which one might expect the issues I have raised to be most fleshed out, because in highlighting the possibility of “internalization” of the base objective, i.e. that it is possible for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned,”  there is an implicit claim that robust alignment really can occur. So to understand this phenomenon, we might look for an explanation as to how this occurs. But the ensuing analysis is somewhat weak and vague, to the point that it is almost just a restatement of the claim that it purports to explain: 

information about the base objective flows into the learned algorithm via the optimization performed by the base optimizer—the base objective is built into the mesa-optimizer as it is adapted by the base optimizer.

I could try to give my own interpretation of what happens when information about the base objective “flows into” the learned algorithm “via” the optimization process, but I would be making something up that does not appear in the text. And what follows is really just a discussion of some possible ways by which the mesa-optimizer comes to be able to use information about the base objective (it could get information about the base objective directly via the base optimizer or it could get it from the task inputs). None of it goes towards alleviating the specific concerns laid out above and none of it really explains with any conviction how true “internalization” happens. Moreover, a footnote admits that in fact the two routes by which the mesa-optimizer may come to be able to use information about the base objective do not even neatly correspond to the dichotomy given by ‘internalization of the base objective’ vs. ‘modelling of the base objective’.


My observations here run counter to any argument which suggests it is possible to 'close the gap' between the base and mesa objectives. As stated above, this suggests that the inner alignment problem has an essential nature: I claim that whenever mesa-optimization occurs, one needs to accept that internally, there is pressure towards a goal which one did not explicitly design.

Of course a close reading of what has been said here really only shows that we cannot rely on the specific formalization I have used (though it may be no more than a few mathematical functions) while still maintaining the exact theoretical framework described in Risks From Learned Optimization. Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment. Or, it may be the case that the fault is with my formalization and that what I claim are conceptual issues are little more than notational or mathematical curiosities. If the latter is indeed the case, then at the very least, we need to be explicit about whatever tacit assumptions have been made that imply that formalization along the lines I have outlined cannot provide a permissible analysis. For example, I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them. Perhaps those who object to my view will claim that this is what is really going on in people's minds and it's just that it has not been spelled out. However, at present, I believe that at the level of generality that Risks From Learned Optimization strives for, we simply cannot speak of mesa-objectives ‘agreeing with’ or even really of being ‘adjusted towards’ base objectives.


As a final set of remarks, I wanted to briefly discuss the general attitude I have taken here. One might read this and think either that yes, this all seems reasonable, but since it is not about addressing the alignment problem at all, what was the point? Or perhaps one might think that it could all be avoided if only I were to make a more charitable reading of the Risks From Learned Optimization posts in the first place. Am I acting in bad faith?... Surely I "get what they mean"? Indeed, often I do feel like I can see or could guess what the authors are getting at. Why then, have I gone out of my way to take them at their word to such a great extent, just so I can point out inconsistencies? 

I want to end by describing some general, if somewhat vague and half-baked, thoughts about this kind of theoretical/conceptual AI Alignment work, and hopefully this will help to answer the above questions. In my humble opinion, one of the things that this type of work ought to be 'worried about' is that it exists in a kind of no-man's land between, on the one hand, more traditional academic work in fields like computer science and philosophy and, on the other hand, more 'mainstream' ML Safety, shall we say. For a while I have been wondering whether or not this kind of theoretical alignment work is doomed to remain in this no-man's land, propped up by a few success stories but mostly fed by a steady stream of informal arguments, futurological speculation, and 'hand-waving' on blogs and comment sections. I of course do not fully know, but here are a couple of things that have come to mind when trying to think about this:

Firstly, when we have neither the luxury of mathematical proof nor the crutch of being backed up by working code and empirical results, it is even more important to subject arguments to a high level of scrutiny. There should be (and hopefully can be) a high bar of intellectual and academic rigour for theoretical/conceptual work in this area. It needs to strive to be as clean and clear as possible. And it's worth saying that one reason for this is so that it stands on its own two feet, so to speak, when interrogated outside of communities like this one.

Secondly, I feel it is important that the best arguments and ideas we have - and good critiques of them - appear 'in the literature'. I certainly don't advocate for a completely traditional model of dissemination and publication (there are many advantages to the Alignment Forum and the prevailing rationalist/EA/longtermist ecosystem and their ways of doing things) and of course many great ideas start out as hand-waving and speculation. But it will ultimately not be enough that some idea is 'generally known' in the online/EA alignment communities, or can be put together by combing through comment sections and the minds of the relevant people, if said idea is never really, or cannot be, 'written up' in a truly convincing way.

As I've said, these remarks are not fully fleshed out and further discussion here doesn't really seem appropriate. For now, the idea was to explain some of my motivation for taking time to post something like this. All discussion and comments are welcome.





5 comments

If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to their distant consequences).


Am I acting in bad faith?... Surely I "get what they mean"?

I'm certainly glad to see people suspending their sense of "getting it" when it comes to reference (aka pointers, aka representation) since I don't think we have solid foundations for these topics and I think they are core issues in AI alignment.

Thank you for writing this post. I think the issue goes even deeper. The reward function doesn't even have the type signature of "objective."

Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:

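Let $X$ be the task space, $Y$ the output space, $S$ the model space, $f: S \to \mathbb{R}$ our base objective, and $g_x: \Sigma \to \mathbb{R}$ the mesa objective of the model for input $x \in X$. Assume that there exists some map $h: \Sigma \to Y$ mapping internal objects to outputs by the model, such that $s(x) = h(\arg\max_{\sigma \in \Sigma} g_x(\sigma))$.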

Given this setup, how can we reconcile  and ? Assume some distribution  over the task space is given. Moreover, assume there exists a function  mapping tasks to utility functions over outputs, such that . Then we could define a mesa objective as  where  if  and otherwise we define  as some very small number or  (and replace  by  above). We can then compare  and  directly via some distance on the spaces  and .

Why would such a function from tasks to utility functions exist? In stochastic gradient descent, for instance, we are in fact evaluating models based on the outputs they produce on tasks distributed according to some distribution over the task space. Moreover, such a function should probably exist given some regularity conditions imposed on an arbitrary base objective $f$ (inspired by the axioms of expected utility theory).

Why would a function $h$ from the internal search space to outputs exist? Some function connecting outputs to the internal search space has to exist because the model is producing outputs. In practice, the model might not optimize its mesa objective $g_x$ perfectly and thus might not always choose the argmax (potentially leading to suboptimality alignment), but this could probably still be accounted for somehow in this model. Moreover, $h$ could theoretically differ between different inputs, but again one could probably change definitions in some way to make things work.

If the model $s$ is a mesa-optimizer, then there should probably be some way to make sense of the mathematical objects describing mesa objective, search space, and model outputs as described above. Of course, how to do this exactly, especially for more general mesa-optimizers that only optimize objectives approximately, etc., still needs to be worked out more.

These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.

Nice post.

Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment.

Have you seen Hubinger's more recent post, More variations on pseudo-alignment? It amends the list of pseudo-alignment types originally given in "Risks from Learned Optimization" to include a couple more.

Your claim above that the best we could hope for may be a form of proxy alignment or approximate alignment reminds me of the following pseudo-alignment type he introduced in that more recent post. In the description of this type, he also seems to agree with you that robust alignment is very difficult or "unstable" (though perhaps you go further in saying it's impossible):

Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." We mostly just talked about this as a form of robust alignment—however, as I note in "Towards a mechanistic understanding of corrigibility," this is a very unstable operation, requiring you to get your pointer just right. Thus, I think it's better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). In particular, I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (which is a point we made in the paper, but one which I think is made much clearer with this distinction).