Comparing Four Approaches to Inner Alignment

Lucas Teixeira

Early work on this was supported by CEEALAR and was finished during an internship at Conjecture under the mentorship of Adam Shimi.

Thank you to Evan Hubinger and Rohin Shah for answering some questions related to this post.

Epistemic Status: First palimpsest of many to come.

Intro and Motivation

I am broadly interested in pluralism in scientific development, and more practically interested in supporting the diversification of conceptual alignment.

Towards that end, I believe that having a systematic analysis of the cruxes and methodologies which split up the field into distinct agendas could help clarify how exactly diversity is created and sustained in our field, and what exactly it is we wish to diversify in order to better manage our collective portfolio of research bets.

As a case study, this post will investigate four different approaches to inner alignment. I’ll be taking a look at the different definitions which they use for “outer alignment” and conjecturing on:

- How, despite inconsistencies across approaches, the definitions utilized by each approach remain coherent when understood against the backdrop of the aims local to that approach.
- Which cruxes set these different approaches apart.

This post is distillational in nature, and as such most of the ideas which I present here are not novel and not my own. The claims in this post should also be read as a part of an ongoing exploratory process, and any pushback (especially from those whose work I cite) would be beneficial.

The Approaches

The Mechanistic Approach

The mechanistic approach deconfused a lot of previous work on optimization daemons and provided the now canonical argument for why we’d expect mesa-optimizers and inner misalignment, which is summarized as follows:

Traditionally, most of the optimization pressure (in other words, search) is applied at training time, in looking for a specific model which performs competently on a task. However, as ML becomes more and more powerful, models will be trained on more complicated tasks across a wider range of environments. This puts more pressure on a model to generalize, and raises the incentive for the model to delegate some of its optimization power to deployment time rather than training time. It is in those cases that we get a mesa-optimizer, and it is with the emergence of mesa-optimizers that issues of misalignment between the mesa-optimizer’s objectives and the objectives which we were training for arise.

The mechanistic approach has also produced different evaluations, and methods of evaluation, of alignment proposals. This approach is defined by an aim towards conceptual clarity: not only in its employment of mechanistic definitions in both proposed solutions and problem statements, but also in their continued refinement.

The Empiricist Approach

In contrast to the mechanistic approach, the empiricist approach’s strategy is mostly focused on developing knowledge of inner alignment by running empirical experiments which study the phenomenon. Tightly coupled to this strategy is a preference for definitions which can be operationalized empirically over definitions which are mechanistic. Its proponents will often allude to the intentional stance as an influence and cite 2-D Robustness as a core frame for the alignment problem. They favor terms such as objective robustness and generalization over mesa-optimizers and inner alignment.

A redrafting of the classical argument for why we’d see inner misalignment, in their language, is as follows: Under certain non-diversified environments, a set of actions may be coherent with the pursuit of more than one goal; call them G_1 and G_2, where G_2 refers to our intended goal. If we train in those environments, and we only deploy to those types of environments, then all is good, since the pursuit of either goal is indistinguishable from the pursuit of the other. However, if we deploy in an environment which is more complex, the statistical relationship between G_1 and G_2 might not hold, and we might encounter a situation where our model continues to competently pursue G_1 while failing to pursue G_2. This malign failure is worse than a benign failure, since in the case where G_1 is a net negative (or worse, an existential catastrophe) the model actually carries that net negative out into the world rather than just fizzling into what seems to be more or less random behavior.
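The argument above can be illustrated with a minimal toy sketch. Everything in it is a hypothetical setup chosen for illustration rather than an experiment from the literature: a one-dimensional track, a hard-coded "always step right" behavior standing in for a learned policy, G_1 as "reach the right wall", and G_2 as "collect the coin".

```python
TRACK_LENGTH = 10  # hypothetical 1-D track with cells 0..9
START = 5          # the agent always starts mid-track


def run_episode(coin_position: int) -> dict:
    """Roll out a policy that always steps right and report which goals it met."""
    position = START
    visited = {position}
    while position < TRACK_LENGTH - 1:
        position += 1  # stand-in for the learned behavior: competently pursue G_1
        visited.add(position)
    return {
        "G_1": position == TRACK_LENGTH - 1,  # reached the right wall
        "G_2": coin_position in visited,      # collected the coin (intended goal)
    }


def success_rates(coin_positions: list) -> tuple:
    """Fraction of episodes in which G_1 and G_2 were achieved, respectively."""
    results = [run_episode(c) for c in coin_positions]
    n = len(results)
    g1 = sum(r["G_1"] for r in results) / n
    g2 = sum(r["G_2"] for r in results) / n
    return g1, g2


# Training: the coin always sits at the right wall, so G_1 and G_2 coincide.
train_rates = success_rates([TRACK_LENGTH - 1] * 100)

# Deployment: the coin may be in any cell, so the two goals come apart.
deploy_rates = success_rates(list(range(TRACK_LENGTH)))

print("train  (G_1, G_2):", train_rates)
print("deploy (G_1, G_2):", deploy_rates)
```

On the training distribution nothing in the behavior distinguishes the two goals. On the wider deployment distribution the policy still reaches the wall every time, but it only collects the coin when the coin happens to lie on its path, which is the "competent pursuit of G_1 while failing G_2" pattern described above.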

The True Names Approach

To date, the work which has occurred under this approach is somewhat idiosyncratic to John’s take on alignment. It is defined by the assertion that the field of alignment does not yet have a sufficient level of clarity in its terms and definitions to pose well-formed questions which would solve alignment. Naturally, it sees most alignment problems as downstream from deconfusing agency and developing a theory of existing agents. The implications of this position for inner alignment are best characterized by the arguments John has made against the definitions of outer alignment employed by the mechanistic and empiricist approaches, whose major claims are the following:

1. If the objective of your model fails to generalize to all cases, then you have not picked the right “training objective” for your model.
2. The only time when you have picked the right training objective is when the global minimum of your loss function corresponds to your actual desires and wishes across all possible distributions.
3. If we are outer aligned, then the only times where we can have an inner alignment failure are when we have failed to fully optimize the outer objective.
4. The failure described in the third claim only happens in practice because we utilize imperfect search procedures to train neural nets.

The Environmental Stakes Approach

The frames utilized in this approach were established in Paul’s post on Low-Stakes Alignment, where he introduces the concept of low-stakes alignment as an alternative to the outer alignment definition held by the mechanistic approach. A key characteristic which sets this approach apart is a commitment to clean definitions which can be employed in Paul’s worst-case-guarantee algorithmic research methodology, which can be understood as this series of steps:

1. Create a story where the best existing alignment techniques fail to prevent doom.
2. Strip that story down to its simplest moving parts so that we have a sufficient condition for doom.
3. Design some algorithm which prevents doom specifically in that case.
4. Repeat steps 1-3.
5. Unify the different algorithms you’ve produced into one.

Relatedly, this approach tends to be pessimistic towards methodologies which seek to open-endedly create theories of agency which go against this constraint.

The Cruxes

Non-object level normative questions aside, I’ve identified two cruxes discussed in the literature which seem upstream of the significant splits of practice between the approaches: realism about rationality, and whether agent behavior implies agent structure. Below are the questions laid out in detail, and some of the positions which can be associated with each question.

Realism about Rationality

How precise can a theory of rational agency be? How many layers of abstraction or steps of indirection can it scale to? What kind of knowledge can a theory of rationality generate? What kinds of actions does this allow us to perform?

1. Precise enough that we can construct well-formed questions and aid general deconfusion, but not much more. (I put people like Rohin, who encourage certain kinds of conceptual work but on the whole aren’t invested in it, in this camp.)
2. Precise enough that we’ll be able to construct a specific method for AGI which will have good safety guarantees. (I put most MIRI people in this camp.)
3. Precise enough that any AGI which is built, regardless of method, can be aligned. (This is John’s take, and this position motivates his preoccupation with constructing a descriptive theory of existing agents rather than the typical theory of ideal agents pursued by early MIRI.)

Does Agent Behavior Imply Agent Structure

(Note that these positions aren’t along a singular axis. I read this as mostly a consequence of the fact that this question isn’t well posed, and it should be understood as an open theory problem.)

The tuple following each approach’s name signifies which position it takes on realism about rationality and on whether agent behavior implies agent structure, respectively:

Mechanistic (2, (1 or 2))
Empiricist (1, 3)
True Names (3, 1)
Environmental Stakes (1, currently unsure)
- Note that the environmental stakes approach is differentiated from the others primarily on methodological grounds.

Outer Alignment Definitions: Expressivity and Coherence

The Mechanistic Approach

Definitions

Expressivity

With the ontology proposed in RFLO and the definitions provided here, we are able to distinguish, in natural language conversations, between the different problems of outer and inner alignment. These definitions also allow us to assess and evaluate different alignment proposals. However, neither of them can be operationalized when doing empirical experiments on inner alignment. The first definition would require us to be able to robustly locate the learned objective inside of a model, which is currently empirically infeasible. The second definition would require us to know the behavior of a model in the limit of infinite data and perfect training, which is infeasible in principle. Thus, neither definition allows us to empirically verify whether or not something is outer aligned, and as a consequence we can’t make the relevant distinctions between outer and inner alignment empirically.

The Empiricist Approach

Definitions

“a model is outer aligned if it performs desirably on the training distribution”
“was any bad feedback provided on the actual training data?”^[1]

Expressivity

These behavioral definitions allow us to do both empirical experiments and diagnostics. However, they are silent when it comes to explaining why a particular model is either benign or malign; they can only say that it is. This makes solving problems such as deception more difficult. On a more general note, it is still unclear to me how well empirical results and guarantees will generalize to the case of superintelligences.

The True Names Approach

Definitions

“If there is any system which performs well in the training environment but not in the deployment environment, then that’s an outer alignment failure.”

Expressivity

This take doesn't allow us to express much about the distinctions between the problems and solutions of inner and outer alignment. I view this as a consequence of the commitment to the idea that most alignment problems are downstream from agency deconfusion in the broadest sense, covering both humans (the thing which we’re optimizing for; outer alignment) and AIs (their training and development; inner alignment).

The Environmental Stakes Approach

Definitions

“A situation is low-stakes if we care very little about any small number of decisions. That is, we only care about the average behavior of the system over long periods of time (much longer than the amount of time it takes us to collect additional data and retrain the system).”
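One way to make this definition concrete is a back-of-the-envelope bound. This is an illustrative reading, not a formalization given in the source: assume each decision's utility lies within a bounded range, so that changing any `num_bad` of `num_decisions` decisions can move the average utility by at most `num_bad * utility_range / num_decisions`. The function names and the `tolerance` threshold below are hypothetical.

```python
def max_average_shift(num_decisions: int, num_bad: int, utility_range: float) -> float:
    """Upper bound on how far num_bad arbitrary decisions can move the average utility.

    Assumes (for illustration only) that every decision's utility lies in an
    interval of width utility_range.
    """
    if num_decisions <= 0:
        raise ValueError("num_decisions must be positive")
    if not 0 <= num_bad <= num_decisions:
        raise ValueError("num_bad must be between 0 and num_decisions")
    return num_bad * utility_range / num_decisions


def is_low_stakes(num_decisions: int, num_bad: int,
                  utility_range: float, tolerance: float) -> bool:
    """Treat a setting as low-stakes if a few bad decisions barely move the average."""
    return max_average_shift(num_decisions, num_bad, utility_range) <= tolerance


# Bounded per-decision impact: ten bad decisions out of a million are negligible.
bounded = is_low_stakes(1_000_000, 10, utility_range=1.0, tolerance=0.001)

# A single decision can be catastrophic: the same ten decisions now dominate.
catastrophic = is_low_stakes(1_000_000, 10, utility_range=1e9, tolerance=0.001)

print("bounded impact is low-stakes:     ", bounded)
print("catastrophic impact is low-stakes:", catastrophic)
```

Under these assumptions, the first setting counts as low-stakes and the second does not, since there a handful of decisions can outweigh everything else the system does.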

Expressivity

Since the problem statement described by the definition is completely external to the algorithm which you’re trying to engineer, we can do pencil and paper theorizing on it without running too much risk that we’ll be assuming away the problem. It’s important to note that this definition is part of an ontology which has been adopted by labs doing empirical research.

Ad Hoc Thoughts on the Relationships Between Approaches

Mechanistic : Empiricist

The empiricist approach interprets the mechanistic conditions for mesa-optimizers as being too strict. Following from that, the definition of outer alignment deployed by the mechanistic approach isn’t amenable to empirical operationalization. It's worth noting that even Evan (who is most strongly associated with the mechanistic approach) employed more behavioral definitions of outer and inner alignment when first putting a call out for empirical experiments. This relates to the fact that the empiricist approach does not see judging whether the objective function is aligned in isolation as productive, since that is not something which can be empirically verified.

A point which comes to mind is that the empiricist approach has yet to produce any kind of evaluation scheme similar to the previous efforts of the mechanistic approach. Given the behavioral nature of their definitions, I suspect that such evaluations would take a different shape than the ones proposed by the mechanistic approach.

Mechanistic : True Names

There are two principal differences between these two approaches. The first is optimization under uncertainty. There may be scenarios where we can’t score all domain items (such as the regime where we don’t have enough human foresight to be able to answer a particular problem), and robustly pointing to someone’s preference in that regime is difficult. The second is the idea that the mechanistic approach actually separates the objective function from other concerns (data and priors) in its evaluation and definition of outer alignment. John has argued against this separation, stating that outer alignment amounts to creating a robust pointer towards the true name of our desires, and that such a pointer isn’t even well defined without considering the data and prior. I imagine that this position is coupled to the notion that deconfusing human values is a necessary condition for solving alignment, and suspect that these same positions would generalize to others who share such a view.

Mechanistic : Environmental Stakes

The major points of divergence between the mechanistic and stakes approaches are stated above in the stakes approach summary and expressivity description; namely, the stakes approach’s capacity to support both theoretical guarantees and empirical work. Given the behavioral nature of the definitions employed by the environmental stakes approach, I suspect that, much as with the empiricist approach, its evaluations of alignment proposals would take a different shape. Seeing how Paul’s methodology relies heavily on informal stories, it is at this moment unclear to me what a formal guarantee of “no plausible story of egregious misalignment can be constructed” would look like.

Empiricist : True Names

The intentional stance is inconsistent with a theory of existing agency, or with any attempt to reduce the set of all things which can reasonably be interpreted as agents to a set of mechanistic properties. Another point on which they disagree is the locality of a goal to a particular dataset or distribution, since for John (bla bla bla). However, they are in agreement with the idea that we should be treating the tuple (data, priors, objective function) as a unit, although they arrive at this conclusion for different reasons: the true names approach believes that the pointers problem is upstream of the outer alignment problem, while the empiricist approach is more broadly concerned with robustness and with devising ways to test and develop it.

Empiricist : Environmental Stakes

It’s worth noting that the definitions of both approaches have been employed in empirical tests. Both have a “let’s ignore distributional shift for now” kind of feel; however, the environmental stakes approach is motivated by arriving at a clean subproblem, while the empiricist approach is more motivated by deriving empirical results and avoiding the problem of specifying a “perfect reward function”.

True Names : Environmental Stakes

Although the environmental stakes approach’s definitions don’t reduce the idea of “choosing the right objective” to choosing an objective function, it does seem that these two approaches disagree about which subproblem is important to tackle first. They both agree that there are necessary problems upstream of both the outer and inner alignment definitions provided by the mechanistic approach, with the true names approach being committed to deconfusing agency prior to trying to work across the theory-practice gap, rather than taking on the problem as is. One concrete point of difference between these two approaches is that the environmental stakes definitions assume away problems of embeddedness.

Future Work

Adding the perspectives of other researchers to the mix
- Armstrong’s Take on Model Splintering
- Vanessa’s take on avoiding traps and non-Cartesian Daemons
It is now clear to me that the framework of ontological commitments and theoretical expressivity is neither general nor clear enough to support the kind of observations which I'm hoping to make (how do different alignment approaches relate in their methodologies, theories, and definitions). Systematizing the claims into the Scientonomy taxonomy and showing relationships between the claims of the different approaches using their diagrammatic notation would be a positive step in that direction.
Operationalism: What is missing from the scientonomic framework, however, is a strong pragmatist basis which could actually provide prescriptions for the development of new systems of practice. While I have familiarity with the philosophical underpinnings of pragmatism and feel pretty comfortable using pragmatist methods in ethnographic research, its use in more historical research is still somewhat opaque to me. Reviewing some texts which make use of such methods in historical research, and looking specifically at how they justify inferences about practices from static historical data, would be instructive in this regard.

^[1] Quote from Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, a forthcoming paper
