Early work on this was supported by CEEALAR and was finished during an internship at Conjecture under the mentorship of Adam Shimi.
Thank you to Evan Hubinger and Rohin Shah for answering some questions related to this post.
Epistemic Status: First palimpsest of many to come.
I am broadly interested in pluralism in scientific development, and more practically interested in supporting the diversification of conceptual alignment.
Towards that end, I believe that having a systematic analysis of the cruxes and methodologies which split up the field into distinct agendas could help clarify how exactly diversity is created and sustained in our field, and what exactly it is we wish to diversify in order to better manage our collective portfolio of research bets.
As a case study, this post will investigate four different approaches to inner alignment. I’ll be taking a look at the different definitions which they use for “outer alignment” and conjecturing on how:
This post is distillational in nature, and as such most of the ideas which I present here are not novel and not my own. The claims in this post should also be read as a part of an ongoing exploratory process, and any pushback (especially from those whose work I cite) would be beneficial.
The mechanistic approach deconfused a lot of previous work on optimization daemons and provided the now canonical argument for why we’d expect mesa-optimizers and inner misalignment, which is summarized as follows:
Traditionally, most of the optimization pressure (in other words, search) is applied at training time, in looking for a specific model which performs competently on a task. However, as ML becomes more powerful, models will be trained on more complicated tasks across a wider range of environments. This puts more pressure on a model to generalize, and raises the incentive for the model to delegate some of its optimization power to deployment time rather than training time. It is in those cases that we get a mesa-optimizer, and it is with the emergence of mesa-optimizers that the issue of misalignment between the mesa-optimizer’s objectives and the objectives which we were training for arises.
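To make the contrast between training-time and deployment-time optimization concrete, here is a minimal toy sketch. It is not the formal definition from RFLO; the task (move toward the origin on a number line), the function names, and the action set are all hypothetical choices made for illustration. The first policy stores the result of a search done during training, while the second carries out a search against an internal objective every time it acts.

```python
ACTIONS = (-1, 0, +1)

def base_objective(state, action):
    """Training signal for a hypothetical toy task: end up close to the origin."""
    return -abs(state + action)

def train_lookup_policy(train_states):
    """All optimization happens at training time.

    The search over actions is run once per training state and only the
    result is stored; at deployment the policy just looks the answer up.
    """
    table = {s: max(ACTIONS, key=lambda a: base_objective(s, a)) for s in train_states}
    return lambda state: table.get(state, 0)  # unseen states fall back to a no-op

def make_search_policy(internal_objective):
    """Optimization is delegated to deployment time.

    The returned policy runs its own search over actions, scored by an
    internal objective, whenever it is called. Nothing in this construction
    forces that internal objective to equal the base objective.
    """
    return lambda state: max(ACTIONS, key=lambda a: internal_objective(state, a))

train_states = range(-3, 4)
lookup = train_lookup_policy(train_states)
searcher = make_search_policy(base_objective)

for state in (2, 10, -10):
    print(state, lookup(state), searcher(state))
```

On the training states the two policies agree. On states outside the training range only the searching policy still acts competently, which is the generalization pressure the argument above points to. The same construction also shows where the risk comes from: the searching policy is only as good as whatever internal objective it ended up with.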
The mechanistic approach has also produced evaluations of alignment proposals, along with methods for carrying out such evaluations. This approach is defined by an aim towards conceptual clarity, visible not only in its use of mechanistic definitions in both problem statements and proposed solutions, but also in the continued refinement of those definitions.
In contrast to the mechanistic approach, the empirical approach’s strategy is mostly focused on developing knowledge of inner alignment by creating empirical experiments studying the phenomena. Tightly coupled to this strategy is a favoring of empirical operationalizability over mechanisticality as a precondition for the definitions which they use. They will often allude to the intentional stance as an influence and cite 2-D Robustness as a core frame for the alignment problem. They favor terms such as objective robustness and generalization over mesa-optimizers and inner alignment.
A redrafting of the classical argument for why we’d see inner misalignment in their language is as follows: Under certain non-diversified environments, a set of actions may be coherent with the pursuit of more than one goal, call them G_1 and G_2, where G_2 refers to our intended goal. If the model is trained in those environments and only deployed to those types of environments, then all is good, since the pursuit of either goal is indistinguishable from the other. However, if we deploy in an environment which is more complex, the statistical relationship between G_1 and G_2 might not hold, and we might encounter a situation where our model continues to competently pursue G_1 while failing to pursue G_2. This malign failure is worse than a benign failure, since in the case where G_1 is a net negative (or worse, an existential catastrophe) the model competently brings that net negative about in the world rather than just fizzling into what seems to be more or less random behavior.
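The argument can be illustrated with a small toy environment. This is a sketch under stated assumptions rather than a reproduction of any published experiment: the corridor, the "go right" policy, and the coin placements are all hypothetical. The agent starts in the middle of a corridor; G_1 is "reach the right wall" and G_2 (the intended goal) is "reach the coin". In training the coin always sits at the right wall, so the two goals cannot be told apart.

```python
LENGTH = 9            # corridor cells 0..8 (hypothetical toy size)
START = LENGTH // 2   # the agent starts in the middle cell

def go_right_policy(position):
    """A learned behavior that is consistent with G_1: always step right."""
    return +1

def run_episode(policy, coin_position, max_steps=20):
    """Run one episode and return (g1_achieved, g2_achieved)."""
    position = START
    for _ in range(max_steps):
        if position in (0, LENGTH - 1):
            break  # the episode ends at either wall
        position += policy(position)
    g1 = position == LENGTH - 1      # G_1: proxy goal, reach the right wall
    g2 = position == coin_position   # G_2: intended goal, reach the coin
    return g1, g2

def success_rates(policy, coin_positions):
    """Fraction of episodes in which G_1 and G_2 are achieved."""
    results = [run_episode(policy, c) for c in coin_positions]
    n = len(results)
    return (sum(g1 for g1, _ in results) / n,
            sum(g2 for _, g2 in results) / n)

# Training distribution: the coin is always at the right wall.
train_coins = [LENGTH - 1] * 100
# Deployment distribution: the coin alternates between the two walls.
deploy_coins = [0, LENGTH - 1] * 50

train_g1, train_g2 = success_rates(go_right_policy, train_coins)
deploy_g1, deploy_g2 = success_rates(go_right_policy, deploy_coins)
print("train: ", train_g1, train_g2)
print("deploy:", deploy_g1, deploy_g2)
```

In training both success rates are perfect, so no behavioral measurement on that distribution can distinguish G_1 from G_2. In deployment the policy still achieves G_1 every time while achieving G_2 only when the coin happens to be on the right, which is the competent-but-misdirected pattern described above.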
To date, the work which has occurred under this approach is somewhat idiosyncratic to John’s take on alignment. It is thus defined by the assertion that the field of alignment does not yet have a sufficient level of clarity in its terms and definitions to pursue well-formed questions which would solve alignment. Naturally, it sees most alignment problems as downstream from deconfusing agency and the development of a theory of existing agents. The implications of this position for inner alignment are best characterized by the arguments John has made against the definitions of outer alignment employed by the mechanistic and empiricist approaches, whose major claims are the following:
The frames utilized in this approach were established in Paul’s post on Low-Stakes Alignment, where Paul introduces the concept of low-stakes alignment as an alternative to the outer alignment definition held by the mechanistic approach. A key characteristic which sets this approach apart is a commitment to clean definitions which can be employed in Paul’s worst-case-guarantee algorithmic research methodology, which can be understood as this series of steps:
Relatedly, this approach tends to be pessimistic towards methodologies which seek to open-endedly create theories of agency which go against this constraint.
Non-object level normative questions aside, I’ve identified two cruxes discussed in the literature which seem upstream of the significant splits of practice between the approaches: realism about rationality, and whether agent behavior implies agent structure. Below are the questions laid out in detail, and some of the positions which can be associated with each question.
How precise can a theory of rational agency be? How many layers of abstraction or steps of indirection can it scale to? What kind of knowledge can a theory of rationality generate? What kinds of actions does this allow us to perform?
(Note that these positions aren’t along a single axis. I read this as mostly a consequence of the fact that this question isn’t well posed, and it should be understood as an open theory problem.)
The tuple following the approach’s name will signify which position they’re taking on the realism about rationality debates and whether agent behavior implies agent structure, respectively:
With the ontology proposed in RFLO and the definitions here provided, we are able to, in natural language conversations, distinguish between the different problems of outer and inner alignment. These definitions also allow us to assess and evaluate different alignment proposals. However, neither of them can be operationalized when doing empirical experiments on inner alignment. The first definition would require us to robustly be capable of locating the learned objective inside of a model, which is currently empirically infeasible. The second definition would require us to know the behavior of a model at the limit of infinite data and perfect training, which is infeasible in principle. Thus, neither definition allows us to empirically verify whether or not something is outer aligned and as a consequence we can’t make the relevant distinctions between outer and inner alignment.
These behavioral definitions allow us to do both empirical experiments and diagnostics. However, they are silent when it comes to explaining why a particular model is either benign or malign; they can only say that it is. This makes solving problems such as deception more difficult. On a more general note, it is still unclear to me how well empirical results and guarantees will generalize to the case of superintelligences.
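As a rough sketch of what such a behavioral diagnostic could look like, the function below labels a model's off-distribution behavior using only two measured success rates. The function name, the inputs, and the 0.9 threshold are hypothetical choices for illustration, not something taken from the cited work.

```python
def diagnose(capability_rate, intended_goal_rate, threshold=0.9):
    """Classify off-distribution behavior from two behavioral success rates.

    capability_rate: how often the model competently achieves *some* goal.
    intended_goal_rate: how often it achieves the goal we intended.
    threshold: hypothetical cutoff for counting behavior as competent.

    The function inspects outcomes only. It never looks inside the model,
    so it cannot say why a failure occurred.
    """
    if intended_goal_rate >= threshold:
        return "robust"
    if capability_rate >= threshold:
        return "malign failure"   # competent, but at the wrong goal
    return "benign failure"       # capabilities did not generalize either

print(diagnose(1.0, 1.0))
print(diagnose(1.0, 0.5))
print(diagnose(0.2, 0.1))
```

The limitation noted above shows up directly in the signature: the inputs are behavioral statistics, so two models with very different internals (for example, one that is deceptive and one that has simply latched onto a proxy) receive the same label.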
This take doesn't allow us to express much about the distinctions between the problems and solutions of inner and outer alignment. I view this as a consequence of the commitment to the idea that most alignment problems are downstream from the broadest sense of agency deconfusion, of both humans (the thing which we’re optimizing for; outer alignment) and AIs (their training and development; inner alignment).
Since the problem statement described by the definition is completely external to the algorithm which you’re trying to engineer, we can do pencil and paper theorizing on it without running too much risk that we’ll be assuming away the problem. It’s important to note that this definition is part of an ontology which has been adopted by labs doing empirical research.
The empiricist approach interprets the mechanistic conditions for mesa-optimizers as being too strict. Following from that, the definition of outer alignment deployed by the mechanistic approach isn’t amenable to empirical operationalization. It's worth noting that even Evan (who is most strongly associated with the mechanistic approach) employed more behavioral definitions of outer and inner alignment when first putting a call out for empirical experiments. This relates to the fact that the empiricist approach does not see judging whether the objective function is aligned in isolation as productive, since that is not something which can be empirically verified.
A point which comes to mind is that the empiricist approach has yet to produce any kind of evaluation scheme similar to the previous efforts of the mechanistic approach. Given the behavioral nature of their definitions, I suspect that such evaluations would take a different shape than the ones proposed by the mechanistic approach.
There are two principal differences between these two approaches. The first is optimization under uncertainty. There may be scenarios where we can’t score all domain items (such as the regime where we don’t have enough human foresight to be able to answer a particular problem), and robustly pointing to someone’s preference in that regime is difficult. The second is the idea that the mechanistic approach actually separates the objective function from other concerns (data and priors) in its evaluation and definition of outer alignment. John has argued against this separation, stating that outer alignment amounts to creating a robust pointer towards the true name of our desires, and that such a pointer isn’t even well defined without considering the data and prior. I imagine that this position is coupled to the notion that deconfusing human values is a necessary condition for solving alignment, and suspect that these same positions would generalize to others who share such a view.
The major points of divergence between the mechanistic and stakes approaches are stated above in the stakes approach summary and expressivity description, namely the stakes approach’s capacity to support both theoretical guarantees and empirical work. Given the behavioral nature of the definitions employed by the environmental stakes approach, I suspect that, much like the empiricist approach, its evaluations of alignment proposals would take a different shape. Seeing how Paul’s methodology relies heavily on informal stories, it is at this moment unclear to me what a formal guarantee of “no plausible story of egregious misalignment can be constructed” would look like.
The intentional stance is inconsistent with a theory of existing agency, or with any attempt to reduce the set of all things which can reasonably be interpreted as agents to a set of mechanistic properties. Another point of disagreement between them is the locality of a goal to a particular dataset or distribution, since for John (bla bla bla). However, they are in agreement with the idea that we should be treating the tuple (data, priors, objective function) as a unit, although they arrive at this conclusion for different reasons: the true names approach believes that the pointers problem is upstream of the outer alignment problem, while the empiricist approach is more broadly concerned with robustness and with devising ways to test and develop it.
I’ll note that the definitions of both approaches have been employed in empirical tests. Both have a “let’s ignore distributional shift for now” kind of feel; however, the environmental stakes approach is motivated by arriving at a clean subproblem, while the empiricist approach is more motivated by deriving empirical results and avoiding the problem of specifying a “perfect reward function”.
Although the environmental stakes definitions don’t reduce the idea of “choosing the right objective” to choosing an objective function, it does seem that these two approaches disagree about which subproblem is important to tackle first. They agree that there are necessary problems which are upstream of both the outer and inner alignment definitions provided by the mechanistic approach, with the true names approach being committed to deconfusing agency prior to trying to work across the theory-practice gap, rather than taking on the problem as is. One concrete point of difference between these two approaches is that the environmental stakes definitions assume away problems of embeddedness.
Quote from Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, a forthcoming paper