In her opening statement for the Munk debate on AI risk, Melanie Mitchell addressed a proposed example of AGI risk: an AGI tasked with fixing climate change might decide to eliminate humans as the source of carbon emissions. She says:

This is an example of what's called the fallacy of dumb superintelligence.[1] That is, it's a fallacy to think a machine could be 'smarter than humans in all respects', but still lack any common sense understanding of humans, such as understanding why we made the request to fix climate change.

I think this “fallacy” is a crux of disagreement about AI x-risk, by way of alignment difficulty. I've heard this statement from other reasonably well-informed risk doubters. The intuition makes sense. But most people in alignment would dismiss this out of hand as being itself a fallacy. Understanding these two positions not only clarifies the discussion, but suggests reasons we're overlooking a promising approach to alignment.

This "fallacy" doesn’t establish that alignment is easy. Understanding what you mean doesn’t make the AGI want to do that thing. Actions are guided by goals, which are different from knowledge. But this intuition that understanding should help alignment needn’t be totally discarded. We now have proposed alignment approaches that make use of an AI's understanding for its alignment. They do this by “pointing” a motivational system at representations in a learned knowledge system, such as “human flourishing”. I discuss two alignment plans that use this approach and seem quite promising.

Early alignment thinking assumed that this type of approach was not viable, since AGIs could "go foom" (learn very quickly and unpredictably). This assumption appears not to be true of sub-human levels of training in deep networks, and that may be sufficient for initial alignment.

With a system capable of unpredictable rapid improvement, it would be madness to let it learn prior to aligning it. It might very well grow smart enough to escape before you get a chance to stop its learning to perform alignment. Thus, its goals (or a set of rewards to shape them) must be specified before it starts to learn. In that scenario, the way we specify goals cannot make use of the AI's intelligence. Mitchell’s “fallacy” is itself a fallacy under this logic. An AGI that understands what we want can easily do things we very much don't want.

But early foom now seems unlikely, so our thinking should adjust. Deep networks don’t increase in capabilities unpredictably, at least prior to human-level capability and recursive self-improvement. And that may be far enough for initial alignment to succeed. I think the early assumption of AGI “going foom” has left a mistake in our collective thinking: that an AGI’s knowledge is irrelevant to making it do what we want.[2]

How to safely use an AI’s understanding for alignment

Deep networks learn at a relatively predictable pace, in the current training regime. Thus, their training can be paused at intermediate levels that include some understanding of human values, but before they achieve superhuman capabilities. Once a system starts to reflect and direct its own learning, this smooth trajectory probably won’t continue. But we can probably stop at a safe but useful level of intelligence/understanding, if we set that level carefully and cautiously. We probably can align an AGI partway through training, and thus make use of its understanding of what we want.

There are three particularly relevant examples of this type of approach. The first, RLHF, is relevant because it is widely known and understood. (I among others don’t consider it a promising approach to alignment by itself.) RLHF uses the LLM’s trained “understanding” or “knowledge” as a substrate for efficiently specifying human preferences. Training on a limited set of human judgments about input-response pairs causes the LLM to generalize these preferences remarkably well. We are “pointing to” areas in its learned semantic spaces. Because those semantics are relatively well-formed, we need to do relatively little pointing to define a complex set of desired responses.
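As a rough illustration of how a small set of human judgments becomes a training signal, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train RLHF reward models. The function name and the example scores are assumptions for illustration; in a real system the scores would come from a reward model built on the LLM's learned representations, which is what lets so few judgments generalize.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the response the human preferred gets the higher score.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is equal to log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Hypothetical reward scores for two responses to the same input.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Minimizing this loss over many judged pairs pushes the scores toward the human ranking. It does not by itself say anything about how well those preferences hold up out of distribution.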

The second example is natural language alignment of language model agents (LMAs). This seems like a very promising alignment plan, if LMAs become our first AGIs. This plan consists of designing the agent to follow top-level goals stated in natural language (e.g., "get OpenAI a lot of money and political influence"), including alignment goals (e.g., "do what Sam Altman wants, and make the world a better place"). I've written more about this technique, and the ensemble of techniques it can "stack" with, here.

This approach follows the above general scheme. It pauses training to do alignment work by pre-training the LLM, and inserting alignment goals before launching the system as an agent. (This is mid-training, if that agent continues to perform continuous learning, as seems likely.) If the AI is sufficiently intelligent, it will pursue those goals as stated, including their rich and contextual semantics. Choosing these goal statements wisely is still a nontrivial outer alignment problem; but the AI’s knowledge is the substrate for defining its alignment. 

Another promising alignment plan that follows this general pattern is Steve Byrnes' Plan for mediocre alignment of brain-like [model-based RL] AGI. In this plan, we induce the nascent AGI (paused at a useful but controllable level of understanding/intelligence) to represent the concept we want it aligned to (e.g., “think about human flourishing” or “corrigibility” or whatever). We then set the weights from the active units in its representational system into its critic system. Since the critic system is a steering subsystem that determines its values and therefore its behavior, inner alignment is solved. That concept has become its “favorite”, highest-valued set of representations, and its decision-making will pursue everything semantically included in that concept as a final goal.
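The weight-setting step can be pictured with a toy sketch. This is not Byrnes' actual mechanism: it assumes a linear critic over a small vector of representational units, and every name and number below is hypothetical. It only shows the basic idea that copying the concept's active units into the critic's weights makes plans that overlap with the concept score higher.

```python
# Toy sketch only: a linear "critic" over a small representation vector.
# All names and activation values here are hypothetical.

def point_critic_at(concept_activations, gain=1.0):
    """Return critic weights copied from the units active for a concept."""
    return [gain * a for a in concept_activations]

def critic_value(weights, activations):
    """Value assigned to a thought or plan: a weighted sum of its active units."""
    return sum(w * a for w, a in zip(weights, activations))

# Assumed activation pattern recorded while the paused system represents the target concept.
concept = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
weights = point_critic_at(concept)

# Two candidate plans, described by which representational units they activate.
plan_overlapping = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # shares two units with the concept
plan_unrelated = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]    # shares none

scores = {
    "overlapping": critic_value(weights, plan_overlapping),
    "unrelated": critic_value(weights, plan_unrelated),
}
best = max(scores, key=scores.get)  # the plan the critic favors
```

In a real system the representations would be learned, distributed, and still changing, which is part of why the stability of this kind of alignment needs separate analysis.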

Now, contrast these techniques with alignment techniques that don’t make use of the system’s knowledge. Shard Theory and other proposals for aligning AGI by using the right set of rewards are one example. This requires accurately guessing how the system’s representations will form, and how those rewards will shape the agent’s behavior as they develop. Hand-coding a representation of any but the simplest goal (see diamond maximization) seems so difficult that it’s not generally considered a viable approach.

These are sketches of plans that need further development and inspection for flaws. And they only produce an initial, loose ("mediocre") alignment with human values, in the training distribution. The alignment stability problem of generalization and change of values remains unaddressed. Whether the alignment remains satisfactory after further learning, self-modification, or in new (out of distribution) circumstances seems like a complex problem that deserves further analysis.

This approach of leveraging an AI’s intelligence and “telling it what we want” by pointing to its representations seems promising. And these two plans seem particularly promising. They apply to types of AGI we are likely to get (language model agents, RL agents, or a hybrid); they are straightforward enough to implement, and straightforward enough to think about in detail prior to implementing them.

I’d love to hear specific pushback on this direction, or better yet, these specific plans. AI work seems likely to proceed apace, so alignment work should proceed with haste too. I think we need the best plans we can make and critique, applying to the types of AGI we’re most likely to get, even if those plans are imperfect. 

  1. ^

    Richard Loosemore appears to have coined the term in 2012 or before. He addresses this argument here, reaching similar conclusions to those here: Do what I mean is not automatic, but neither is it particularly implausible to code an AGI to infer intentions and check with its creators when they’re likely to be violated.

  2. ^

    See the recent post Evaluating the historical value misspecification argument. It expands on the historical context for these ideas, particularly the claim that we should adjust our estimates of alignment difficulty in light of AI that has reasonably good understanding of human values. I don't care who thought what when, but I do care how the collective train of thought reviewed there might have misled us slightly. The discussion on that post clarifies the issues somewhat. This post is intended to offer a more concrete answer to a central question posed in that discussion: how we might close the gap between AI understanding our desires, and actually fulfilling them by making its decisions based on that understanding. I’m also proposing that the key change from historical assumptions is the predictability of learning, and therefore the option of safely performing alignment work on a partly-trained system.
