Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd

Summary:

Alignment work on network-based AGI focuses on reinforcement learning. There is an alternative approach that avoids some, but not all, of the difficulties of RL alignment. Instead of trying to build an adequate representation of the behavior and goals we want by specifying rewards, we can choose an AGI’s goals from the representations it has learned through any learning method.

I give three examples of this approach: Steve Byrnes’ plan for mediocre alignment (of RL agents); John Wentworth’s “retarget the search” for goal-directed mesa-optimizers that could emerge in predictive networks; and natural language alignment for language model agents. These three approaches fall into a natural category that has important advantages over more commonly considered RL alignment approaches.

An alternative to RL alignment

Recent work on alignment theory has focused on reinforcement learning (RL) alignment. RLHF and Shard Theory are two examples, but most work addressing network-based AGI assumes we will try to create human-aligned goals and behavior by specifying rewards. For instance, Yudkowsky’s List of Lethalities seems to address RL approaches and exemplifies the most common critiques: specifying behavioral correlates of desired values seems imprecise and prone to mesa-optimization and misgeneralization in new contexts. I think RL alignment might work, but I agree with the critique that much optimism for RL alignment doesn’t adequately consider those concerns.

There’s an alternative to RL alignment for network-based AGI. Instead of trying to provide reinforcement signals that will create representations of aligned values, we can let it learn all kinds of representations, using any learning method, and then select from those representations what we want the goals to be.

I’ll call this approach goals selected from learned knowledge (GSLK). It is a novel alternative not only to RL alignment but also to older strategies focused on specifying an aligned maximization goal before training an agent. Thus, it violates some of the assumptions that lead MIRI leadership and similar thinkers to predict near-certain doom.

GSLK involves allowing a system to learn until it forms robust representations, then selecting some of these representations to serve as goals. This is a paradigm shift from RL alignment, which has dominated alignment discussions since deep networks became the clear leader in AI. RL alignment attempts to construct goal representations by specifying reward conditions. In GSLK alignment, the system learns representations of a wide array of outcomes and behaviors, using any effective learning mechanisms. From that spectrum of representations, goals are selected. This shifts the problem from creation to selection of complex representations.

This class of alignment approaches shares some of the difficulties of RL alignment proposals, but not all of them. Thus far GSLK approaches have received little critique or analysis. Several recent proposals share this structure, and my purpose here is to generalize from those examples to identify the category.

I think this approach is worth some careful consideration because it’s likely to actually be tried. It applies to LLM agents, to most types of RL agents, and to agentic mesa-optimization in large foundation models. And it’s pretty obvious, at least in hindsight. If the first agentic AGI is an LLM agent, an RL agent, or a combination of the two, I think it’s fairly likely that this will be part of the alignment plan whose success or failure determines all of our fates. So I’d like to get more critique and analysis of this approach.

A metaphor: communicating with an alien

Prior to giving examples of GSLK alignment, I’ll share a loose metaphor that captures some intuitive reasons for optimism about them. Suppose you had to convey to an alien what you meant by “kindness” without sharing a language. You might show it many instances of people helping other people and animals. You’d probably include some sci-fi depictions of aliens and humans helping each other. That might work. But it might not; the alien, if it was more alien than you expected, might deduce something related but importantly wrong like “charity” with a negative connotation.

If you could somehow read that alien’s mind fairly well, you could signal “that!” when it’s thinking about something like kindness. If you repeated that procedure, it seems more likely that you’d get it to understand what you’re trying to convey. This is one way of selecting goals, by interpretability. Better yet, if the alien has some grasp of a shared language, you could use the word “kindness” and a bunch more words to try to convey what you’re talking about.

Goal selection from learned knowledge is like using language and/or “mind reading” (in the form of interpretability methods) to identify or evoke the alien’s existing knowledge of the concept you want to convey. RL alignment is like trying to convey what you mean solely by giving examples.

Plans for GSLK alignment

To clarify, I want to briefly mention the three proposals I know of that take this approach (constitutional AI doesn’t fit this category, despite similarities^[1]). Each allows humans to select an AGI’s goals from representations it’s learned.

Steve Byrnes’ Plan for mediocre alignment of brain-like [model-based RL] AGI.
- Take an actor-critic RL agent that’s been trained, but hasn’t yet escaped
- Tell it “think about human flourishing”, then record that state
  - Or otherwise try to evoke a representation you want as a goal
- Set high weights between its representational system and its critic system’s “good” representations
- Because the critic controls decisions (and planning if it can do planning), you now have an AI whose most important goal is (its understanding of) human flourishing.
Aligning language model agents using natural language:
- Take a language model agent designed to make plans and pursue goals stated in natural language
- Make its goal something like “pursue human flourishing primarily, and secondarily make me a bunch of money” or whatever you want
- You now have an agent whose most important goal is (its understanding of) human flourishing
John Wentworth's plan How To Go From Interpretability To Alignment: Just Retarget The Search. The following bullet points are quoted from that article (formatting adjusted):
- Given these two assumptions [mesa-optimization and appropriate abstractions; see post], here’s how to use interpretability tools to align the AI:
  - Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
  - Identify the retargetable internal search process.
  - Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
  - Just retarget the search. Bada-bing, bada-boom.
  - My summary: You now have an agent whose most important goal is (its understanding of) human flourishing (or whatever outer alignment target you chose)
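As a rough illustration of what "retargeting" means in the steps quoted above, here is a minimal toy sketch. It assumes the search process takes its target as an explicit input that can be swapped. The graph, the state names, and the `concepts` table are all hypothetical stand-ins for learned internal representations, not anything specified in Wentworth's post.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]


def search(graph: Graph, start: str,
           is_goal: Callable[[str], bool]) -> Optional[List[str]]:
    """Breadth-first search: return a shortest path from `start` to the
    first state satisfying `is_goal`, or None if no such state is reachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical world model (assumption: a tiny hand-written state graph).
world: Graph = {
    "start": ["gather_resources"],
    "gather_resources": ["maximize_paperclips", "ask_operator"],
    "ask_operator": ["do_what_user_intends"],
}

# Hypothetical "learned concepts", each usable as a target for the search.
concepts: Dict[str, Callable[[str], bool]] = {
    "paperclips": lambda s: s == "maximize_paperclips",
    "user_intention": lambda s: s == "do_what_user_intends",
}

# The same search machinery, pointed at two different internal targets.
original_plan = search(world, "start", concepts["paperclips"])
retargeted_plan = search(world, "start", concepts["user_intention"])
```

The search code itself is untouched between the two calls; only the target it is handed changes. Finding the real analogues of `search` and `concepts` inside a trained network is the hard interpretability problem this toy skips entirely.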

Each of these is a method to select goals from learned knowledge. None of them involve constructing goals (or just aligned behavior) using RL with carefully selected reinforcements.
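To make the first proposal concrete, here is a minimal toy sketch of setting critic weights to a recorded representation. It assumes a linear critic whose value for a plan is the dot product of its weights with the plan's representation. The vectors, plan names, and the `gain` parameter are hypothetical, and a real actor-critic system would be far messier than this.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyCritic:
    """Linear critic: value = weights . representation (an assumption)."""

    def __init__(self, weights: Vector) -> None:
        self.weights = list(weights)

    def value(self, representation: Vector) -> float:
        return dot(self.weights, representation)


def select_goal(critic: ToyCritic, recorded: Vector, gain: float = 10.0) -> None:
    """Overwrite the critic's weights with a scaled copy of the recorded
    representation, so states resembling it are valued highly."""
    critic.weights = [gain * x for x in recorded]


def choose_plan(critic: ToyCritic, plans: Dict[str, Vector]) -> str:
    """Pick the plan whose representation the critic values most."""
    return max(plans, key=lambda name: critic.value(plans[name]))


# Hypothetical learned representations, as if recorded while the agent
# was "thinking about" each concept.
learned: Dict[str, Vector] = {
    "human_flourishing": [1.0, 0.0, 0.0],
    "paperclips": [0.0, 1.0, 0.0],
}

# Hypothetical candidate plans, expressed in the same representation space.
plans: Dict[str, Vector] = {
    "build_hospital": [0.9, 0.1, 0.0],
    "build_paperclip_factory": [0.1, 0.9, 0.0],
}

critic = ToyCritic([0.0, 1.0, 0.0])  # weights left over from RL training
before = choose_plan(critic, plans)

select_goal(critic, learned["human_flourishing"])
after = choose_plan(critic, plans)
```

In this toy, the critic's preferences flip from the paperclip plan to the hospital plan without any further reward signal. Only the link between an existing representation and the critic is edited, which is the sense in which the goal is selected and not constructed.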

I’m sure you can see potential problems with each of these. I do too. There are a lot of caveats, problems to solve, and bells and whistles to be added to those simple summaries. But here I want to focus on the overlap between these approaches, and the advantage they give over RL alignment plans.

Advantages of GSLK over RL

Wentworth’s statement of his method’s strengths applies to this whole class of approaches:

“But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, move it around and hide it in different black boxes but never actually eliminate it. ‘Just Retarget the Search’ directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.”

Each of these techniques does this, for the same reasons. There are still problems to be solved. Correctly identifying the desired goal representations and wisely choosing the goal representations you want (goalcrafting) are still nontrivial problems. But the worry about mesa-optimization (the inner alignment problem) is gone. Misgeneralization in new contexts is still a problem, but it’s arguably easier to solve with these approaches, and with wise selection of goals. More on this below.

GSLK approaches allow learning beyond RL to be useful for alignment. Recent successes in transformers and other architectures suggest that predictive learning may be superior to RL in creating rich representations of the world. Predictive learning is driven by a vector signal of information about what actually occurred, whereas RL uses a scalar signal reflecting only the quality of outcomes. Predictive learning also avoids the limitations of external labeling required in RL. The brain appears to use predictive learning for its “heavy lifting” in the cortex, with RL in subcortical areas to select actions and goals from those rich cortical representations.^[2] RL agents appear to benefit from similar combinations.
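The difference in signal richness described above can be illustrated with a minimal sketch. The function names and numbers are hypothetical; the only point is the shape of the feedback each kind of learning receives per step.

```python
from typing import List


def predictive_signal(predicted: List[float], observed: List[float]) -> List[float]:
    """Per-dimension prediction error: one number for every predicted feature."""
    return [o - p for p, o in zip(predicted, observed)]


def rl_signal(outcome_quality: float) -> float:
    """Reward: a single number summarizing how good the outcome was."""
    return outcome_quality


# Hypothetical prediction and observation over four features.
predicted = [0.25, 0.5, 0.0, 1.0]
observed = [0.0, 1.0, 0.0, 1.0]

errors = predictive_signal(predicted, observed)  # a vector of corrections
reward = rl_signal(1.0)                          # one scalar for the whole step
```

Here the predictive learner is told which features it got wrong and by how much, while the RL learner receives one number and must work out which parts of its behavior earned it.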

Goal selection from learned knowledge is contingent on being able to stop an AGI’s training to align it. But deep network learning progresses relatively predictably, at least at the scales we’ve trained networks to so far. So stopping network-based systems after substantial learning seems likely to work. There are ways this can go wrong, but those don’t seem likely enough to prevent people from trying it.^[3] I’ve written more about how this predictable rate of learning allows us to use an AGI’s understanding of what we want to make it a “genie that cares what we want” in The (partial) fallacy of dumb superintelligence.

RL alignment focuses on creating aligned goals by rewarding preferable outcomes. Examples include RLHF (Reinforcement Learning from Human Feedback) for Large Language Models (LLMs) and the Shard Theory's suggestion of selecting a set of algorithmic rewards to achieve alignment. The challenge lies in ensuring that the set of reinforcements collectively forms a representation of the desired goals, a process that seems unreliable. For instance, specifying something as complex and abstract as human flourishing, that remains stable in all contexts, by pointing to specific instances seems difficult and fallible. Even conveying the relatively simple goal “do what this guy wants” by rewarding examples seems fraught with generalization problems. This is the basis of squiggle maximizer concerns.

Remaining challenges

Some of those concerns also apply to goals selected from learned knowledge. We could make the selection poorly, or abstractions that adequately describe our values within the current context might fail to generalize to very different contexts. The strength of GSLK over RL alignment is that we have better representations to select from, so it’s more likely that they’ll generalize well. This is particularly apparent for systems that have acquired a grasp of natural language; language tends to generalize fairly well, since the meaning of words is dependent on surrounding language. However, the functional meaning of words does change for humans as we learn and encounter new contexts, so this does not entirely solve the problem. Those concerns can also be addressed by mechanistic interpretability; it can be used to better understand and select goals from learned representations. However, even with advanced mechanistic interpretability, there remains a risk of divergence between the model’s understanding of concepts like human flourishing and our own.

Concerns about incorrectly specified or changing meanings of goal representations are unavoidable in any alignment scheme involving deep networks. It’s impossible to know their representations fully, and their functional meaning changes if any part of the network continues learning. Our best mechanistic interpretability will almost certainly be imperfect. And how current representations will generalize when applied out of context is also difficult to predict. I think these difficulties strongly suggest that we will attempt to retain direct control and the ability to modify our AGI, rather than attempting to specify outer alignment and allow it full autonomy (and eventually sovereignty). I think that corrigibility or “do what I mean” (DWIM) is an attractive primary goal for AGI in part because “Do what I mean, and check with me” reduces the complexity of the target, making it easier to adequately learn and identify. But outer alignment is separable from the GSLK approach to inner alignment.

I’ve been trying to understand and express why I find natural language alignment and “mediocre” alignment for actor-critic RL so much more promising than any other alignment techniques I’ve found. I’ve written about how they work directly on steering subsystems and how they have low alignment taxes, including applying to the types of AGI we’re most likely to develop first. They’re also a distinct alternative to RL alignment, so they seem important to consider separately.

^[1]
Constitutional AI from Anthropic has some of the properties of GSLK alignment but not others. Constitutional AI does select goals from its learned knowledge in one important sense. It specifies criteria for its output (a weak but real sense of “goal”) using a “constitution” stated in natural language. But it’s not an alternative to RL, because it applies those “goals” entirely through an RL process. The other methods I mention include no RL in their alignment methods. The “plan for mediocre alignment” applies to RL agents, but the method of setting critic weights to the selected goal representations overwrites the goal representations created through RL training. See that post for important caveats about whether it would work to entirely overwrite the RL-created goal representations. Similarly, natural language alignment has no element of RL. It could be applied in parallel with RL training, but that would re-introduce the problems of mesa-optimization and goal mis-specification inherent to RL alignment.
^[2]
I think this division of labor between RL and other learning mechanisms is nearly a consensus in neuroscience. I’m uncertain only because polls are rare and neuroscientists are often contrarians. Steve Byrnes has summarized evidence for this in [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering and the remainder of his excellent sequence Intro to Brain-Like-AGI Safety.
^[3]
LLMs might display emergent agency at some point in their training, but it seems likely we can train them further without that, or detect that agency. Current LLMs appear to have adequate world knowledge to mostly understand the most relevant concepts. I wouldn’t trust them to adequately understand “human flourishing” in all contexts, but I think they adequately understand “Your primary goal is to make sure this team can shut you down for adjustments”. Such a corrigibility or “do what I mean and check” goal also punts on the problem of selecting a perfect set of goals for all time.
