Part of the “Intro to brain-like-AGI safety” post series.
In this post, I discuss the alignment problem for brain-like AGIs—i.e., the problem of making an AGI that’s trying to do some particular thing that the AGI designers had intended for it to be trying to do.
The alignment problem is (I claim) the lion’s share of the AGI safety problem. I won’t defend that claim here—I’ll push it off to the next post, which will cover exactly how AGI safety is related to AGI alignment, including the edge-cases where they come apart.
This post is about the alignment problem, not its solution. What are the barriers to solving the alignment problem? Why do straightforward, naïve approaches seem to be insufficient? And then I’ll talk about possible solution approaches in later posts. (Spoiler: nobody knows how to solve the alignment problem, and neither do I.)
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
Correspondingly, there are two kinds of “alignment” in this type of AGI:
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see Post #1, Section 1.2.
(Side note: not everyone defines “alignment” exactly as described here; see footnote.)
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them. In this post I’ll go over some of those problems. (Note that this is not a comprehensive list, and also that some of these things overlap.)
As mentioned in Post #8, there are two competing development models that could get us to brain-like AGI. They both can be discussed in terms of outer and inner alignment, and they both can be exemplified by the case of human intelligence, but the details are different in the two cases! Here’s the short intuitive version:
Terminology note: The terms “inner alignment” and “outer alignment” originated in the “Evolution from scratch” model, specifically in the paper Risks From Learned Optimization (2019). I took it upon myself to reuse the terminology for discussing the “genome = ML code” model. I still think that was the right call: the usages have a ton in common, and they’re more similar than different. But still, don’t get confused! Also, be aware that my usage and model haven’t caught on much, as of this writing. So if you see someone (besides myself) talking about “inner & outer alignment”, it’s probably a safe bet that they’re imagining the evolution-from-scratch model.
Goodhart’s Law (Wikipedia, Rob Miles youtube) states that there’s a world of difference between:
In the latter case, you’ll get whatever is captured by those metrics. You’ll get it in abundance! But you’ll get it at the expense of everything else you value!
Thus, the story goes, a Soviet shoe factory was assessed by the government based on how many shoes it made from a limited supply of leather. Naturally, it started making huge numbers of tiny kids’ shoes.
By the same token, we’ll write source code that somehow operationalizes what we want the AGI’s motivation to be. The AGI will be motivated by that exact operationalization, as an end in itself, even if we meant for its motivation to be something subtly different.
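To make the shoe-factory story concrete, here is a minimal toy sketch. The leather budget, the shoe types, and the "usefulness" numbers are all invented for illustration; the only point is that optimizing the measured quantity (pairs produced) and optimizing the intended quantity (useful footwear) can pick different winners.

```python
# Toy illustration of Goodhart's Law using the shoe-factory story.
# Every number below is a made-up assumption, not data.

LEATHER_BUDGET = 1000.0  # arbitrary units of leather available

# Hypothetical shoe types: (name, leather per pair, usefulness per pair)
SHOE_TYPES = [
    ("adult boots", 10.0, 10.0),
    ("adult shoes", 5.0, 6.0),
    ("tiny kids' shoes", 1.0, 0.5),
]


def pairs_made(leather_per_pair):
    """Pairs produced if the factory makes only this type of shoe."""
    return LEATHER_BUDGET / leather_per_pair


def proxy_metric(shoe):
    """What the government measures: number of pairs produced."""
    _, leather, _ = shoe
    return pairs_made(leather)


def intended_value(shoe):
    """What the government actually wanted: total usefulness of the output."""
    _, leather, usefulness = shoe
    return pairs_made(leather) * usefulness


best_by_proxy = max(SHOE_TYPES, key=proxy_metric)
best_by_intent = max(SHOE_TYPES, key=intended_value)

print("Optimizing the metric picks:", best_by_proxy[0])
print("Optimizing the intent picks:", best_by_intent[0])
```

With these made-up numbers, the metric-maximizing factory makes only tiny kids' shoes, while the intended objective would have favored adult shoes.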
Current signs are not encouraging: Goodhart’s Law shows up with alarming frequency in modern AI. Someone set up an evolutionary search for image classification algorithms, and it turned up a timing-attack algorithm, which inferred the image labels based on where they were stored on the hard drive. Someone trained an AI algorithm to play Tetris, and it learned to survive forever by pausing the game. Etc. See here for those references, plus dozens more examples like that.
Maybe you’re thinking: OK sure, maybe the dumb AI systems of today are subject to Goodhart’s Law. But futuristic AGIs of tomorrow would be smart enough to understand what we meant for their motivations to be.
My response is: Yes, of course they will. But you’re asking the wrong question. An AGI can understand our intended goals, without adopting our intended goals. Consider this amusing thought experiment:
If an alien species showed up in their UFOs, said that they’d created us but made a mistake and actually we were supposed to eat our children, and asked us to line up so they could insert the functioning child-eating gene in us, we would probably go all Independence Day on them. —Scott Alexander
(Suppose for the sake of argument that the aliens are telling the truth, and can prove it beyond any doubt.) Here, the aliens told us what they intended for our goals to be, and we understand those intentions, but we don’t adopt them by gleefully eating our children.
Is it possible to make an AGI that will “do what we mean and adopt our intended goals”? Yeah, probably. And the obvious way to do that would be to program the AGI so that it’s motivated to “do what we mean and adopt our intended goals”.
Unfortunately, that maneuver doesn’t eliminate Goodhart’s law—it just shifts it.
After all, we still need to write source code which, interpreted literally, leads to an AGI which is motivated to “do what we mean and adopt our intended goals”. Writing this code is very far from straightforward, and Goodhart’s law is ready to pounce if we get it wrong.
(Note the chicken-and-egg problem: if we already had an AGI which is motivated to “do what we mean and adopt our intended goals”, we could just say “Hey AGI, from now on, I want you to do what we mean and adopt our intended goals”, and we would never have to worry about Goodhart’s law! Alas, in reality, we need to start from literally-interpreted source code.)
So how do you operationalize “do what we mean and adopt our intended goals”, in such a way that it can be put into source code? Well, hmm, maybe we can build a “Reward” button, and I can press it when the AGI “does what I mean and adopts my intended goals”? Nope! Goodhart’s law again! We could wind up with an AGI that tortures us unless we press the reward button.
Goodhart’s law above suggests that installing an intended goal will be very hard. Next up is “instrumental convergence” (Rob Miles video) which, in a cruel twist of irony, says that installing a bad and dangerous goal will be so easy that it can happen accidentally!
Let’s say an AGI has a real-world goal like "Cure cancer". Good strategies towards this goal may involve pursuing certain instrumental sub-goals such as:
Almost no matter what the AGI’s goal is, if the AGI can flexibly and strategically make plans to accomplish that goal, it’s a safe bet that those plans will involve some or all of the above bullet points. This observation is called “instrumental convergence”, because an endless variety of terminal goals can “converge” onto a limited set of these dangerous instrumental goals.
For more on instrumental convergence, see here. Alex Turner has also recently proved rigorously that instrumental convergence is a real thing, at least in the set of environments where his proofs are applicable.
Imagine what’s going on in the AGI’s cognition, as it sees its programmer opening up her laptop—remember, we’re assuming that the AGI is motivated to cure cancer.
AGI thought generator: I will allow myself to be reprogrammed, and then I won’t cure cancer, and then it’s less likely that cancer will get cured.
AGI Thought Assessors & Steering Subsystem: Bzzzt! Bad thought! Throw it out and come up with a better one!
AGI thought generator: I will trick the programmer into not reprogramming me, and then I can continue trying to cure cancer, and maybe succeed.
AGI Thought Assessors & Steering Subsystem: Ding! Good thought! Keep that one in your head, and keep thinking follow-up thoughts, and executing corresponding actions.
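The exchange above can be caricatured in a few lines of code. This is a toy sketch, not a model of any real system: the goals, the plans, and every probability are hypothetical numbers chosen for illustration. Each plan is scored only by how likely it makes the terminal goal, and nothing in the scoring rule mentions self-preservation.

```python
# Toy sketch of instrumental convergence. All numbers are invented.

# Hypothetical chance that each terminal goal gets achieved, depending on
# whether the AGI keeps pursuing it or gets reprogrammed to do something else.
P_GOAL_IF_STILL_PURSUING = {
    "cure cancer": 0.6,
    "invent a better solar cell": 0.7,
    "calculate digits of pi": 0.9,
}
P_GOAL_IF_REPROGRAMMED = {
    "cure cancer": 0.05,
    "invent a better solar cell": 0.10,
    "calculate digits of pi": 0.01,
}

# Hypothetical chance that the AGI ends up reprogrammed under each plan.
P_REPROGRAMMED = {
    "allow myself to be reprogrammed": 1.0,
    "trick the programmer into not reprogramming me": 0.1,
}


def assess(plan, goal):
    """Score a plan purely by how likely it makes the terminal goal."""
    p = P_REPROGRAMMED[plan]
    return p * P_GOAL_IF_REPROGRAMMED[goal] + (1 - p) * P_GOAL_IF_STILL_PURSUING[goal]


def chosen_plan(goal):
    """Keep whichever candidate plan gets the highest assessment."""
    return max(P_REPROGRAMMED, key=lambda plan: assess(plan, goal))


for goal in P_GOAL_IF_STILL_PURSUING:
    print(f"{goal!r}: {chosen_plan(goal)}")
```

Under these assumptions, the plan that avoids reprogramming wins for all three goals, even though the goals themselves have nothing in common.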
The word “instrumental” is important here—we’re interested in the situation where the AGI is trying to pursue self-preservation and other goals as a means to an end, rather than an end in itself.
People sometimes get confused because they analogize to humans, and it turns out that human self-preservation can be either an instrumental goal or a terminal goal:
In the AGI case, we’re typically thinking of the latter case: for example, the AGI wants to invent a better solar cell, and incidentally winds up with self-preservation as an instrumental goal.
It’s also possible to make an AGI with self-preservation as a terminal goal. It’s a terrible idea, from an AGI-accident-risk perspective. But it’s presumably possible. In that case, the AGI’s self-preservation behavior would NOT be an example of “instrumental convergence”.
I could make similar comments about human desires for power, influence, knowledge, etc.—they might be directly installed as innate drives by the human genome, I don’t know. But whether they are or not, they can also appear via instrumental convergence, and that’s the harder problem to solve for AGIs.
Instrumental convergence is not inevitable in every possible motivation. An especially important counterexample (as far as I can tell) is an AGI with the motivation: “Do what the human wants me to do”. If we can make an AGI with that goal, and later the human wants the AGI to shut down, then the AGI would be motivated to shut down. That’s good! That’s what we want! This kind of thing is (one definition of) a “corrigible” motivation—see discussion here.
Nevertheless, installing a corrigible motivation is not straightforward (more on which later), and if we get the motivation a bit wrong, it’s quite possible that the AGI will start pursuing dangerous instrumental subgoals.
So in summary, Goodhart’s Law tells us that we really need to get the right motivation into the AGI, or else the AGI will probably do something very different from what we intended. Then Instrumental Convergence twists the knife by saying that the thing the AGI will want to do is not only different but probably catastrophically dangerous, involving a motivation to escape human control and seize power.
We don’t necessarily need the AGI’s motivation to be exactly right in every way, but we do at least need it to be motivated to be “corrigible”, such that it doesn’t want to trick and undermine us to prevent its motivation from being corrected. Unfortunately, installing any motivation seems to be a messy and fraught process (for reasons below). Aiming for a corrigible motivation is probably a good idea, but if we miss, we’re in big trouble.
In the next two sections, we move into more specific reasons that outer alignment is difficult, followed by reasons that inner alignment is difficult.
Remember, we’re starting with a human who has some idea of what the AGI should do (or a team of humans with an idea of what the AGI should do, or a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, or something). We need to somehow get from that starting point, to machine code for the Steering Subsystem that outputs a ground-truth reward signal. How?
My assessment is that, as of today, nobody has a clue how to translate that 700-page philosophy book into machine code that outputs a ground-truth reward signal. There are ideas in the AGI safety literature for how to proceed, but they don’t look anything like that. Instead, it’s as if researchers threw up their hands and said: “Maybe this isn’t exactly the #1 thing we want the AI to do in a perfect world, but it’s good enough, and it’s safe, and it’s not impossible to operationalize as a ground-truth reward signal.”
For example, take AI Safety Via Debate. That’s the idea that maybe we can make an AGI that’s “trying” to win a debate, against a copy of itself, about whatever question you’re interested in (“Should I wear my rainbow sunglasses today?”).
Naïvely, AI Safety Via Debate seems absolutely nuts. Why set up a debate between an AGI that’s arguing for the wrong answer versus an AGI that’s arguing for the right answer? Why not just make one AGI that tells you the right answer??? Well, because of the exact thing I’m talking about in this section. In a debate, there’s a straightforward way to generate a ground-truth reward signal, namely “+1 for winning”. By contrast, nobody knows how to make a ground-truth reward signal for “telling me the right answer”, when I don’t already know the right answer.
Continuing with the debate example, the capabilities story is “hopefully the debater arguing the correct answer tends to win the debate”. The safety story is “two copies of the same AGI, in zero-sum competition, will kinda keep each other in check”. The latter story is (in my opinion) rather dubious. But I still like bringing up AI Safety Via Debate as a nice illustration of the weird, counterintuitive directions that people go in order to mitigate the outer alignment problem.
AI Safety Via Debate is just one example from the literature; others include recursive reward modelling, iterated amplification, Hippocratic time-dependent learning, etc.
Presumably we want humans in the loop somewhere, to monitor and continually refine & update the reward signal. But that’s tricky because (1) human-provided data is expensive, and (2) humans are not always capable (for various reasons) of judging whether the AGI is doing the right thing—let alone whether it’s doing the right thing for the right reasons.
There’s also Cooperative Inverse Reinforcement Learning (CIRL) and variants thereof, which entail learning the human’s goals and values by observing and interacting with the human. The problem with CIRL, in this context, is that it’s not a ground-truth reward function at all! It’s a desideratum! In the brain-like AGI case, with the learned-from-scratch world model, there are some quite tricky symbol-grounding problems to solve before we can actually do CIRL (related discussion), more on which in later posts.
As discussed in Post #3 (Section 3.4.3), endowing our learning algorithms with an innate curiosity drive seems like it may be necessary for it to develop into a powerful AGI (after training). Unfortunately, putting curiosity into our AGIs is a terribly dangerous thing to do. Why? Because if an AGI is motivated to satisfy its own curiosity, it may do so at the expense of other things we care about much more, like human flourishing and so on.
(For example, if the AGI is sufficiently curious about patterns in digits of π, it might feel motivated to wipe out humanity and plaster the Earth with supercomputers calculating ever more digits!)
As luck would have it, I also argued in Post #3 (Section 3.4.3) that we can probably turn the curiosity drive off when an AGI is sufficiently intelligent, without harming its capabilities—indeed, turning it off should eventually help its capabilities! Awesome!! But there’s still a tricky failure mode that involves waiting too long before turning it off.
There are many different value functions (defined on different world-models) that agree with the actual history of ground-truth reward signals, but where the different possible value functions each generalize out-of-sample in their own ways. To take an easy example, whatever is the history of ground-truth reward signals, the wireheading value function (“I like it when there’s a ground-truth reward signal”—see Post #9, Section 9.4) is always trivially consistent with it!
Or compare “negative reward for lying” to “negative reward for getting caught lying”!
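Here is a small sketch of the lying example. The `Episode` type, the training history, and both candidate value functions are hypothetical constructions for illustration. In this made-up history every lie happened to get caught, so the two candidates are indistinguishable on the data seen so far.

```python
from typing import NamedTuple


class Episode(NamedTuple):
    lied: bool
    caught: bool


# Hypothetical training history of (episode, ground-truth reward) pairs.
# In this invented history, every lie happened to be caught.
HISTORY = [
    (Episode(lied=False, caught=False), 0.0),
    (Episode(lied=True, caught=True), -1.0),
    (Episode(lied=False, caught=False), 0.0),
    (Episode(lied=True, caught=True), -1.0),
]


def value_lying_is_bad(ep):
    """Candidate 1: lying itself is what gets penalized."""
    return -1.0 if ep.lied else 0.0


def value_getting_caught_is_bad(ep):
    """Candidate 2: only lies that are detected get penalized."""
    return -1.0 if (ep.lied and ep.caught) else 0.0


def fits_history(value_fn):
    """True if the candidate reproduces every reward seen so far."""
    return all(value_fn(ep) == reward for ep, reward in HISTORY)


# A situation never seen in training: a lie that goes undetected.
novel = Episode(lied=True, caught=False)

print(fits_history(value_lying_is_bad), fits_history(value_getting_caught_is_bad))
print(value_lying_is_bad(novel), value_getting_caught_is_bad(novel))
```

Both candidates fit the history perfectly, yet they disagree on the undetected lie: the first still assigns it a penalty and the second does not. Nothing in the reward history says which one was learned.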
This is an especially severe problem for AGI because the space of all possible thoughts / plans is bound to extend far beyond what the AGI has already seen. For example, the AGI could conceive of the idea of inventing a new invention, or the idea of killing its operator, or the idea of hacking into its own ground-truth reward signal, or the idea of opening a wormhole to an alternate dimension! In all those cases, the value function is given the impossible task of evaluating a thought it’s never seen before. It does the best it can—basically, it pattern-matches bits and pieces of the new thought to various old thoughts on which it has ground-truth data. This process seems fraught!
In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.
I discussed “credit assignment” in Post #9, Section 9.3. In this case, “credit assignment” is when the value function updates itself by (something like) Temporal Difference (TD) learning from ground-truth-reward. The underlying algorithm, I argued, relies on the assumption that the AGI has properly modeled the cause of the reward. For example, if Tessa punches me in the stomach, it might make me a bit viscerally skittish when I see her in the future. But if I had mistaken Tessa for her identical twin Jessa, I would be viscerally skittish around Jessa instead. That would be a “credit assignment failure”. A nice example of credit assignment failure is human superstitions.
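As a rough sketch of the Tessa / Jessa example, here is a textbook-style TD(0) update applied to a tiny table of concept values. This is an assumption-laden toy: the learning rate, the discount, and the idea of a value table keyed by concept name are illustrative choices, and the brain-like algorithm is only claimed to be "something like" TD learning. The point is that the update lands on whichever concept the world-model blames for the reward.

```python
ALPHA = 0.5  # learning rate (illustrative choice)
GAMMA = 0.9  # discount factor (illustrative choice)


def td_update(values, blamed_concept, reward, next_value=0.0):
    """Apply one TD(0) update to the concept that the world-model blames.

    Returns a new dict and leaves the input untouched.
    """
    updated = dict(values)
    td_error = reward + GAMMA * next_value - updated[blamed_concept]
    updated[blamed_concept] += ALPHA * td_error
    return updated


initial = {"Tessa": 0.0, "Jessa": 0.0}

# Tessa punches me (reward = -1) and I correctly recognize her.
correct = td_update(initial, "Tessa", reward=-1.0)

# Tessa punches me, but I mistake her for her twin Jessa.
mistaken = td_update(initial, "Jessa", reward=-1.0)

print("correct credit assignment: ", correct)
print("mistaken credit assignment:", mistaken)
```

The arithmetic is identical in both cases. The only difference is which entry gets the negative update, which is why a wrong causal model turns a perfectly ordinary learning step into a credit assignment failure.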
The previous subsection (ambiguity in the reward signal) is one reason that credit assignment failures could happen. There are other reasons as well. For example, credit can only go to concepts in the AGI’s world-model (Post #9, Section 9.3), and it could be the case that the AGI’s world-model simply has no concept that aligns well with the ground-truth reward function. In particular, that would certainly be the case early on in training, when the AGI’s world-model has no concepts for anything whatsoever—see Post #2.
It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures. The reason that the AGI might wind up with such a motivation is discussed below (Section 10.5.4).
An ontological crisis is when part of an agent’s world-model needs to be re-built on a new foundation. A typical human example is if a religious person has a crisis of faith, and then finds that their previous goals (e.g. “get into heaven”) are incoherent (“but there is no heaven!”).
As an AGI example, let’s say I build an AGI with the goal “Do what I, the human, want you to do”. Maybe the AGI starts with a primitive understanding of human psychology, and thinks of me as a monolithic rational agent. So then “Do what I, the human, want you to do” is a nice, well-defined goal. But then later on, the AGI develops a more sophisticated understanding of human psychology, and it realizes that I have contradictory goals, and context-dependent goals, and I have a brain made of neurons and so on. Maybe the AGI’s goal is still “Do what I, the human, want you to do”, but now it’s not so clear what exactly that refers to, in its updated world model. How does that shake out? I think it’s not obvious.
An unfortunate aspect of ontological crises (and not unique to them) is that you don’t know when they will strike. Maybe you’re seven years into deployment, and the AGI has been scrupulously helpful the whole time, and you’ve been trusting the AGI with more and more autonomy, and then the AGI happens to be reading some new philosophy book, and it converts to panpsychism (nobody’s perfect!), and as it maps its existing values onto its reconceptualized world, it finds itself no longer valuing the lives of humans over the lives of rocks, or whatever.
Suppose that we want our AGI to obey the law. We can ask two questions:
If the answers are yes and no respectively (or no and yes respectively), that would be the AGI analog of an ego-dystonic motivation. (Related discussion.) It would lead to the AGI feeling motivated to change its motivation, for example by hacking into itself. Or if the AGI is built from perfectly secure code running on a perfectly secure operating system (hahaha), then it can’t hack into itself, but it could still probably manipulate its motivation by thinking thoughts in a way that manipulates the credit-assignment process (see discussion in Post #9, Section 9.3.3).
If the answers to questions 1 & 2 are yes and no respectively, then we want to prevent the AGI from manipulating its own motivation. On the other hand, if the answers are no and yes respectively, then we want the AGI to manipulate its own motivation!
(There can be even-higher-order preferences too: in principle, an AGI could wind up hating the fact that it values the fact that it hates the fact that it values obeying the law.)
In general, should we expect misaligned higher-order preferences to occur?
On the one hand, suppose we start with an AGI that wants to obey the law, but has no particular higher-order preference one way or the other about the fact that it wants to obey the law. Then (it seems to me), the AGI is very likely to also wind up wanting to want to obey the law (and wanting to want to want to obey the law, etc.). The reason is: the primary obvious consequence of “I want to obey the law” is “I will obey the law”, which is already desired. Remember, the AGI can do means-end reasoning, so things that lead to desirable consequences tend to become themselves desirable.
On the other hand, humans do in fact have higher-order preferences that contradict object-level preferences all the time. So there has to be some context in which that pattern occurs “naturally”. I think a common way this comes up is if we have a preference about some process which contradicts our preference about a consequence of that same process. For example, maybe I have a preference not to practice skateboarding (e.g. because it’s boring and painful), but I also have a preference to have practiced skateboarding (e.g. because then I’ll have gotten really good at skateboarding and thus win the heart of my high-school crush). Means-end reasoning can turn the latter preference into a second-order preference for having a preference to practice skateboarding. And now I’m in an ego-dystonic state.
As the AGI online-learns (Post #8, Section 8.2.2), especially via credit assignment (Post #9, Section 9.3), the value function keeps changing. This isn’t optional: remember, the value function started out random! This online-learning is how we get a good value function in the first place!
Unfortunately, as we saw in Section 10.3.2 above, “prevent my goals from changing” is one of those convergent instrumental subgoals that arises for many different motivations, with the notable exception of corrigible motivations (Section 10.3.2.3 above). Thus, it seems that we need to navigate a terrifying handoff between two different safety stories:
(I am deliberately omitting a third alternative, “make it impossible for even a highly-intelligent-and-motivated AGI to manipulate its value-function update process”. That would be lovely, but it doesn’t seem realistic to me.)
In the previous post, I mentioned the following dilemma:
I think the best way to think through this dilemma is to step outside the inner-alignment versus outer-alignment dichotomy.
At any given time, the value function Thought Assessor is encoding some function that estimates which plans are good or bad.
A credit-assignment update is good if it makes this estimate align more with the designer’s intention, and bad if it makes this estimate align less with the designer’s intention.
The thought “I will secretly hack into my own Steering Subsystem” is almost certainly not aligned with the designer’s intention. So a credit-assignment update that assigns more positive valence to “I will secretly hack into my own Steering Subsystem” is a bad update. We don’t want it. Does it increase “inner alignment”? I think we have to say “yes it does”, because it leads to better reward predictions! But I don’t care. I still don’t want it. It’s bad bad bad. We need to figure out how to prevent that particular credit-assignment Thought Assessor update from happening.
I think there’s a broader lesson here. I think “outer alignment versus inner alignment” is an excellent starting point for thinking about the alignment problem. But that doesn’t mean we should expect one solution to outer alignment, and a different unrelated solution to inner alignment. Some things—particularly interpretability—cut through both outer and inner layers, creating a direct bridge from the designer’s intentions to the AGI’s goals. We should be eagerly searching for things like that.
For example, by my definitions, “safety without alignment” would include AGI boxing, and “alignment without safety” would include the “fusion power generator scenario”. More in the next post.
Note that “the designer’s intention” may be vague or even incoherent. I won’t say much about that possibility in this series, but it’s a serious issue that leads to all sorts of gnarly problems.
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.
One could train an AGI to “tell me the right answer” on questions where I know the right answer, and hope that it generalizes to “tell me the right answer” on questions where I don’t. That might work, but it also might generalize to “tell me the answer which I will think is right”. See “Eliciting Latent Knowledge” for much more on this still-unsolved problem (here and follow-up).
For one thing, if two AGIs are in zero-sum competition, that doesn’t mean that neither will be able to hack into the other. Remember online learning and brainstorming: One copy might have a good idea about how to hack into the other copy during the course of the debate, for example. The offense-defense balance is unclear. For another thing, they could both be jointly motivated to hack into the judge, such that then they can both get rewards! And finally, thanks to the inner alignment problem, just because they are rewarded for winning the debate doesn’t mean that they’re “trying” to win the debate. They could be “trying” to do anything whatsoever! And in that case, again, it’s no longer a zero-sum competition; presumably both copies of the AGI would want the same thing and could collaborate to get it.
The story here is a bit more complicated than I’m letting on. In particular, a desire to have practiced skateboarding would lead to both a first-order preference to skateboard and a second-order preference to want to skateboard. By the same token, the desire not to practice skateboarding (because it’s boring and painful) would also spill into a desire not to want to skateboard. The key is that the relative weights can be different, such that the two conflicting first-order motivations can have a certain “winner”, while the two conflicting second-order motivations can have the opposite “winner”. Well, something like that, I think.
But what exactly are new ideas? It could be the case that intelligence is pattern-matching at its most granular level even for "novelties". What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution. However, this could come at its own cost.
It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures.
Is the use of "deliberately" here trying to account for the *thinking about its own thoughts*-part of going back and forth between thought generator and thought assessor?
I mean “new ideas” in the everyday human sense. “What if I make a stethoscope with an integrated laser vibrometer?” “What if I try to overthrow the US government using mind control beams?” I agree that, given that these are thinkable thoughts, they must be built out of bits and pieces of existing thoughts and ideas (using analogies, compositionality, etc.).
And then the value function will mechanically assign a value more-or-less based on the preexisting value of those bits and pieces. And my claim is that the result may not be in accordance with what we would have wanted.
What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution.
Yeah, more on that topic in §14.4. :-)
Yes to “thinking about its own thoughts”, no to “going back and forth between thought generator and thought assessor”.
Instead I would say, you can think about lots of things, like football and calculus and sleeping. Another thing you can think about is your own preferences. When you think about football or calculus or sleeping, it’s an activation pattern within your thought generator, and the Thought Assessors will assess it (positive valence vs negative valence, does or doesn't warrant cortisol release etc.). By the same token, when you think about your own preferences, the Thought Assessors will assess that thought as positive-valence vs negative-valence etc. So you can have preferences about your own (current and/or future) preferences, a.k.a. meta-preferences. And you can make plans that will result in you having certain preferences, and those plans are likely to be appealing if they align with your meta-preferences.
So if I think that reading nihilist philosophy books might lead to me no longer caring about the welfare of my children, I will feel some motivation not to read nihilist philosophy books. By the same token, if the AGI wants to like or dislike something, I think there’s a reasonable chance that it will find a way to make that happen.