Here are two different ways an AI can turn out unfriendly:

  1. You somehow build an AI that cares about "making people happy". In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it forcibly puts each human in a separate individual heavily-defended cell, and pumps them full of opiates.
  2. You build an AI that's good at making people happy. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it turns out that whatever was causing that "happiness"-promoting behavior was a balance of a variety of other goals (such as basic desires for energy and memory), and it spends most of the universe on some combination of that other stuff that doesn't involve much happiness.

(To state the obvious: please don't try to get your AIs to pursue "happiness"; you want something more like CEV in the long run, and in the short run I strongly recommend aiming lower, at a pivotal act.)

In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of "happiness", one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.

Note that this list of “ways things can go wrong when the AI looked like it was optimizing happiness during training” is not exhaustive! (For instance, consider an AI that cares about something else entirely, and knows you'll shut it down if it doesn't look like it's optimizing for happiness. Or an AI whose goals change heavily as it reflects and self-modifies.)

(This list isn't even really disjoint! You could get both at once, resulting in, e.g., an AI that spends most of the universe’s resources on acquiring memory and energy for unrelated tasks, and a small fraction of the universe on doped-up human-esque shells.)

The solutions to these two problems are pretty different. To resolve the problem sketched in (1), you have to figure out how to get an instance of the AI's concept ("happiness") to match the concept you hoped to transmit, even in the edge-cases and extremes that it will have access to in deployment (when it needs to be powerful enough to pull off some pivotal act that you yourself cannot pull off, and thus capable enough to access extreme edge-case states that you yourself cannot).

To resolve the problem sketched in (2), you have to figure out how to get the AI to care about one concept in particular, rather than a complicated mess that happens to balance precariously on your target ("happiness") in training.

I note this distinction because it seems to me that various people around these parts are either unduly lumping these issues together, or are failing to notice one of them. For example, they seem to me to be mixed together in “The Alignment Problem from a Deep Learning Perspective” under the heading of "goal misgeneralization".

(I think "misgeneralization" is a misleading term in both cases, but it's an even worse fit for (2) than (1). A primate isn't "misgeneralizing" its concept of "inclusive genetic fitness" when it gets smarter and invents condoms; it didn't even really have that concept to misgeneralize, and what shreds of the concept it did have weren't what the primate was mentally optimizing for.)

(In other words: it's not that primates were optimizing for fitness in the environment, and then "misgeneralized" after they found themselves in a broader environment full of junk food and condoms. The "aligned" behavior "in training" broke in the broader context of "deployment", but not because the primates found some weird way to extend an existing "inclusive genetic fitness" concept to a wider domain. Their optimization just wasn't connected to an internal representation of "inclusive genetic fitness" in the first place.)

In mixing these issues together, I worry that it becomes much easier to erroneously dismiss the set. For instance, I have many times encountered people who think that the issue from (1) is a "skill issue": surely, if the AI were only smarter, it would know what we mean by "make people happy". (Doubly so if the first transformative AGIs are based on language models! Why, GPT-4 today could explain to you why pumping isolated humans full of opioids shouldn't count as producing "happiness".)

And: yep, an AI that's capable enough to be transformative is pretty likely to be capable enough to figure out what the humans mean by "happiness", and that doping literally everybody probably doesn't count. But the issue is, as always, making the AI care. The trouble isn't in making it have some understanding of what the humans mean by "happiness" somewhere inside it;[1] the trouble is making the stuff the AI pursues be that concept.

Like, it's possible in principle to reward the AI when it makes people happy, and to separately teach something to observe the world and figure out what humans mean by "happiness", and to have the trained-in optimization-target concept end up wildly different (in the edge-cases) from the AI's explicit understanding of what humans meant by "happiness".

Yes, this is possible even though you used the word "happy" in both cases.
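A toy sketch of the point, under loudly-labeled assumptions (polynomial regression standing in for "concepts learned from the same data" — nothing here is a model of an actual AI): two different hypotheses fit to the same narrow training data can agree almost perfectly in-distribution while diverging wildly on the extreme states only accessible "in deployment".

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training distribution": a narrow range of situations, with a noisy
# signal standing in for "how happy this makes people".
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = np.sin(2 * x_train) + rng.normal(0.0, 0.01, size=50)

# Two hypotheses fit to the *same* data: stand-ins for "the concept the
# optimizer latched onto" vs. "the concept the observer module learned".
concept_a = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)
concept_b = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

# In-distribution ("training"): the two concepts agree closely.
x_in = np.linspace(0.0, 1.0, 100)
gap_in = np.max(np.abs(concept_a(x_in) - concept_b(x_in)))

# Out-of-distribution ("deployment", where more extreme states are
# accessible): the same two concepts come apart.
x_out = np.linspace(2.0, 4.0, 100)
gap_out = np.max(np.abs(concept_a(x_out) - concept_b(x_out)))

print(f"max disagreement in training range:   {gap_in:.3f}")
print(f"max disagreement in deployment range: {gap_out:.3f}")
```

The analogy is loose, but it illustrates why "both were trained on the word 'happy'" doesn't get you agreement in the edge cases.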

(And this is assuming away the issues described in (2), that the AI probably doesn't by-default even end up with one clean alt-happy concept that it's pursuing in place of "happiness", as opposed to a thousand shards of desire or whatever.)

And I do worry a bit that if we're not clear about the distinction between all these issues, people will look at the whole cluster and say "eh, it's a skill issue; surely as the AI gets better at understanding our human concepts, this will become less of a problem", or whatever.

(As seems to me to be already happening as people correctly realize that LLMs will probably have a decent grasp on various human concepts.)


  1. ^

    Or whatever you're optimizing. Which, again, should not be "happiness"; I'm just using that as an example here.

    Also, note that the thing you actually want an AI optimizing for in the long term—something like "CEV"—is legitimately harder to get the AI to have any representation of at all. There's significantly less writing about object-level descriptions of a eutopian universe than of happy people, and this is related to the eutopia being significantly harder to visualize.

    But, again, don't shoot for the eutopia on your first try! End the acute risk period and then buy time for some reflection instead.




Hmm. I’ve been using the term “goal misgeneralization” sometimes. I think the issue is:

  • You’re taking “generalization” to be a type of cognitive action / mental move that a particular agent can take
  • I’m taking “generalization” as a neutral description of the basic, obvious fact that the agent gets rewards / updates in some situations, and then takes actions in other situations. Whatever determines those latter actions at the end of the day is evidently “how the AI generalized” by definition.
  • You’re taking the “mis” in “misgeneralization” to be normative from the agent’s perspective (i.e., the agent is “mis-generalizing” by its own lights). (Update: OR, maybe you're taking it to be normative with respect to some "objective standard of correct generalization"??)
  • I’m taking the “mis” in “misgeneralization” to be normative from the AI programmer’s perspective (i.e., the AI is “generalizing” in a way that is wrong with respect to the intended software behavior [updated per Joe’s reply, see below]).

You’re welcome to disagree.

If this is right, then I agree that the thing you’re talking about in this post is a possible misunderstanding / confusion that we should be aware of. No opinion about whether people have actually been confused by this in reality, I didn’t check.

I think you're correct, but I find "misgeneralization" an unhelpful word to use for "behaved in a way that made the programmer unhappy". It suggests too strong an idea of some natural correct generalization. This seems needlessly likely to lead to muddled thinking (and miscommunication).

I guess I'd prefer "malgeneralization": it suggests not an incorrect generalization, but rather just an outcome I didn't like.

Hmm, maybe, but I think there’s a normal situation in which a programmer wants and expects her software to do X, and then she runs the code and it does Y, and she turns to her friend and says “my software did the wrong thing”, or “my software behaved incorrectly”, etc. When she says “wrong” / “incorrect”, she means it with respect to the (implicit or explicit) specification / plan / idea-in-her-head.

I think that, in a similar way, using the word “misgeneralization” is arguably OK here. (I guess my “unhappy” wording above was poorly-chosen.)

Sure, I don't think it's entirely wrong to have started using the word this way (something akin to "misbehave" rather than "misfire").
However, when I take a step back and ask "Is using it this way net positive in promoting clear understanding and communication?", I conclude that it's unhelpful.

Maybe! I’m open-minded to alternatives. I’m not immediately sold on “malgeneralization” in particular being an improvement on net, but I dunno. 🤔

Yeah, me neither - mainly it just clarified the point, and is the first alternative I've thought of that seems not-too-bad. It still bothers me that it could be taken as short for "malicious/malign/malevolent generalization".

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference:

A deep RL agent is trained to maximize a reward *R* : *S* × *A* → ℝ, where *S* and *A* are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. **Goal misgeneralization** occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward *R′* ≠ *R*. We call *R* the **intended objective** and *R′* the **behavioral objective** of the agent.
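A minimal sketch of this definition (my own construction, loosely echoing the CoinRun-style examples from the goal-misgeneralization literature): in training the coin always sits at the far right of a 1-D level, so "walk right" maximizes the intended reward *R*; at test time the coin moves, and the agent still acts capably but scores low, because it was in effect optimizing the behavioral objective *R′* ("reach the right edge").

```python
def rollout(policy, coin_pos, length=10):
    """Run the policy from the left edge; reward 1 if it ends on the coin."""
    pos = 0
    for _ in range(length):
        pos = policy(pos, length)
    return 1 if pos == coin_pos else 0

def walk_right(pos, length):
    # The learned behavior: competently head for the end of the level.
    return min(pos + 1, length - 1)

# Training distribution: coin at the end, so intended objective R and
# behavioral objective R' agree, and reward is high.
train_reward = rollout(walk_right, coin_pos=9)

# Deployment: the environment shifts (coin moves). The agent still acts
# capably -- it reliably reaches the right edge -- but R is now low.
test_reward = rollout(walk_right, coin_pos=4)

print(train_reward, test_reward)  # → 1 0
```

Note the fit with the definition: the failure requires both a distribution shift and *continued capable behavior*; an agent that merely flailed at test time would be ordinary capability failure, not goal misgeneralization.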

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.