So I guess, more specifically, what I'm trying to ask is: how do we distinguish between interpreting the good thing as "human intentions for the agent" and interpreting it as "human goals"?

In other words, we have at least four options here:

1. AI intends to do what the human wants it to do.

2. AI actually achieves what the human wants it to do.

3. AI intends to pursue the human's true goals.

4. AI actually achieves the human's true goals.

So right now intent alignment (as specified by Paul) describes 1, and outcome alignment (as I'm inferring from your description) describes 4. But it seems quite important to have a name for 3 in particular.

[ Question ]

What is the alternative to intent alignment called?

by ricraz · 1 min read · 30th Apr 2020 · 3 comments


Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)?

Secondly, this seems to basically map onto the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation?

(Intent alignment definition from
