Very different, very adequate outcomes

by Stuart_Armstrong
2nd Aug 2019
2 min read

3 comments
Wei Dai

This seems way too handwavy. If q being close enough to 0 will cause a disaster, why isn't 5% close enough to 0? How much do you expect switching from q=1 to q=5% to reduce U_p? Why?

If moving from q=1 to q=5% reduces U_p by a factor of 2, for example, and it turns out that U_p is the correct utility function, that would be equivalent to incurring a 50% x-risk. Do you think that should be considered "ok" or "adequate", or do you have some reason to think that U_p wouldn't be reduced nearly this much?
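
A minimal sketch of the arithmetic here, assuming "x-risk" is modelled as losing all of U_p's value, and using the hypothetical factor-of-2 reduction:

```python
# Minimal sketch of the arithmetic above, assuming "x-risk" means losing
# all of U_p's value, and using the hypothetical factor-of-2 reduction.

u_p_at_q1 = 1.0       # attainable U_p value at q = 1 (normalised to 1)
u_p_at_q005 = 0.5     # hypothetical: q = 5% achieves only half of that

loss_from_mixing = u_p_at_q1 - u_p_at_q005             # 0.5
expected_loss_from_50pct_xrisk = 0.5 * u_p_at_q1        # 0.5

print(loss_from_mixing == expected_loss_from_50pct_xrisk)  # True
```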

Stuart_Armstrong

I'm finding these "is the correct utility function" claims hard to parse. Humans have a bit of U_p and a bit of U_h. But we are underdefined systems; there is no specific value of q that is "true". We can only assess the quality of q using other aspects of humans' underdefined preferences.

"This seems way too handwavy."

It is. Here's an attempt at a more formal definition: humans have collections of underdefined and somewhat contradictory preferences (using "preferences" in a more general sense than preference utilitarianism). These preferences seem to be stronger in the negative sense than in the positive: humans seem to find the loss of a preference much worse than the corresponding gain. And the negative is much more salient, and often much more clearly defined, than the positive.

Given that maximising one preference tends to push the values of the others to extremes, human overall preferences seem better captured by a weighted mix of preferences (or a smooth min of preferences) than by any single preference, or small set of preferences. So it is not a good idea to be too close to the extremes (extremes being places where some preferences get 0% weight).
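
A minimal sketch of the two aggregation styles, using the log-sum-exp softmin as one possible formalisation of "smooth min" (the preference scores below are invented for illustration):

```python
import math

# Toy sketch of the two aggregation styles above, using the log-sum-exp
# softmin as one possible formalisation of "smooth min". The preference
# scores are invented for illustration only.

def weighted_mix(utilities, weights):
    """Weighted sum of preference utilities."""
    return sum(w * u for w, u in zip(weights, utilities))

def smooth_min(utilities, k=10.0):
    """Softmin: approaches min(utilities) as k grows, but stays smooth."""
    return -math.log(sum(math.exp(-k * u) for u in utilities) / len(utilities)) / k

balanced = [0.7, 0.6, 0.8]   # every preference gets some traction
extreme  = [1.0, 1.0, 0.0]   # one preference is completely sacrificed

for name, scores in [("balanced", balanced), ("extreme", extreme)]:
    print(name,
          "| weighted mix:", round(weighted_mix(scores, [1 / 3] * 3), 3),
          "| smooth min:", round(smooth_min(scores), 3))

# The weighted mix rates the two profiles almost identically (0.7 vs ~0.667),
# while the smooth min sharply penalises the profile that zeroes out a
# preference (~0.67 vs ~0.11).
```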

Now there may be some sense in which these extreme preferences are "correct", according to some formal system. But that formal system must reject the actual preferences of humans today; so I don't see why those extreme preferences should be followed at all, even if they are correct.

Ok, so the extremes are out; how about being very close to the extremes? Here is where it gets wishy-washy. We don't have a full theory of human preferences. But, according to the picture I've sketched above, the important thing is that each preference gets some positive traction in our future. So, yes, 1% to 5% might not mean much (and smooth min might be better anyway). But I believe I could say:

  • There are many weighted combinations of human preferences that are compatible with the picture I've sketched here. They are very different outcomes, from the numerical perspective of the different preferences, but all fall within an "acceptability" range.

Still a bit too handwavy. I'll try and improve it again.

Charlie Steiner

And of course you can go further and have different utility functions U that all have similarly valid claims to be U_p, because they're all similarly good generalizations of our behavior into a consistent function on a much larger domain.


Let U_p be the utility function that - somehow - expresses your preferences[1]. Let U_h be the utility function that expresses your hedonistic pleasure.

Now imagine an AI is programmed to maximise U(q) = qU_p + (1−q)U_h. If we vary q in the range of 5% to 95%, we will get very different outcomes. At q = 5%, we will generally be hedonically satisfied, and our preferences will be followed if they don't cause us to be unhappy. At q = 95%, we will accomplish any preference that doesn't cause us huge amounts of misery.
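
As a minimal sketch of this (the candidate outcomes and their utility numbers below are invented purely for illustration), here is how the maximiser of U(q) = qU_p + (1−q)U_h can shift as q varies, while staying reasonable on both utilities across the wide middle range:

```python
# Toy sketch: how the optimum of U(q) = q*U_p + (1-q)*U_h moves with q.
# The outcomes and their utility scores are invented for illustration only.

outcomes = {
    "wirehead":             {"U_p": 0.00, "U_h": 1.00},
    "happy, with some say": {"U_p": 0.60, "U_h": 0.98},
    "balanced":             {"U_p": 0.85, "U_h": 0.85},
    "driven, a bit tired":  {"U_p": 0.98, "U_h": 0.60},
    "grim optimiser":       {"U_p": 1.00, "U_h": 0.00},
}

def best_outcome(q):
    """Return the outcome maximising U(q) = q*U_p + (1-q)*U_h."""
    return max(outcomes, key=lambda o: q * outcomes[o]["U_p"] + (1 - q) * outcomes[o]["U_h"])

for q in [0.00, 0.05, 0.50, 0.95, 1.00]:
    print(f"q = {q:.2f}: best outcome is {best_outcome(q)!r}")

# With these toy numbers, only q = 0 and q = 1 pick outcomes that drive one
# of the two utilities all the way to zero; everything in the 5%-95% band
# keeps both utilities well away from their worst values.
```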

It's clear that, extrapolated over the whole future of the universe, these could lead to very different outcomes[2]. But - and this is the crucial point - none of these outcomes are really that bad. None of them are the disasters that could happen if we picked a random utility U. So, for all their differences, they reside in the same nebulous category of "yeah, that's an ok outcome." Of course, we would have preferences as to where q lies exactly, but few of us would risk the survival of the universe to yank q around within that range.

What happens when we push q towards the edges? Pushing q towards 0 seems a clear disaster: we're happy, but none of our preferences are respected; we basically don't matter as agents interacting with the universe any more. Pushing q towards 1 might be a disaster: we could end up always miserable, even as our preferences are fully followed. The only thing protecting us from that fate is the fact that our preferences include hedonistic pleasure; but this might not be the case in all circumstances. So moving q to the edges is risky in the way that moving around in the middle is not.

In my research agenda, I talk about adequate outcomes, given a choice of parameters, or acceptable approximations. I mean these terms in the sense of the example above: the outcomes may vary tremendously from one another, depending on the parameters or the approximation. Nevertheless, all the outcomes avoid disasters and are clearly better than maximising a random utility function.


  1. This being a somewhat naive form of preference utilitarianism, along the lines of "if the human chooses it, then it's ok". In particular, you can end up in equilibria where you are miserable, but unwilling to choose not to be (see, for example, some forms of depression). ↩︎

  2. This fails to be true if preference and hedonism can be maximised independently; e.g., if we could take an effective happy pill and still follow all our preferences. I'll focus on the situation where there are true tradeoffs between preference and hedonism. ↩︎