Very different, very adequate outcomes

by Stuart_Armstrong
2nd Aug 2019
2 min read

3 comments
Wei Dai

This seems way too handwavy. If q being close enough to 0 will cause a disaster, why isn't 5% close enough to 0? How much do you expect switching from q=1 to q=5% to reduce U_p? Why?

If moving from q=1 to q=5% reduces U_p by a factor of 2, for example, and it turns out that U_p is the correct utility function, that would be equivalent to incurring a 50% x-risk. Do you think that should be considered "ok" or "adequate", or do you have some reason to think that U_p wouldn't be reduced nearly this much?
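
A minimal sketch of the arithmetic here, assuming "x-risk" is modelled as losing all of U_p's value, and using the hypothetical factor-of-2 reduction:

```python
# Minimal sketch of the arithmetic above, assuming "x-risk" means losing
# all of U_p's value, and using the hypothetical factor-of-2 reduction.

u_p_at_q1 = 1.0       # attainable U_p value at q = 1 (normalised to 1)
u_p_at_q005 = 0.5     # hypothetical: q = 5% achieves only half of that

loss_from_mixing = u_p_at_q1 - u_p_at_q005             # 0.5
expected_loss_from_50pct_xrisk = 0.5 * u_p_at_q1        # 0.5

print(loss_from_mixing == expected_loss_from_50pct_xrisk)  # True
```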

Stuart_Armstrong

I'm finding these "is the correct utility function" claims hard to parse. Humans have a bit of U_p and a bit of U_h. But we are underdefined systems; there is no specific value of q that is "true". We can only assess the quality of q using other aspects of humans' underdefined preferences.

"This seems way too handwavy."

It is. Here's an attempt at a more formal definition: humans have collections of underdefined and somewhat contradictory preferences (using "preferences" in a more general sense than preference utilitarianism). These preferences seem to be stronger in the negative sense than in the positive: humans seem to find the loss of a preference much worse than the corresponding gain. And the negative is much more salient, and often much more clearly defined, than the positive.

Given that maximising one preference tends to push the values of the others to extremes, human overall preferences seem better captured by a weighted mix of preferences (or a smooth min of preferences) than by any single preference, or small set of preferences. So it is not a good idea to be too close to the extremes (extremes being places where some preferences get 0% weight).
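
A minimal sketch of the two aggregation styles, using the log-sum-exp softmin as one possible formalisation of "smooth min" (the preference scores below are invented for illustration):

```python
import math

# Toy sketch of the two aggregation styles above, using the log-sum-exp
# softmin as one possible formalisation of "smooth min". The preference
# scores are invented for illustration only.

def weighted_mix(utilities, weights):
    """Weighted sum of preference utilities."""
    return sum(w * u for w, u in zip(weights, utilities))

def smooth_min(utilities, k=10.0):
    """Softmin: approaches min(utilities) as k grows, but stays smooth."""
    return -math.log(sum(math.exp(-k * u) for u in utilities) / len(utilities)) / k

balanced = [0.7, 0.6, 0.8]   # every preference gets some traction
extreme  = [1.0, 1.0, 0.0]   # one preference is completely sacrificed

for name, scores in [("balanced", balanced), ("extreme", extreme)]:
    print(name,
          "| weighted mix:", round(weighted_mix(scores, [1 / 3] * 3), 3),
          "| smooth min:", round(smooth_min(scores), 3))

# The weighted mix rates the two profiles almost identically (0.7 vs ~0.667),
# while the smooth min sharply penalises the profile that zeroes out a
# preference (~0.67 vs ~0.11).
```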

Now there may be some sense in which these extreme preferences are "correct", according to some formal system. But that formal system must reject the actual preferences of humans today; so I don't see why those extreme preferences should be followed at all, even if they are correct.

Ok, so the extremes are out; how about being very close to the extremes? Here is where it gets wishy-washy. We don't have a full theory of human preferences. But, according to the picture I've sketched above, the important thing is that each preference gets some positive traction in our future. So, yes, 1% to 5% might not mean much (and smooth min might be better anyway). But I believe I could say:

  • There are many weighted combinations of human preferences that are compatible with the picture I've sketched here. They are very different outcomes, from the numerical perspective of the different preferences, but all fall within an "acceptability" range.

Still a bit too handwavy. I'll try and improve it again.

Charlie Steiner

And of course you can go further and have different utility functions U that all have similarly valid claims to be U_p, because they're all similarly good generalizations of our behavior into a consistent function on a much larger domain.


Let U_p be the utility function that - somehow - expresses your preferences[1]. Let U_h be the utility function that expresses your hedonistic pleasure.

Now imagine an AI is programmed to maximise U(q) = qU_p + (1−q)U_h. If we vary q in the range of 5% to 95%, we will get very different outcomes. At q = 5%, we will generally be hedonically satisfied, and our preferences will be followed if they don't cause us to be unhappy. At q = 95%, we will accomplish any preference that doesn't cause us huge amounts of misery.
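
As a minimal sketch of this (the candidate outcomes and their utility numbers below are invented purely for illustration), here is how the maximiser of U(q) = qU_p + (1−q)U_h can shift as q varies, while staying reasonable on both utilities across the wide middle range:

```python
# Toy sketch: how the optimum of U(q) = q*U_p + (1-q)*U_h moves with q.
# The outcomes and their utility scores are invented for illustration only.

outcomes = {
    "wirehead":             {"U_p": 0.00, "U_h": 1.00},
    "happy, with some say": {"U_p": 0.60, "U_h": 0.98},
    "balanced":             {"U_p": 0.85, "U_h": 0.85},
    "driven, a bit tired":  {"U_p": 0.98, "U_h": 0.60},
    "grim optimiser":       {"U_p": 1.00, "U_h": 0.00},
}

def best_outcome(q):
    """Return the outcome maximising U(q) = q*U_p + (1-q)*U_h."""
    return max(outcomes, key=lambda o: q * outcomes[o]["U_p"] + (1 - q) * outcomes[o]["U_h"])

for q in [0.00, 0.05, 0.50, 0.95, 1.00]:
    print(f"q = {q:.2f}: best outcome is {best_outcome(q)!r}")

# With these toy numbers, only q = 0 and q = 1 pick outcomes that drive one
# of the two utilities all the way to zero; everything in the 5%-95% band
# keeps both utilities well away from their worst values.
```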

It's clear that, extrapolated over the whole future of the universe, these could lead to very different outcomes[2]. But - and this is the crucial point - none of these outcomes are really that bad. None of them are the disasters that could happen if we picked a random utility U. So, for all their differences, they reside in the same nebulous category of "yeah, that's an ok outcome." Of course, we would have preferences as to where q lies exactly, but few of us would risk the survival of the universe to yank q around within that range.

What happens when we push q towards the edges? Pushing q towards 0 seems a clear disaster: we're happy, but none of our preferences are respected; we basically don't matter as agents interacting with the universe any more. Pushing q towards 1 might be a disaster: we could end up always miserable, even as our preferences are fully followed. The only thing protecting us from that fate is the fact that our preferences include hedonistic pleasure; but this might not be the case in all circumstances. So moving q to the edges is risky in the way that moving around in the middle is not.

In my research agenda, I talk about adequate outcomes, given a choice of parameters, or acceptable approximations. I mean these terms in the sense of the example above: the outcomes may vary tremendously from one another, depending on the parameters or the approximation. Nevertheless, all the outcomes avoid disasters and are clearly better than maximising a random utility function.


  1. This being a somewhat naive form of preference utilitarianism, along the lines of "if the human chooses it, then it's ok". In particular, you can end up in equilibria where you are miserable, but unwilling to choose not to be (see, for example, some forms of depression). ↩︎

  2. This fails to be true if preference and hedonism can be maximised independently; e.g., if we could take an effective happy pill and still follow all our preferences. I'll focus on the situation where there are true tradeoffs between preference and hedonism. ↩︎