Attainable Utility Theory: Why Things Matter

by Alex Turner · 1 min read · 27th Sep 2019 · 10 comments

If you haven't read the prior posts, please do so now. This sequence can be spoiled.

¯\_(ツ)_/¯

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

Also, there's a bunch of different things that "want" could mean. Is that something you've thought about and if so, is it important to pick the right sense of "want"?

(BTW, in these kinds of sequences I never know whether to ask a question midway through or to wait and see if it will be resolved later. Maybe it would help to have a table of contents at the start? Or should I just ask and let the author say that they'll be answered later in the sequence?)

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

This is not quite what you're asking for, but I have a post on ways people have thought AIs that minimise 'impact' should behave in certain situations, and you can go through and see what the notion of 'impact' given in this post would advise. [ETA: although that's somewhat tricky, since this post only defines 'impact' and doesn't say how an agent should behave to minimise it]

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

The next post will cover this.

there's a bunch of different things that "want" could mean. Is that something you've thought about and if so, is it important to pick the right sense of "want"?

I haven't considered this at length yet. Since we're only thinking descriptively right now, and in light of where the sequence is headed, I don't know that it's important to nail down the right sense. That said, I'm still quite interested in doing so.

In terms of the want/like distinction (keeping in mind that want is being used in its neuroscientific that-which-motivates sense, and not the sense I've been using in the post), consider the following:

A University of Michigan study analyzed the brains of rats eating a favorite food. They found separate circuits for "wanting" and "liking", and were able to knock out either circuit without affecting the other... When they knocked out the "liking" system, the rats would eat exactly as much of the food without making any of the satisfied lip-licking expression, and areas of the brain thought to be correlated with pleasure wouldn't show up in the MRI. Knock out "wanting", and the rats seem to enjoy the food as much when they get it but not be especially motivated to seek it out. (Source: "Are wireheads happy?")

Imagining my "liking" system being forever disabled feels pretty terrible, but not maximally negatively impactful (because I also have preferences about the world, not just how much I enjoy my life). Imagining my "wanting" system being disabled feels similar to imagining losing significant executive function - it's not that I wouldn't be able to find value in life, but my future actions now seem unlikely to be pushing my life and the world towards outcomes I prefer. Good things still might happen, and I'd like that, but they seem less likely to come about.

The above is still cheating, because I'm using "preferences" in my speculation, but I think it helps pin down things a bit. It seems like there's some combination of liking/endorsing for "how good things are", while "wanting" comes into play when I'm predicting how I'll act (more on that in two posts, along with other embedded agentic considerations re: "ability to get").

Or should I just ask and let the author say that they'll be answered later in the sequence?

Doing this is fine! We're basically past the point where I wanted to avoid past framings, so people can talk about whatever (although I reserve the right to reply "this will be much easier to discuss later").


Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"
The next post will cover this.

(no way to double quote it seems...maybe nested BBCode?)

Anyhow, looking forward to that, as I was struggling a bit to see how the claim that something cannot be a big deal if it doesn't impact my getting what I want could avoid being tautological.

Well, the claim is tautological, after all! The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility). By default, one considers what "big deals" have in common, and then thinks about not breaking vases / not changing too much stuff in the world state. This attractor is so strong that when I said, "wait, maybe it's not primarily about vases or objects", it didn't make sense.

The point of the first portion of the sequence isn't to amaze people with the crazy surprising insane twists I've discovered in what impact really is about - it's to show how things add up to normalcy, so as to set the stage for a straightforward discussion about one promising direction I have in mind for averting instrumental incentives.

The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility).

Agreed. This has been my impression from reading previous work on impact.

Let me substantiate my claim a bit with a random sampling; I just pulled up a relative reachability blogpost. From the first paragraph (emphasis mine):

An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.

But notice now we're talking about "disruption of the agent's environment". Relative reachability is indeed tackling the impact measure problem, so using what we now understand we might prefer to reframe as:

We think about "side effects" when they change our attainable utilities, so they're really just a conceptual discretization of "things which negatively affect us". We want the robot to prefer policies which avoid overly changing our attainable utilities. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because it's not that easy for us to repair the vase...

Imagine a planet with aliens living on it. Some of those aliens are having what we would consider morally valuable experiences. Some are suffering a lot. Suppose we now find that their planet has been vaporized. By tuning the relative amounts of joy and suffering, we can make it so that the vaporization is exactly neutral under our morality. This feels like a big deal, even if the aliens were in an alternate reality that we could watch but not affect.

Our intuitive feeling of impact is a proxy for how much something affects our values and our ability to achieve them. You can set up contrived situations where an event doesn't actually affect our ability to achieve our values, but still triggers the proxy.

Would the technical definition that you are looking for be value of information? That is, does feeling something to be impactful mean that a bunch of mental heuristics think it has a large value of information?

Can you elaborate the situation further? I’m not sure I follow where the proxy comes apart, but I’m interested in hearing more.

An alien planet contains joy and suffering in a ratio that makes them exactly cancel out according to your morality. You are exactly ambivalent about the alien planet blowing up. The alien planet can't be changed by your actions, so you don't need to cancel plans to go there and reduce the suffering when you find out that the planet blew up. Say that they existed long ago. In general we are setting up the situation so that the planet blowing up doesn't change your expected utility, or the best action for you to take. We set this up by a pile of contrivances. This still feels impactful.
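To make the contrivance concrete, here is a minimal sketch in terms of value of information, assuming a toy decision problem with two states and two actions. The function names, action labels, and payoff numbers are all hypothetical; they only illustrate the case where learning the planet's fate changes neither expected utility nor the best action, next to a contrasting case where it does.

```python
from typing import Dict

Payoffs = Dict[str, Dict[str, float]]  # payoffs[action][state] -> utility


def expected_utility(utilities: Dict[str, float], beliefs: Dict[str, float]) -> float:
    """Expected utility of one action under a belief distribution over states."""
    return sum(beliefs[state] * utilities[state] for state in beliefs)


def value_of_information(payoffs: Payoffs, beliefs: Dict[str, float]) -> float:
    """Expected gain from learning the true state before choosing an action."""
    # Best we can do if we must act without knowing the state.
    best_uninformed = max(expected_utility(payoffs[a], beliefs) for a in payoffs)
    # Best we can do if we learn the state first, averaged over states.
    best_informed = sum(
        beliefs[state] * max(payoffs[a][state] for a in payoffs)
        for state in beliefs
    )
    return best_informed - best_uninformed


beliefs = {"planet_intact": 0.5, "planet_vaporized": 0.5}

# Hypothetical payoffs for the contrived setup: the planet's fate changes
# neither our utility nor which action is best.
contrived: Payoffs = {
    "carry_on": {"planet_intact": 1.0, "planet_vaporized": 1.0},
    "change_plans": {"planet_intact": 0.5, "planet_vaporized": 0.5},
}

# Hypothetical contrast: we could have gone there and helped, so knowing
# the planet's fate changes which action is best.
reachable: Payoffs = {
    "go_help": {"planet_intact": 2.0, "planet_vaporized": 0.0},
    "stay_home": {"planet_intact": 1.0, "planet_vaporized": 1.0},
}

print(value_of_information(contrived, beliefs))
print(value_of_information(reachable, beliefs))
```

Under these made-up numbers the contrived case has zero value of information while the contrast case does not, so if the vaporization still feels impactful, that feeling is tracking something other than this quantity.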

That doesn't feel at all impactful to me, under those assumptions. It feels like I've learned a new fact about the world, which isn't the same feeling. ETA: Another example of this was mentioned by Slider: if you're a taxi driver ambivalent between different destinations, and the client announces where they want to go, it feels like you've learned something but doesn't feel impactful (in the way I'm trying to point at).

I think an issue we might run into here is that I don't exist in your mind, and I've tried to extensionally define for you what I'm pointing at. So if you try to find edge cases according to your understanding of exactly which emotion I'm pointing to, then you'll probably be able to, and it could be difficult for me to clarify without access to your emotions. That said, I'm still happy to try, and I welcome this exploration of how what I've claimed lines up with others' experiences.