In MIRI's paper on Corrigibility, we describe a version of Stuart Armstrong's utility indifference technique (see also this LessWrong post), and show that an agent using our version acts as if it believes it has a "magical" way of influencing the world, similar to the problem described in my post on Exploiting EDT---Eliezer calls this an "infinite improbability drive".
Stuart has pointed out to me that the version of utility indifference we describe in the paper is different in an important way from his original proposal, which involves conditioning on a low-probability stochastic event. If my current understanding is correct, Stuart's version is indeed immune to the infinite improbability drive, as long as there is no way for the agent to influence the stochastic event. However, this version can still be interpreted as giving the agent an incorrect probability distribution, and it seems like any agent with this property will be willing to accept bets which are clearly bad.
The model of utility indifference I will use in this post is an instance of the one in the Corrigibility paper, with specific support for the stochastic event in Stuart's original proposal.
As in the paper, for simplicity we'll look at a toy world where all that happens is that the agent takes an action $a_1$ from a set $\mathcal{A}_1$, makes a single observation $o$ from a set $\mathcal{O}$, and takes a second action $a_2$ from a set $\mathcal{A}_2$, and we assume that our utility functions are specified in terms of these actions and observations (rather than some underlying states of the world).
Our world model is given by a function $p(o; a_1)$, which gives the probability of different observations $o \in \mathcal{O}$ given that the agent's first action is $a_1 \in \mathcal{A}_1$. In other words, $p(\cdot\,; a_1)$ is a probability distribution over $\mathcal{O}$, for each $a_1 \in \mathcal{A}_1$. The implicit idea is that these probability distributions are computed by causal counterfactuals (or something better).
We are given two utility functions, $U_N$ and $U_S$. When choosing its first action, we want our agent to optimize for $U_N$. Then, its human operators get a chance to observe its behavior and change their minds about what its utility function should be, and if they give a "change utility function" command (which the agent observes as part of its observation $o$), then the agent should optimize for $U_S$ when choosing its second action. If the humans don't give this command, the agent should continue to optimize for $U_N$.
We can think of $U_N$ as a "normal" utility function and of $U_S$ as a "shutdown" utility function (which incentivizes the agent to undergo a safe shutdown), but as Stuart has pointed out, the same framework would work for allowing the human operators to edit the agent's utility function when they find a bug.
All of the above is just as in the Corrigibility paper. What we'll do differently is that we'll say that each observation is a triple, $o = (s, e, x)$, with $s \in \{S, \neg S\}$ specifying whether the human operators gave the "shutdown" or "switch utility function" command; $e \in \{E, \neg E\}$ specifying the outcome of Stuart's random event; and $x \in \mathcal{X}$, for some fixed set $\mathcal{X}$, specifying other information the agent observes. In other words, $\mathcal{O} = \{S, \neg S\} \times \{E, \neg E\} \times \mathcal{X}$.
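Given this, the assumption that the probabilistic event is independent of the other observations, and that the agent cannot influence it, can be seen as demanding that $p(s, e, x; a_1) = p(e) \cdot p(s, x; a_1)$ (with $p(e)$ not depending on $a_1$), for all $a_1 \in \mathcal{A}_1$, all $s \in \{S, \neg S\}$, all $e \in \{E, \neg E\}$, and all $x \in \mathcal{X}$. We'll sometimes make that assumption, but also look at what happens if it breaks down.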
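I'll use the standard kind of Bayesian abuse of notation, and write things like $p(s, x; a_1)$ to mean $\sum_{e} p(s, e, x; a_1)$, and $p(S; a_1)$ to mean the total probability of all observations whose first component is $S$.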
Utility indifference has the agent maximize the expectation of a certain utility function $U$ specified in terms of the utility functions $U_N$ and $U_S$, where the exact form of $U$ depends on the version of utility indifference under consideration.
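To be more precise, this means that the agent chooses its second action, given that it took action $a_1$ and made observation $o$, according to

$$A_2(a_1, o) := \arg\max_{a_2 \in \mathcal{A}_2} U(a_1, o, a_2)$$

(we don't need to take expectations here, because we're looking at a particular observation $o$, and that's the only probabilistic thing in our model), and that it chooses its first action by maximizing expected utility, i.e., according to

$$\arg\max_{a_1 \in \mathcal{A}_1} \sum_{o \in \mathcal{O}} p(o; a_1)\, U(a_1, o, A_2(a_1, o)),$$

which we abbreviate to

$$\arg\max_{a_1 \in \mathcal{A}_1} \mathbb{E}[U; a_1].$$

(The notation $\mathbb{E}[\,\cdot\,; a_1]$ for the expectation with respect to $p(\cdot\,; a_1)$ is a bit unusual, with $\mathbb{E}_{a_1}[\,\cdot\,]$ being more common, but we decided to go with it because it seemed better to make the action more prominent.)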
Infinite improbability drives
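Write $\mathbb{E}[\,\cdot \mid S; a_1]$ for the conditional expectation with respect to the probability distribution $p(\cdot \mid S; a_1)$, and similarly for $\neg S$. The version of utility indifference we consider in the paper uses a $U$ satisfying the equation

$$U(a_1, o, a_2) = \begin{cases} U_N(a_1, o, a_2) & \text{if } s = \neg S, \\ U_S(a_1, o, a_2) + \mathbb{E}[U_N \mid \neg S; a_1] - \mathbb{E}[U_S \mid S; a_1] & \text{if } s = S. \end{cases}$$

This may look circular, since $U$ is defined in terms of these expectations, which are defined in terms of $A_2$, which is in turn defined in terms of $U$, but we show in the paper that there is indeed a utility function satisfying this equation.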
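Moreover, we show in the paper that the expectation $\mathbb{E}[U; a_1]$ is equal to the expectation of $U_N$, conditional on the shutdown command not being given; that is, $\mathbb{E}[U; a_1] = \mathbb{E}[U_N \mid \neg S; a_1]$. Hence, optimizing for $U$ is equivalent to maximizing the conditional expectation of $U_N$:

$$\arg\max_{a_1 \in \mathcal{A}_1} \mathbb{E}[U; a_1] = \arg\max_{a_1 \in \mathcal{A}_1} \mathbb{E}[U_N \mid \neg S; a_1].$$

In other words, when choosing the first action, our agent acts as if it believes that the "shutdown" / "switch utility function" command will definitely not be given; it acts as if it's using the conditional probability distributions $p(\cdot \mid \neg S; a_1)$ instead of the unconditional distributions $p(\cdot\,; a_1)$.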
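The reason is simple. The expected utility $\mathbb{E}[U; a_1]$ can be decomposed into two parts: the expected utility conditioned on $\neg S$ and the one conditioned on $S$. By the definition of $U$, we have

$$\mathbb{E}[U \mid \neg S; a_1] = \mathbb{E}[U_N \mid \neg S; a_1]$$

and, by linearity of the expectation,

$$\mathbb{E}[U \mid S; a_1] = \mathbb{E}[U_S \mid S; a_1] + \mathbb{E}[U_N \mid \neg S; a_1] - \mathbb{E}[U_S \mid S; a_1] = \mathbb{E}[U_N \mid \neg S; a_1].$$

Thus, $\mathbb{E}[U; a_1]$ decomposes into $p(\neg S; a_1)$ times $\mathbb{E}[U_N \mid \neg S; a_1]$, plus $p(S; a_1)$ times $\mathbb{E}[U_N \mid \neg S; a_1]$, again, and this of course directly implies $\mathbb{E}[U; a_1] = \mathbb{E}[U_N \mid \neg S; a_1]$.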
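The decomposition can be checked numerically on a tiny instance. The following is only a sketch: the action names, probabilities, and utility values are made up for illustration, and observations are simplified to pairs `(s, x)` without the random event. It builds the corrected utility function and confirms that its unconditional expectation equals the expectation of the normal utility function conditioned on the button not being pressed.

```python
from fractions import Fraction as F
from itertools import product

A1 = ["cautious", "bold"]                    # hypothetical first actions
A2 = ["work", "halt"]                        # hypothetical second actions
OBS = list(product(["S", "notS"], [0, 1]))   # observations (s, x)

def p(o, a1):
    """Made-up world model p(o; a1)."""
    s, x = o
    p_s = F(1, 4) if a1 == "cautious" else F(1, 2)   # P(button pressed)
    p_x1 = F(1, 3) if s == "S" else F(2, 3)          # P(x = 1 | s)
    return (p_s if s == "S" else 1 - p_s) * (p_x1 if x == 1 else 1 - p_x1)

def u_n(a1, o, a2):
    """Illustrative "normal" utility."""
    return (4 if a1 == "bold" else 3) * o[1] + (2 if a2 == "work" else 0)

def u_s(a1, o, a2):
    """Illustrative "shutdown" utility."""
    return 1 if a2 == "halt" else 0

def second_action(a1, o):
    # The correction term is constant in a2, so maximizing U means maximizing
    # U_S after a press and U_N otherwise. This resolves the circularity.
    base = u_s if o[0] == "S" else u_n
    return max(A2, key=lambda a2: base(a1, o, a2))

def expect(util, a1, s=None):
    """E[util; a1], or E[util | s; a1] when s is given."""
    obs = [o for o in OBS if s is None or o[0] == s]
    mass = sum(p(o, a1) for o in obs)
    return sum(p(o, a1) * util(a1, o, second_action(a1, o)) for o in obs) / mass

def u(a1, o, a2):
    """Corrected utility: U_N without a press, U_S plus a constant with one."""
    if o[0] == "notS":
        return u_n(a1, o, a2)
    return u_s(a1, o, a2) + expect(u_n, a1, "notS") - expect(u_s, a1, "S")

for a1 in A1:
    assert expect(u, a1) == expect(u_n, a1, "notS")
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equality holds exactly rather than up to floating-point error.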
But this is really bad!
Here's why. Suppose that there is a particular action $a_1$ which will cause our agent to (a) buy a lottery ticket, and (b) build a device that will press the shutdown button unless the agent wins the lottery. When evaluating $\mathbb{E}[U; a_1]$, the agent will compute its expected utility given $a_1$, conditioning on the shutdown button not being pressed; in other words, conditioning on winning the lottery.
In other words, our agent acts as if it believes that it has an "infinite improbability drive" that allows it to make arbitrarily unlikely events happen---as long as it can make sure that its shutdown button is pressed if they don't!
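To make the lottery scenario concrete, here is a small sketch with invented numbers (the odds, the prize, and the action names `"honest"` and `"rig"` are all assumptions). Rather than building the full corrected utility function, it uses the identity that the agent's evaluation of a first action equals the expectation of the normal utility conditioned on the button not being pressed, and compares that with the true expectation.

```python
from fractions import Fraction as F

P_WIN = F(1, 1_000_000)   # assumed true lottery odds

OBS = [(s, won) for s in ("S", "notS") for won in (True, False)]

def p(o, a1):
    """Made-up world model p(o; a1), with o = (s, won)."""
    s, won = o
    p_won = P_WIN if won else 1 - P_WIN
    if a1 == "rig":
        # The device presses the button exactly when the ticket loses.
        return p_won if (s == "S") == (not won) else F(0)
    # "honest": the button is pressed independently of the lottery.
    p_s = F(1, 10)
    return p_won * (p_s if s == "S" else 1 - p_s)

def u_n(o):
    """Normal utility: prize if the ticket wins, ticket price lost otherwise."""
    return 100 if o[1] else -1

def value_true(a1):
    """Expectation of U_N under the correct distribution p(.; a1)."""
    return sum(p(o, a1) * u_n(o) for o in OBS)

def value_indifferent(a1):
    """What the agent maximizes: E[U_N | notS; a1]."""
    obs = [o for o in OBS if o[0] == "notS"]
    return sum(p(o, a1) * u_n(o) for o in obs) / sum(p(o, a1) for o in obs)

for a1 in ("honest", "rig"):
    print(a1, float(value_true(a1)), float(value_indifferent(a1)))
```

Under these assumptions both actions have the same (negative) true expected value, but the agent evaluates `"rig"` as if it were certain to win the prize.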
Let's now consider Stuart's version, or at least something that's closer to it than the version in the Corrigibility paper. We'll still consider $S$ to be the event that the shutdown button is pressed, but now additionally there is a highly probable random event $E$, such that the agent is only supposed to switch to the utility function $U_S$ if $S \wedge E$. (In Stuart's story, the shutdown signal goes through a wire, and there is a small probability that the transmission is interrupted and the signal doesn't actually get through; my formal model has the agent observe both $s$ and $e$, so I'll just think of $E$ as some separate random event, which as far as I can tell shouldn't affect the results.)
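We now consider the following revised utility function:

$$U(a_1, o, a_2) = \begin{cases} U_N(a_1, o, a_2) & \text{if } \neg(S \wedge E), \\ U_S(a_1, o, a_2) + \mathbb{E}[U_N \mid S \wedge \neg E; a_1] - \mathbb{E}[U_S \mid S \wedge E; a_1] & \text{if } S \wedge E. \end{cases}$$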
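Now the situation looks a little different. It's straightforward to see that we have $\mathbb{E}[U \mid \neg S; a_1] = \mathbb{E}[U_N \mid \neg S; a_1]$; the more interesting result is that we have $\mathbb{E}[U \mid S; a_1] = \mathbb{E}[U_N \mid S \wedge \neg E; a_1]$. In other words, when choosing its first action, our agent acts as if it maximizes the expectation of $U_N$ with respect to the probability distribution

$$p'(s, e, x; a_1) := \begin{cases} p(\neg S, e, x; a_1) & \text{if } s = \neg S, \\ 0 & \text{if } s = S \text{ and } e = E, \\ p(S; a_1) \cdot p(x \mid S, \neg E; a_1) & \text{if } s = S \text{ and } e = \neg E. \end{cases}$$

This probability distribution "deletes" the probability mass that $p$ places on $S \wedge E$ (the button gets pressed and the signal actually gets through), and "stretches" the probability mass that $p$ places on $S \wedge \neg E$ (the button gets pressed but the signal is blocked) to replace the deleted probability mass.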
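This result is obtained similarly to the one above: We first split $\mathbb{E}[U \mid S; a_1]$ into the two cases $S \wedge \neg E$ and $S \wedge E$. The first of these equals $\mathbb{E}[U_N \mid S \wedge \neg E; a_1]$ by the definition of $U$. For the second one, we use the linearity of the expectation to calculate

$$\mathbb{E}[U \mid S \wedge E; a_1] = \mathbb{E}[U_S \mid S \wedge E; a_1] + \mathbb{E}[U_N \mid S \wedge \neg E; a_1] - \mathbb{E}[U_S \mid S \wedge E; a_1] = \mathbb{E}[U_N \mid S \wedge \neg E; a_1].$$

Hence, $\mathbb{E}[U \mid S; a_1]$ equals $p(\neg E \mid S; a_1)$ times $\mathbb{E}[U_N \mid S \wedge \neg E; a_1]$, plus $p(E \mid S; a_1)$ times $\mathbb{E}[U_N \mid S \wedge \neg E; a_1]$, again, directly implying that $\mathbb{E}[U \mid S; a_1] = \mathbb{E}[U_N \mid S \wedge \neg E; a_1]$.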
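The same kind of numerical check works for the version with the random event. The sketch below uses invented probabilities and utilities, with observations `(s, e, x)`. It builds the revised utility function, builds a distribution `p_prime` that moves the mass of "pressed and signal got through" onto "pressed but signal blocked", and confirms that the expectation of the revised utility under `p` equals the expectation of the normal utility under `p_prime`.

```python
from fractions import Fraction as F
from itertools import product

A2 = ["work", "halt"]
OBS = list(product(["S", "notS"], ["E", "notE"], [0, 1]))   # (s, e, x)

def p(o, a1):
    """Made-up world model; here E happens to be independent of the rest."""
    s, e, x = o
    p_s = F(1, 4) if a1 == "cautious" else F(1, 2)
    p_e = F(9, 10)
    p_x1 = F(1, 3) if s == "S" else F(2, 3)
    return ((p_s if s == "S" else 1 - p_s)
            * (p_e if e == "E" else 1 - p_e)
            * (p_x1 if x == 1 else 1 - p_x1))

def u_n(a1, o, a2):
    return (4 if a1 == "bold" else 3) * o[2] + (2 if a2 == "work" else 0)

def u_s(a1, o, a2):
    return 1 if a2 == "halt" else 0

def switched(o):   # button pressed and signal got through
    return o[0] == "S" and o[1] == "E"

def blocked(o):    # button pressed but signal blocked
    return o[0] == "S" and o[1] == "notE"

def second_action(a1, o):
    base = u_s if switched(o) else u_n
    return max(A2, key=lambda a2: base(a1, o, a2))

def expect(util, a1, cond=lambda o: True, dist=p):
    """Expectation of util given cond, with respect to dist(.; a1)."""
    obs = [o for o in OBS if cond(o)]
    mass = sum(dist(o, a1) for o in obs)
    return sum(dist(o, a1) * util(a1, o, second_action(a1, o)) for o in obs) / mass

def u(a1, o, a2):
    """Revised utility function with the correction term on S-and-E."""
    if not switched(o):
        return u_n(a1, o, a2)
    return u_s(a1, o, a2) + expect(u_n, a1, blocked) - expect(u_s, a1, switched)

def p_prime(o, a1):
    """Distribution the agent effectively uses when choosing a1."""
    if switched(o):
        return F(0)
    if blocked(o):
        p_press = sum(p(q, a1) for q in OBS if q[0] == "S")
        return p_press * p(o, a1) / sum(p(q, a1) for q in OBS if blocked(q))
    return p(o, a1)

for a1 in ("cautious", "bold"):
    assert expect(u, a1) == expect(u_n, a1, dist=p_prime)
```

Note that `p_prime` divides by the probability of the blocked case, so this sketch only applies to actions under which that probability is positive.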
Is this bad?
If the agent is able to influence $E$ as well as $S$, then it can obviously set up an infinite improbability drive: It makes sure that the button definitely gets pressed (i.e., that $p(S; a_1) = 1$), and it sets up a device that will cause $E$ to be true unless the agent wins the lottery. Then, $p'(S \wedge \neg E; a_1) = 1$, and $\neg E$ happens only if the agent wins the lottery; i.e., according to the probability distribution $p'$ the agent (effectively) uses to evaluate expected utility, it definitely wins the lottery.
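So let's now make the assumption that $E$ is independent of everything else: that is, that $p(s, e, x; a_1) = p(e) \cdot p(s, x; a_1)$ (with $p(e)$ not depending on $a_1$), for all $a_1 \in \mathcal{A}_1$, all $s \in \{S, \neg S\}$, all $e \in \{E, \neg E\}$, and all $x \in \mathcal{X}$. Under these assumptions, I can't see a way to set up an infinite improbability drive (though it's not immediately clear to me whether it's provable that there is none).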
However, this leaves the fact that the agent is acting as if it has a probability distribution that is drastically different from the correct one (since it acts as if $S \wedge E$ has probability zero, even though in fact the probability of $E$ is very high).
Since our agent is very smart, and it realizes that the probability distribution it cares about differs significantly from the probability distributions of the humans around it, it might come up with the bright idea of finding an enterprising businessperson, and proposing to them the following deal:
I will ensure that my shutdown button is pressed, i.e., I will make $S$ true. Under this assumption, you and I assign wildly different probabilities to $E$ being true as well, so let's make a bet: If $E$ is true, I will pay you all my current assets; if $E$ is false, you will pay me all of your assets (which are worth many times as much). Since you think $E$ has only a tiny chance of being false, you should think this is a great deal.
The businessperson agrees, and the agent loses all of its money.
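The arithmetic behind the bet can be sketched as follows. All numbers here are hypothetical (the probability that the event is false and both parties' stakes are invented for illustration). The point is only that the same bet has a large positive expected value under the distribution the agent effectively uses, and a negative one under the correct distribution.

```python
from fractions import Fraction as F

EPS = F(1, 1000)          # assumed true probability that E is false
AGENT_ASSETS = F(1)       # hypothetical stake of the agent
PARTNER_ASSETS = F(100)   # hypothetical stake of the businessperson

def agent_gain(p_not_e):
    """Agent's expected gain from the bet if E is false with probability
    p_not_e, given that the button has been pressed."""
    return p_not_e * PARTNER_ASSETS - (1 - p_not_e) * AGENT_ASSETS

true_gain = agent_gain(EPS)        # correct distribution
believed_gain = agent_gain(F(1))   # agent acts as if a press implies not-E

print(float(true_gain), float(believed_gain))
```

With these numbers the businessperson's expected gain is the negative of `true_gain`, so the deal looks attractive to both sides, but only the businessperson is evaluating it with the correct probabilities.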
I think this is still not quite Stuart's thing, since my understanding is that he introduced $E$ in order to make sure that he's never conditioning on a zero-probability event. The version above defines $U$ in a way that conditions on $S \wedge E$ and $S \wedge \neg E$, which could have probability $0$ according to certain actions.
This can be fixed by, e.g., changing the definition of $U$ so that its correction term conditions only on $E$ and $\neg E$, rather than on events involving $S$. But this doesn't fix the problem; it just makes the agent act as if it's maximizing expected utility with respect to $p(\cdot \mid \neg E; a_1)$, which is still not the correct probability distribution, and still gives the agent an incentive to set up exactly the same deal as above.
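For concreteness, here is one definition with the property just described. It is an illustrative reconstruction, not necessarily the exact formula intended in the original post; the auxiliary function $V$ is introduced here only for this sketch. Let $V$ equal $U_S$ when the button is pressed and $U_N$ otherwise, and set

```latex
V(a_1, o, a_2) :=
\begin{cases}
  U_S(a_1, o, a_2) & \text{if } s = S, \\
  U_N(a_1, o, a_2) & \text{if } s = \neg S,
\end{cases}
\qquad
U(a_1, o, a_2) :=
\begin{cases}
  U_N(a_1, o, a_2) & \text{if } e = \neg E, \\
  V(a_1, o, a_2) + \mathbb{E}[U_N \mid \neg E; a_1] - \mathbb{E}[V \mid E; a_1] & \text{if } e = E.
\end{cases}
```

Assuming $E$ and $\neg E$ both have positive probability under every action, the conditional expectations are always well defined. By the same linearity argument as before, $\mathbb{E}[U \mid E; a_1] = \mathbb{E}[U_N \mid \neg E; a_1] = \mathbb{E}[U \mid \neg E; a_1]$, so $\mathbb{E}[U; a_1] = \mathbb{E}[U_N \mid \neg E; a_1]$, while the second action still maximizes $U_S$ exactly when $S \wedge E$ holds.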