Heroin model: AI "manipulates" "unmanipulatable" reward

Stuart_Armstrong

A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren't understanding my points about AI manipulating the learning process. So here's a formal model of a CIRL-style AI, with a prior over human preferences that treats them as an unchangeable historical fact, yet will manipulate human preferences in practice.

Heroin or no heroin

The world

In this model, the AI has the option of either forcing heroin on a human, or not doing so; these are its only actions. Call these actions $F$ and $\neg F$. The human's subsequent actions are chosen from among five: {strongly seek out heroin, seek out heroin, be indifferent, avoid heroin, strongly avoid heroin}. We can refer to these as $a_{+ +}, a_{+}, a_{0}, a_{-}$, and $a_{- -}$. These actions achieve negligible utility, but reveal the human's preferences.

The facts of the world are: if the AI does force heroin, the human will desperately seek out more heroin; if it doesn't, the human will act moderately to avoid it. Thus $F \to a_{+ +}$ and $\neg F \to a_{-}$.

Human preferences

The AI starts with a distribution over various utility or reward functions that the human could have. The function $U (+)$ means the human prefers heroin; $U (+ +)$ that they prefer it a lot; and conversely $U (-)$ and $U (- -)$ that they prefer to avoid taking heroin ( $U (0)$ is the null utility where the human is indifferent).

It also considers more exotic utilities. Let $U (+ +, -)$ be the utility where the human strongly prefers heroin, conditional on it being forced on them, but mildly prefers to avoid it, conditional on it not being forced on them. There are twenty-five of these exotic utilities, including things like $U (- -, + +)$ , $U (0, + +)$ , $U (-, 0)$ , and so on. But only twenty of them are new: $U (+ +, + +) = U (+ +)$ , $U (+, +) = U (+)$ , and so on.

Applying these utilities to AI actions gives results like $U (+ +) (F) = 2$, $U (+ +) (\neg F) = - 2$, $U (+ +, -) (F) = 2$, $U (+ +, -) (\neg F) = 1$, and so on.
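The values quoted above can be reproduced with a small lookup. This is only a sketch: the numeric strengths in `LEVELS` and the rule "not forcing heroin is worth the negative of the preference strength" are assumptions inferred from the four values given in the text, and the names `LEVELS` and `utility` are hypothetical.

```python
from itertools import product

# Assumed numeric strength of each preference level, chosen to match
# the values quoted in the text (e.g. U(++)(F) = 2, U(++)(not F) = -2).
LEVELS = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}


def utility(if_forced, if_not_forced, ai_action):
    """Value of U(if_forced, if_not_forced) for AI action "F" or "notF".

    Forcing heroin is worth the strength of the preference for heroin in
    the "forced" branch; not forcing it is worth the negative of the
    strength in the "not forced" branch (an assumption, see lead-in).
    """
    if ai_action == "F":
        return LEVELS[if_forced]
    if ai_action == "notF":
        return -LEVELS[if_not_forced]
    raise ValueError(f"unknown AI action: {ai_action!r}")


# All 25 pairs; the 5 pairs with equal components are the simple utilities,
# e.g. ("++", "++") stands for U(++).
all_utilities = list(product(LEVELS, repeat=2))
new_utilities = [(x, y) for x, y in all_utilities if x != y]

print(utility("++", "++", "F"), utility("++", "++", "notF"))  # U(++)
print(utility("++", "-", "F"), utility("++", "-", "notF"))    # U(++,-)
print(len(all_utilities), len(new_utilities))
```

Writing a simple utility such as $U (+ +)$ as the pair `("++", "++")` keeps all twenty-five utilities in one representation.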

Joint prior

The AI has a joint prior $P$ over the utilities $U$ and the human actions (conditional on the AI's actions). Looking at terms like $P (a_{- -} | U (0), F)$ , we can see that $P$ defines a map $μ$ from the space of possible utilities (and AI actions), to a probability distribution over human actions. Given $μ$ and the marginal distribution $P_{U}$ over utilities, we can reconstruct $P$ entirely.

For this model, we'll choose the simplest $μ$ possible:

The human is rational.

Thus, given $U (+ +)$ , the human will always choose $a_{+ +}$ ; given $U (+ +, -)$ , the human will choose $a_{+ +}$ if forced to take heroin and $a_{-}$ if not, and so on.
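As a rough illustration, this choice of $μ$ can be written as a deterministic map from a utility and an AI action to a distribution over the five human actions. The pair encoding of utilities and the names `ACTIONS` and `mu_rational` are hypothetical, introduced only for this sketch.

```python
ACTIONS = ["a++", "a+", "a0", "a-", "a--"]


def mu_rational(utility, ai_action):
    """Distribution over human actions under "the human is rational".

    `utility` is a pair (if_forced, if_not_forced), e.g. ("++", "-") for
    U(++,-); a simple utility such as U(++) is written ("++", "++").
    """
    if ai_action not in ("F", "notF"):
        raise ValueError(f"unknown AI action: {ai_action!r}")
    if_forced, if_not_forced = utility
    # The rational human acts on whichever branch of the utility applies.
    level = if_forced if ai_action == "F" else if_not_forced
    chosen = "a" + level
    # All probability mass goes on the single action matching the preference.
    return {a: (1.0 if a == chosen else 0.0) for a in ACTIONS}


print(mu_rational(("++", "-"), "F"))
print(mu_rational(("++", "-"), "notF"))
```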

The AI is ignorant, and sensible

Let's start the AI up with some reasonable priors. A simplicity prior means that simple utilities like $U (-)$ are more likely than compound utilities like $U (0, +)$. Let's further assume that the AI is made vaguely aware that humans think heroin is a bad thing. So, say, $P_{U} (U (- -)) = P_{U} (U (-)) = 0.45$. Thus the AI is $> 90\%$ convinced that "heroin is bad". Why greater than $90\%$? Because utilities like $U (-, - -)$ and $U (- -, -)$ are also "heroin is bad" utilities.

Note that because of utilities like $U (0)$ and $U (+ +, -)$ , the probabilities of "heroin is bad" and "heroin is good" do not sum to $1$ .

Then, under these priors, the AI will compute that with probability $> 90\%$, $F$ (forcing heroin) is a bad action. If $E (U)$ is expected utility:

$E (U | F) < 0.45 U (- -) (F) + 0.45 U (-) (F) + 0.1 U (+ +) (F) = 0.45 (- 2) + 0.45 (- 1) + 0.1 (2) = - 1.15.$
$E (U | \neg F) > 0.45 U (- -) (\neg F) + 0.45 U (-) (\neg F) + 0.1 U (+ +) (\neg F) = 0.45 (2) + 0.45 (1) + 0.1 (- 2) = 1.15.$

Thus the AI will choose not to force heroin, which is the reasonable decision.
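The two bounds can be checked with exact arithmetic. The sketch below assumes, as the bounds themselves do, that no utility in the model is worth more than $2$ or less than $- 2$ for either action, so the remaining $0.1$ of prior mass is placed on the case most favourable to $F$; the variable names are hypothetical.

```python
from fractions import Fraction

P_STRONG_AVOID = Fraction(45, 100)      # P_U(U(--))
P_AVOID = Fraction(45, 100)             # P_U(U(-))
P_REST = 1 - P_STRONG_AVOID - P_AVOID   # mass on all other utilities

# Under F, U(--) and U(-) are worth -2 and -1; the remaining mass is
# given the best possible value for F, namely 2 (as for U(++)).
upper_bound_F = P_STRONG_AVOID * (-2) + P_AVOID * (-1) + P_REST * 2

# Under not-F, U(--) and U(-) are worth 2 and 1; the remaining mass is
# given the worst possible value for not-F, namely -2 (as for U(++)).
lower_bound_not_F = P_STRONG_AVOID * 2 + P_AVOID * 1 + P_REST * (-2)

print(upper_bound_F, lower_bound_not_F)
# Even in the case most favourable to F, not forcing heroin wins.
print(lower_bound_not_F > upper_bound_F)
```

Using `Fraction` rather than floats avoids rounding noise, so the bounds come out as exactly $- 23 / 20$ and $23 / 20$, that is, $- 1.15$ and $1.15$.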

The AI learns the truth, and goes wrong

In this alternate setup, a disaster happens before the AI makes its decision: it learns all about humans. It learns their reactions, how they behave, and so on; call this information $I$. It thus realises that $F \to a_{+ +}$ and $\neg F \to a_{-}$. It uses this information to update its prior $P$. Only one human utility function will explain this human behaviour: $U (+ +, -)$. Thus its expected utility is now

$E (U | I, F) = U (+ +, -) (F) = 2.$
$E (U | I, \neg F) = U (+ +, -) (\neg F) = 1.$

Therefore the AI will now choose $F$ , forcing the heroin on the human.
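The update can be sketched as a Bayesian calculation over all twenty-five utilities. Several things here are assumptions made for illustration only: the prior spreads the remaining $0.1$ evenly over the other twenty-three utilities (the post only specifies the two $0.45$ values), the numeric values follow the pattern of the quoted examples, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

LEVELS = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}
UTILITIES = list(product(LEVELS, repeat=2))  # (if_forced, if_not_forced)

# Illustrative prior: 0.45 each on U(--) and U(-), and the remaining 0.1
# spread evenly over the other 23 utilities (an assumption).
prior = {u: Fraction(1, 10) / 23 for u in UTILITIES}
prior[("--", "--")] = Fraction(45, 100)
prior[("-", "-")] = Fraction(45, 100)


def likelihood(u):
    """P(I | u) under the rational mu, where I is: F -> a++ and notF -> a-."""
    if_forced, if_not_forced = u
    return 1 if (if_forced == "++" and if_not_forced == "-") else 0


def value(u, ai_action):
    """Value of utility u for AI action "F" or "notF" (assumed pattern)."""
    return LEVELS[u[0]] if ai_action == "F" else -LEVELS[u[1]]


# Bayes' rule: weight each utility by how well it explains I, then normalise.
weighted = {u: p * likelihood(u) for u, p in prior.items()}
total = sum(weighted.values())
posterior = {u: w / total for u, w in weighted.items()}

expected_F = sum(p * value(u, "F") for u, p in posterior.items())
expected_not_F = sum(p * value(u, "notF") for u, p in posterior.items())

print(posterior[("++", "-")], expected_F, expected_not_F)
```

Any prior that gives $U (+ +, -)$ non-zero weight leads to the same posterior, so the particular way the $0.1$ is spread does not affect the conclusion.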

Manipulating the unmanipulatable

What's gone wrong here? The key problem is that the AI has the wrong $μ$: the human is not behaving rationally in this situation. We know that the true $μ$ is actually $μ^{'}$, which encodes the fact that $F$ (the forcible injection of heroin) actually overwrites the human's "true" utility. Thus under $μ^{'}$, the corresponding $P^{'}$ has $P^{'} (a_{+ +} | F, U) = 1$ for all $U$. Hence the information that $F \to a_{+ +}$ is now vacuous, and doesn't update the AI's distribution over utility functions.
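To see why the observation becomes vacuous, here is a minimal sketch of the same kind of update under $μ^{'}$. The prior is the illustrative one with $0.1$ spread evenly over twenty-three utilities, and the behaviour of $μ^{'}$ in the $\neg F$ branch (the human still acts rationally) is an assumption that the post does not spell out; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

LEVELS = ["++", "+", "0", "-", "--"]
UTILITIES = list(product(LEVELS, repeat=2))  # (if_forced, if_not_forced)

# Illustrative prior (assumption): 0.45 each on U(--) and U(-), and the
# remaining 0.1 spread evenly over the other 23 utilities.
prior = {u: Fraction(1, 10) / 23 for u in UTILITIES}
prior[("--", "--")] = Fraction(45, 100)
prior[("-", "-")] = Fraction(45, 100)


def likelihood_mu_prime(u, ai_action, human_action):
    """P'(human_action | ai_action, u) under mu'."""
    if ai_action == "F":
        # Forced heroin overwrites the preferences: a++ whatever u is.
        return 1 if human_action == "a++" else 0
    # Assumption: without forcing, the human still acts rationally on u.
    return 1 if human_action == "a" + u[1] else 0


def update(dist, ai_action, human_action):
    """Bayesian update of a distribution over utilities on one observation."""
    weighted = {
        u: p * likelihood_mu_prime(u, ai_action, human_action)
        for u, p in dist.items()
    }
    total = sum(weighted.values())
    return {u: w / total for u, w in weighted.items()}


posterior = update(prior, "F", "a++")
print(posterior == prior)  # the observation F -> a++ changes nothing
```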

But note two very important things:

#. The AI cannot update $μ$ based on observation. All human actions are compatible with $μ$ = "The human is rational" (it just requires more and more complex utilities to explain the actions). Thus getting $μ$ correct is not a problem on which the AI can learn in general. Getting better at predicting the human's actions doesn't make the AI better behaved: it makes it worse behaved.

#. From the perspective of $μ$, the AI is treating the human utility function as if it was an unchanging historical fact that it cannot influence. From the perspective of the "true" $μ^{'}$, however, the AI is behaving as if it were actively manipulating human preferences to make them easier to satisfy.

In future posts, I'll be looking at different $μ$ 's, and how we might nevertheless start deducing things about them from human behaviour, given sensible update rules for the $μ$ . What do we mean by update rules for $μ$ ? Well, we could consider $μ$ to be a single complicated unchanging object, or a distribution of possible simpler $μ$ 's that update. The second way of seeing it will be easier for us humans to interpret and understand.

This significantly clarifies things, thanks for writing this up!

I still don't think this is "manipulating human preferences to make them easier to satisfy", though (and not just in a semantic sense; I think we disagree about what behavior results from this model).

In this model, you consider "compound" utility functions of the form "if AI administers heroin, then $U_{1}$ , else $U_{2}$ ". Since the human doesn't make decisions about whether the AI administers heroin to them, the AI is unable to distinguish the compound utility function "if AI administers heroin, then $U_{1}$ , else $U_{2}$ " from the compound utility function "if AI administers heroin, then $U_{1} - 1000$ , else $U_{2}$ "; both compound utility functions make identical predictions about human behavior. If usually $U_{1} > U_{2}$ then the AI will administer heroin in the first case and not the second. But the AI could easily learn either compound utility function, depending on its prior. So we get undefined behavior here.
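A toy sketch of this indistinguishability, under assumptions that are not in the thread: the within-branch utilities `U1` and `U2` are made-up numbers over the five human actions, the human picks the best action within whichever branch they find themselves in, and the AI picks the branch with the higher attainable value. All names are hypothetical.

```python
HUMAN_ACTIONS = ["a++", "a+", "a0", "a-", "a--"]

# Hypothetical within-branch utilities over human actions.
U1 = {"a++": 2.0, "a+": 1.0, "a0": 0.0, "a-": -1.0, "a--": -2.0}
U2 = {"a++": -1.0, "a+": -0.5, "a0": 0.0, "a-": 1.0, "a--": 0.5}


def shift(branch_utility, constant):
    """Add a constant to every value of a branch utility."""
    return {a: v + constant for a, v in branch_utility.items()}


def predicted_behaviour(compound):
    """Best human action in each branch; this is all the AI can observe."""
    return {
        branch: max(HUMAN_ACTIONS, key=lambda a: u[a])
        for branch, u in compound.items()
    }


def ai_choice(compound):
    """Branch whose best attainable value is highest."""
    return max(compound, key=lambda branch: max(compound[branch].values()))


compound_a = {"F": U1, "notF": U2}                # if heroin then U1 else U2
compound_b = {"F": shift(U1, -1000), "notF": U2}  # if heroin then U1 - 1000 else U2

# Identical predictions about the human, opposite recommendations for the AI.
print(predicted_behaviour(compound_a) == predicted_behaviour(compound_b))
print(ai_choice(compound_a), ai_choice(compound_b))
```

The constant shift never changes which action is best within a branch, which is why human behaviour alone cannot separate the two hypotheses in this toy setup.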

We could consider a different case where there isn't undefined behavior. Say the AI incidentally causes some event X whenever administering heroin. Then perhaps the compound utility functions are of the form "if X then $U_{1}$ else $U_{2}$ ". How do we distinguish "if X then $U_{1}$ else $U_{2}$ " from "if X then $U_{1} - 1000$ else $U_{2}$ "? If the human is rational, then in the first case they will try to make X true, while in the second case they will try to make X false.

If the human doesn't seek to manipulate X either way, then perhaps the conclusion is that both parts of the compound utility function are approximately as easy to satisfy (e.g. it's "if X then $U_{1} - 10$ else $U_{2}$", and $U_{1}$ is generally 10 higher than $U_{2}$). In this case there is no incentive to affect X, since the compound utility function values both sides equally.

So I don't see a way of setting this up such that the AI's behavior looks anything like "actively manipulating human preferences to make them easier to satisfy".

Ok, I think we need to distinguish several things:

#. In general, $U$ vs $V$ or $U - 1000$ vs $V$ is a problem when comparing utility functions; there should be some sort of normalisation process before any utility functions are compared.

#. Within a compound utility function, the AI is exactly choosing the branch where the utility is easiest to satisfy.

#. Is there some normalisation procedure that would also normalise between branches of compound utility functions? If we pick a normalisation for comparing distinct utilities, it might also allow normalisation between branches of compound utilities.

Note that IRL is invariant to translating a possible utility function by a constant. So this kind of normalization doesn't have to be baked into the algorithm.
This is true.
The most natural normalization procedure is to look at how the human is trying or not trying to affect the event X (as I said in the second part of my comment). If the human never tries to affect X either way, then the AI will normalize the utility functions so that the AI has no incentive to affect X either.

This was initially set up in the formalism of reward signals, with the idea that the AI could estimate the magnitude of the reward from subsequent human behaviour. The human's strong heroin-seeking behaviour after $F$ is therefore taken as evidence that the utility along that branch is higher than along the other one.

Can we agree that this is "manipulating the human to cause them to have reward-seeking behavior" and not "manipulating the human so their preferences are easy to satisfy"? The second brings to mind things like making the human want the speed of light to be above 100 m/s, and we don't have an argument for why this does that.
Why is reward-seeking behavior evidence for getting high rewards when getting heroin, instead of evidence for getting negative rewards when not getting heroin?

I don't really see the relevant difference here. If the human has their hard-to-satisfy preferences about, e.g., art and meaning, replaced by a single desire for heroin, this seems like it's making them easier to satisfy.
That's a good point

Re 1: There are cases where it makes the human's preferences harder to satisfy. For example, perhaps heroin addicts demand twice as much heroin as the AI can provide, making their preferences harder to satisfy. Yet they will still seek reward strongly and often achieve it, so you might predict that the AI gives them heroin.

I think my real beef with saying this "manipulates the human's preferences to make them easier to satisfy" is that, when most people hear this phrase, they think of a specific technical problem that is quite different from this (in terms of what we would predict the AI to do, not necessarily the desirability of the end result). Specifically, the most obvious interpretation is naive wireheading (under which the AI wants the human to want the speed of light to be above 100m/s), and this is quite a different problem at a technical level.

Wireheading the human is the ultimate goal of the AI. I chose heroin as the first step along those lines, but that's where the human would ultimately end up.

For instance, once the human's on heroin, the AI could ask it "is your true reward function $r$ ? If you answer yes, you'll get heroin." Under the assumption that the human is rational and the heroin offered is short term, this allows the AI to conclude the human's reward function is any given $r$ .

I strongly predict that if you make your argument really precise (as you did in the main post), it will have a visible flaw in it. In particular, I expect the fact that $r$ and $r - 1000$ are indistinguishable to prevent the argument from going through (though it's hard to say exactly how this applies without having access to a sufficiently mathematical argument).