Contrast these two expressions (hideously mashing C++ and pseudo-code):

  1. $\operatorname{argmax}_a R(a)$,
  2. $\operatorname{argmax}_a \, (*(\&R))(a)$.

The first expression just selects the action that maximises $R(a)$ for some function $R$, intended to be seen as a reward function.

The second expression borrows from the syntax of C++; $\&R$ means the memory address of $R$, while $*(\&R)$ means the object at the memory address of $R$. How is that different from $R$ itself? Well, it's meant to emphasise the ease of the agent wireheading in that scenario: all it has to do is overwrite whatever is written at memory location $\&R$. Then $*(\&R)$ can become $R'$ - whatever the agent wants it to be.
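A minimal Python sketch of the same contrast, using a mutable slot to stand in for the memory address $\&R$ (the action set and reward values are invented purely for illustration):

```python
# Toy illustration: a fixed reward function R versus "whatever function is
# currently stored at the reward location", which the agent can overwrite.

ACTIONS = ["work", "rest"]

def R(action):
    # The intended reward function: only "work" is rewarded.
    return 1.0 if action == "work" else 0.0

reward_slot = {"fn": R}  # stands in for the memory location &R

def choose_platonic():
    # Expression 1: argmax_a R(a) -- R itself is what gets maximised.
    return max(ACTIONS, key=R)

def choose_dereferenced():
    # Expression 2: argmax_a (*(&R))(a) -- the agent maximises whatever is
    # stored in the slot, and overwriting that slot is open to it.
    reward_slot["fn"] = lambda a: 1.0   # wirehead: replace *(&R) with R'
    return max(ACTIONS, key=reward_slot["fn"])

print(choose_platonic())      # -> "work"
print(choose_dereferenced())  # -> every action now scores 1.0
```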

Let's dig a bit deeper into the contrast between reward functions that can be easily wireheaded and those that can't.

The setup

The agent interacts with the environment in a series of timesteps $t$, ending at some final time $N$.

There is a 'reward box' $B$ which takes observations/inputs $o^B$ and outputs some numerical reward amount, given by the voltage, say. The reward function is a function of $o^B$; at timestep $t$, that function is $B_t$. The reward box will thus give out a reward of $B_t(o^B_t)$. Initially, the reward function that $B$ implements is $R$.

The agent also gets a separate set of observations $o_t$; these observations may include full information about $o^B_t$, but need not.
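Here is a toy rendering of this setup in Python; the concrete observation values and the particular function the box implements are invented for illustration:

```python
# Toy model of the setup: a reward box B whose implemented function B_t can
# change over time, fed by its own inputs o^B_t, alongside separate agent
# observations o_t.

def R(o_B):
    # The reward function the box initially implements (an invented choice).
    return float(o_B)

class RewardBox:
    def __init__(self):
        self.implemented_fn = R       # initially, B implements R

    def emit(self, o_B):
        # The box outputs B_t(o^B_t) -- e.g. as a voltage.
        return self.implemented_fn(o_B)

box = RewardBox()
for t in range(3):
    o_B_t = 0.5                       # o^B_t: the box's input (made up)
    o_t = {"box_input": o_B_t}        # o_t: the agent's own observation
    reward_t = box.emit(o_B_t)        # reward received at timestep t
```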

Extending the training distribution

Assume that the agent has had a training phase, for negative values of $t$. And, during that training phase, $B_t$ was always equal to $R$.

If the agent is trained as a reinforcement learning agent, then there are two separate value functions that it can learn to maximise:

  1. $V_R = \sum_t R(o^B_t)$, or
  2. $V_B = \sum_t B_t(o^B_t)$.

Since $B_t = R$ for $t < 0$, which is all the $t$ that the agent has ever seen, both fit the data. The agent has not encountered a situation where it can change the physical behaviour of $B$ to anything other than $R$ - how will it deal with that?
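Continuing the toy setup, a sketch of why the two value functions fit the training data equally well, and only come apart once the agent can rewrite the box (all numbers are invented):

```python
# During training (t < 0), B_t == R, so
#   V_R = sum_t R(o^B_t)   and   V_B = sum_t B_t(o^B_t)
# are indistinguishable.  After deployment, the agent may change B_t.

def R(o_B):
    return float(o_B)

training_inputs = [0.2, 0.7, 0.4]      # o^B_t for t < 0 (made up)
B_training = [R, R, R]                 # B_t == R throughout training

V_R = sum(R(o) for o in training_inputs)
V_B = sum(B(o) for B, o in zip(B_training, training_inputs))
assert V_R == V_B                      # both hypotheses fit the data

# After deployment, the agent rewrites the box so that B_t(o) == 1000.
deployed_inputs = [0.1, 0.1]
B_deployed = [lambda o: 1000.0, lambda o: 1000.0]

V_R += sum(R(o) for o in deployed_inputs)
V_B += sum(B(o) for B, o in zip(B_deployed, deployed_inputs))
# V_B now vastly exceeds V_R: maximising V_B rewards rewriting the box.
```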

Wireheading is in the eye of the beholder

Now, it's tempting to call $V_R$ the true reward, and $V_B$ the wireheaded reward, as the agent redefines the reward function in the reward box.

But that's a judgement call on our part. Suppose that we wanted the agent to maintain a high voltage coming out of $B$, to power a device. Then $V_B$ is the 'true' reward, while $R$ is just information about what the current design of $B$ is.

This is a key insight, and a reason that avoiding wireheading is so hard. 'Wireheading' is not some ontologically fundamental category. It is a judgement on our part: some ways of increasing a reward are legitimate, some are not.

If the agent is to ever agree with our judgement, then we need ways of getting that judgement into the agent, somehow.

Solutions

The agent with a platonic reward

One way of getting the reward into the agent is to formally define it within the agent itself. If the agent knows $R$, and is explicitly trained to maximise it, then it doesn't matter what changes are made to $B$ - the agent will still want to maximise $V_R$. Indeed, in this situation, the reward box $B$ is redundant, except maybe as a training crutch.

What about self-modification, or the agent just rewriting the observations $o^B_t$? Well, if the agent is motivated to maximise the sum of $R(o^B_t)$, then, by the same argument as the standard Omohundro "basic AI drives", it will want to preserve maximising $V_R$ as the goal of its future copies. If we're still worried about this, we could insist that the agent knows about its own code, including $R$, and acts to preserve $R$ in the future[1].
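A sketch of what such a platonic-reward agent might look like in the toy setup; the world model and the effects of each action are invented:

```python
# The agent carries its own copy of R and scores actions by applying R to the
# box inputs it predicts each action will produce; the physical box B never
# enters the calculation, so rewriting B gains the agent nothing.

def R(o_B):
    return float(o_B)

def predicted_box_input(action):
    # Invented world model: the o^B each action is expected to produce.
    return {"work": 0.9, "rest": 0.1, "rewrite_box": 0.1}[action]

def platonic_value(action):
    return R(predicted_box_input(action))

best_action = max(["work", "rest", "rewrite_box"], key=platonic_value)
# best_action == "work"; "rewrite_box" changes B but not R, so it scores low.
```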

The agent modelling the reward

Another approach is to have the agent treat the data from its training phase as information, and try to model the correct reward.

This doesn't require the agent to have access to $R$ from the beginning; the physical $B$ provides data about $R$, and after the training phase, the agent might disassemble $B$ entirely (thus making $B_t$ trivial) in order to get at what $R$ is.

This approach relies on the agent having good priors over the reward function[2], and on these priors being well-grounded. For example, if the agent opens the box and sees a wire splitting at a certain point, it has to be able to interpret this as data on $R$ in a sensible way. In other words, we interpret $B$ as implementing a function between $o^B$ and the reward signal; it's important that the agent, upon disassembling or analysing $B$, interprets it in the same way.

Thus we need the agent to interpret certain features of the environment (eg the wire splitting) in the way that we want it to: we want the agent to have a model of the environment in which the key features are abstracted in the correct way.
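One way to picture the 'good priors over the reward function' requirement is as Bayesian inference over a class of candidate reward functions, with the box's observed behaviour (and its internals, suitably interpreted) as evidence. A minimal sketch, with an invented hypothesis class, invented data, and a simple Gaussian noise model:

```python
import math

# Candidate interpretations of what the box is computing (invented class).
candidates = {
    "R_linear":  lambda o: o,          # the interpretation we intend
    "R_squared": lambda o: o * o,      # an alternative reading of the box
}
prior = {"R_linear": 0.5, "R_squared": 0.5}

# (o^B_t, observed box output) pairs from the training phase (made up).
observed = [(0.2, 0.2), (0.7, 0.7)]

def likelihood(fn, data, noise=0.05):
    # Gaussian observation noise around the candidate's prediction.
    return math.prod(
        math.exp(-((fn(o) - r) ** 2) / (2 * noise ** 2)) for o, r in data
    )

posterior = {name: prior[name] * likelihood(fn, observed)
             for name, fn in candidates.items()}
total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
# The posterior concentrates on R_linear, the interpretation that fits best.
```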

Provide examples of difference

The third approach is to provide examples of divergence between $V_R$ and $V_B$. We could, for example, unplug $B$ for a time, then manually give the agent the rewards $R(o^B_t)$ during that time. This distinguishes the platonic $R$ we want it to calculate from the physical $B_t$ that the reward box implements.

Modifying the $B_t$ during the training phase, while preserving $R$ as the "true" reward, also allows the agent to learn the difference. Since $B$ is physical while $R$ is idealised, it might be a good idea to have the agent explicitly expect noise in the $B$ signal; including the expectation of noise in the agent's algorithm will prevent the agent from overfitting to the actual $B_t$, increasing the probability that it will figure out the idealised $R$.
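A sketch of how such divergence episodes might show up in the training data (every episode detail below is invented); the supervised target is always the manually supplied $R$-value, so the box's output is treated as fallible evidence rather than as the reward itself:

```python
def R(o_B):
    return float(o_B)

# Training records where box output and the "true" reward R(o^B_t) diverge:
# an unplugged-box episode and an episode with a deliberately modified B_t.
records = [
    # (o^B_t, box_output, manual_reward, note)
    (0.4, 0.4,  R(0.4), "normal: B_t == R"),
    (0.6, None, R(0.6), "box unplugged; reward supplied by hand"),
    (0.3, 5.0,  R(0.3), "B_t modified; box output is misleading"),
]

# The agent is trained on (input, manual reward), never on the raw box output,
# so it learns that the box is noisy evidence about R, not R itself.
training_pairs = [(o_B, target) for o_B, _, target, _ in records]
```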

In the subsection above, we treated features of the reward box as information that the agent had to interpret in the desirable way. Here we also want the agent to learn features correctly, but we are teaching it by example rather than by model - exploring the contours of the feature, illustrating when it behaved in a properly informative way, and when it failed to do so.

Extending the problem

All the methods above suffer from the same issue: though they might prevent wireheading within $B$ or $R$, they don't prevent wireheading at the input boundary for $B$, namely the value of $o^B_t$.

As it stands, even the agent with a platonic reward will be motivated to maximise $V_R$ by taking control of the values of $o^B_t$.

But the methods that we have mentioned above allow us to (try and) extend the boundary of wireheading out into the world, and hopefully reduce the problem.

To model this, let $w_t$ be the world at time $t$ (which is a function of the agent's past actions as well as the world's past), and let $\mathcal{F}$ be the set of functions $f$ such that $f(w_t) = R(o^B_t)$ for $t < 0$.

Similarly to how the agent can't distinguish between $R$ and $B_t$ if they were equal for all $t < 0$, the agent can't distinguish whether it should really maximise $f$ or $f'$, if they are both in $\mathcal{F}$.

In a sense, the agent's data for $t < 0$ allows it to do function approximation for its objective when $t \geq 0$. And we want it to learn a good function, not a bad, wireheady one (again, the definition of wireheading depends on our objective).

The easiest and most general $f$ to learn is likely to be the one that outputs the actual physical $B_t(o^B_t)$; and this is also the most likely candidate for leading to wireheading. If we want the agent to learn a different $f$, how can we proceed?
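A toy sketch of $\mathcal{F}$: two candidate functions of the world state agree with $R(o^B_t)$ on all training-phase worlds, and only come apart once the world contains a tampered box (the world states and values are invented):

```python
def R(o_B):
    return float(o_B)

def f_platonic(w):
    return R(w["o_B"])          # apply the idealised R to the box's input

def f_box(w):
    return w["box_output"]      # just read off whatever the box outputs

# Training-phase worlds (t < 0): the box is intact, so both candidates agree.
training_worlds = [
    {"o_B": 0.2, "box_output": 0.2},
    {"o_B": 0.7, "box_output": 0.7},
]
assert all(f_platonic(w) == f_box(w) for w in training_worlds)

# A post-training world with a tampered box separates the candidates:
tampered = {"o_B": 0.1, "box_output": 1000.0}
# f_platonic(tampered) == 0.1, while f_box(tampered) == 1000.0
```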

Platonic

If we can model the whole world (or enough of it), we can present the agent with a labelled model of the whole environment, and maybe specify exactly what gives a true reward and what doesn't.

This is, of course, impossible; but there are variants on this idea which might work. In these variants, the agent runs in a virtual environment and knows the properties of this virtual environment. It is then motivated to achieve goals within that virtual environment, but the second that the virtual environment doesn't behave as expected - the second our abstraction about that environment fails - the agent's motivation changes.
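A sketch of the 'motivation changes when the abstraction fails' idea; the in-environment goal, the invariant check and the fallback are all invented placeholders:

```python
# Reward is defined over a virtual environment whose properties the agent
# knows; if the environment stops satisfying its declared invariants, the
# in-environment reward is withdrawn.

def virtual_reward(state):
    return float(state["coins"])            # invented in-environment goal

def invariants_hold(state):
    # Invented abstraction check: the environment behaves as specified.
    return 0 <= state["coins"] <= 100 and state["physics_ok"]

def effective_reward(state):
    if invariants_hold(state):
        return virtual_reward(state)
    return 0.0   # abstraction failed: this state no longer yields reward
```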

Modelling true reward

It might not be feasible to provide the agent with a full internal model of the environment; but we might be able to do part of the job. If we can give the agent a grounded definition of key features of the environment - what counts as a human, what counts as a coffee cup, what counts as a bin - then we can require the agent to model its reward function as being expressed in a particular way by those features.

Again, this is controlling the abstraction level at which the agent interprets the reward signal. So, if an agent sees a human get drenched in hot coffee and fall into a bin, it will interpret that as just described, rather than seeing it as the movement of atoms out in the world - or as the movement of electrons within its own circuits.
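A sketch of a reward constrained to be expressed over a fixed, grounded feature vocabulary rather than over raw state; the feature extractor and the weights are invented:

```python
from typing import Dict

def extract_features(raw_state) -> Dict[str, float]:
    # Placeholder for a grounded perception module that maps raw state to an
    # agreed feature vocabulary (humans, coffee, bins) -- invented here.
    return {"human_scalded": 0.0, "coffee_delivered": 1.0, "human_in_bin": 0.0}

FEATURE_WEIGHTS = {
    "human_scalded": -10.0,
    "coffee_delivered": 1.0,
    "human_in_bin": -5.0,
}

def feature_reward(raw_state) -> float:
    # The reward only ever sees the abstracted features, never raw atoms
    # (or the agent's own electrons).
    feats = extract_features(raw_state)
    return sum(FEATURE_WEIGHTS[k] * v for k, v in feats.items())
```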

Examples of feature properties

Just as in the previous section, the agent could be taught about key features of the environment by example - by showing examples of good behaviour, of bad behaviour, of when abstractions hold (a human hand covered in hot coffee is still a human hand) and of when they fail (a human hand long detached from its owner is not a human, even in part, and need not be treated in the same way).

There are a lot of ideas as to how these principles could be illustrated, and the more complex the environment and the more powerful the agent, the less likely it is that we have covered all the key examples sufficiently.

Feature information

Humans will not be able to describe all the properties of the key features we want to have as part of the reward function. But this is an area where the agent can ask the humans for more details about them, and get more information. Does it matter if the reward comes as a signal immediately, a delayed signal, or as a letter three weeks from now? When bringing the coffee to the human, what counts as spilling it, and what doesn't? Is this behaviour ok? What about this one?

Unlike issues of values, where it's very easy to get humans confused and uncertain by expanding to new situations, human understanding of the implicit properties of features is more robust - the agent can ask many more questions, and consider many more hypotheticals, without the web of connotations of the feature collapsing.

This is one way we could combat Goodhart problems: by including all of this uncertainty and knowledge. In this case, the agent's definition of the key properties of the key features is whatever a human could have told it about them, had it asked.


  1. This is an area where there is an overlap between issues of wireheading, symbol grounding, and self-modelling. Roughly speaking, we want a well-specified, well-grounded reward function, and an agent that can model itself in the world (including knowing the purpose of its various components), and that can distinguish which features of the world it can legitimately change to increase the reward, and which features it should not change to increase reward.

    So when an agent misbehaves with a "make humans happy" reward, this might be because terms like "humans" and "happy" are incorrectly defined, or not well-grounded, or because the agent wireheads the reward definition. In practice, there is a lot of overlap between all these failure modes, and they cannot necessarily be cleanly distinguished. ↩︎

  2. The platonic case can be seen as modelling, with a trivial prior. ↩︎
