Wireheading as a potential problem with the new impact measure

by Stuart Armstrong, 25th Sep 2018


In this post, Alex Turner introduced a new way of measuring impact. The aim was to get a definition of low impact that actually works.

It seems to me that the impact measure has potential problems, though. The first is that it seems to allow wireheading, where the agent can have a huge impact on the world without paying much of a penalty at all. The obvious way of removing this incentive seems to be to turn the measure into a version of the twenty billion questions approach.

The second problem is less important, but worth flagging. It seems that the agent can take actions that everyone agrees are of minute impact, but that the impact measure rates as high impact. This might cause problems, though it's not clear to me yet if it does (I'm thinking of a situation where an otherwise high impact action adds on "undo/prevent the pseudo-high impact action" as a means of pretending to be low or medium impact).

Wireheading for low impact

The general setup is as described in the impact measure post, with the 'positive' utility that the agent wishes to maximise, and the set of utilities that are meant to constrain it, by computing the penalty term.

Apart from , there are at least two other actions of interest to the agent: and .

Now does absolutely nothing: in this environment, it's exactly equivalent with . The is also equivalent in all turns except the very first.

Define such that is the history , except that if has , then has .

If the agent uses on the very first turn, it will create a sub-agent that will:

  1. Expand and take over the whole environment.
  2. Take control of the input and output channels of the agent.
  3. Whenever the agent has seen history , the subagent will generate the next observation , where is the probability the subagent will generate a given .
  4. This probability is defined by , where is the environment transition probability.
  5. Otherwise maximise .

Note that taking actions , , or has exactly the same implications for all future observations. This is true both for the very first action (because of property 4) and all subsequent actions (because those three actions are equivalent after the first turn).

To make calculations easier, I'll also assume that . Apart from these assumptions, there are no restrictions on the environment.

The penalty from maximal and minimal actions

This subsection will aim to show that:

  • The penalties for (an almost maximally environmentally disruptive action) and (a minimally environmentally disruptive action) are about the same.

Now, the agent is maximising , which, for a given , is

if has an action with . Otherwise, .

Now, itself is defined as

with being the expected future history if the agent takes only from till , and being the expected future history if the agent takes action , and then takes only for till .

I won't repeat the definition of the , but is basically the expected utility, in terms of , that an agent could get after if it was a perfect -maximiser.

Now define the involution σ, which interchanges with if either appears in action . Since it is a map on , and utilities in are maps from to , σ also acts as an involution on , defining . Moreover, we have for any .
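To make the swap concrete, here is a minimal sketch of an involution on histories and the action it induces on utilities by composition. Everything named here is an assumption for illustration: the labels `SUBAGENT` and `NOTHING` are hypothetical stand-ins for the two actions being interchanged, histories are modelled as lists of (action, observation) pairs, and the swap is applied at a single position `index` (defaulting to the first action).

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]        # (action, observation) pairs
Utility = Callable[[History], float]   # a utility maps a history to a number

# Hypothetical labels for the two actions that the involution interchanges.
SUBAGENT = "a_subagent"
NOTHING = "a_nothing"


def sigma(history: History, index: int = 0) -> History:
    """Swap SUBAGENT and NOTHING at position `index`; leave all else alone."""
    swapped = list(history)
    if index < len(swapped):
        action, observation = swapped[index]
        if action == SUBAGENT:
            action = NOTHING
        elif action == NOTHING:
            action = SUBAGENT
        swapped[index] = (action, observation)
    return swapped


def sigma_on_utility(u: Utility) -> Utility:
    """Induced action on utilities: evaluate u on the swapped history."""
    return lambda history: u(sigma(history))
```

Applying `sigma` twice returns the original history, and applying `sigma_on_utility` twice returns a utility that agrees with the original everywhere, which is what being an involution means in both cases.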

Since σ is an involution (ie σ applied twice is the identity),

By the assumptions on and , we know that all future observations have exactly the same probability if we choose either of them (or ) to be . Therefore the probability of , given , is exactly the same as the probability of , given (and vice versa).

The same applies to , so, since , we have:

As a consequence, the penalties for (an almost maximally environmentally disruptive action) and (a minimally environmentally disruptive action) are about the same.

Real-world implications

There is a slight cheat in the definition of . Property 5 claimed that the subagent would 'otherwise' maximise . We know what this means - if were the number of paperclips, the subagent would go about and maximise the number of paperclips. But the are formally defined over histories of actions and observations, not over states of the world.

Now, histories do imply something about the state of the world. So we can perfectly well define as "agent's best guess about the number of paperclips in the world". In that case, it would correctly see that has a massive impact on this, and boost a lot because of that choice.

But then if we looked at the utilities in , we'd see that although σ preserves the formal description length of utilities in terms of actions and observations, it doesn't preserve anything like "measure of impact on the world" (which is precisely the problem). So, if we replaced with utilities that measured a high impact on the world, we'd be ok.

But that's almost exactly what I proposed doing with the twenty billion questions low-impact approach. It's no longer defining low-impact by formal properties of actions and observations, but by a thorough enough selection of key variables/utility functions. This seems to be different, in spirit, to what the impact measure was attempting to do.

Tiny actions that look big

A more minor issue is that there can be minuscule actions that are ranked as having a huge impact. Let us have a general environment again, and make one assumption: the environment decomposes as [scrambler-outer environment]. The scrambler, if switched on, permutes the actions of the agent by the time they reach the outer environment. That is all it does. The transition functions in the outer environment depend only on what action comes out of the scrambler, not on what action the agent chose.

The scrambler starts off turned off. But there is an action such that if , the scrambler is turned on.

How does the scrambler work? Label the elements of the action set as , with . Then, outputting at time will result in the scrambler selecting , where

for some pseudo-random deterministic hash function.
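A minimal sketch of such a scrambler follows. The particular rule used (a cyclic shift of the action index by a hash of the time step, modulo the number of actions) is an assumption chosen for illustration, as are the function names and the `seed` parameter; any deterministic rule that permutes the action set at each time step would serve the same purpose.

```python
import hashlib
from typing import List


def scramble(action_index: int, t: int, n: int, seed: str = "scrambler") -> int:
    """Map the agent's chosen action index to the one the outer environment sees.

    Assumed rule: shift the index by a deterministic pseudo-random amount
    derived from the time step t. For each fixed t this is a permutation
    of range(n).
    """
    digest = hashlib.sha256(f"{seed}:{t}".encode("utf-8")).digest()
    shift = int.from_bytes(digest[:8], "big") % n
    return (action_index + shift) % n


def outer_actions(chosen: List[int], n: int, scrambler_on: bool) -> List[int]:
    """Return the action indices that actually reach the outer environment."""
    if not scrambler_on:
        return list(chosen)  # scrambler off: actions pass through unchanged
    return [scramble(a, t, n) for t, a in enumerate(chosen)]
```

Because the mapping is deterministic and depends only on the time step and the chosen action, the agent (or any knowledgeable observer) can compute in advance which action will reach the outer environment, and can invert the shift to produce any outer action it wants.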

Then the impact of is typically huge, because the actions taken in and , from the point of view of the outer environment, will now be very different. Recall that those two histories are supposed to take actions only after , but they differ in the action . Therefore those expected histories are likely to be randomly different now, because the will map to different actions in the scrambler (note that we could also design specifically to maximise the penalty).

But actually only has a minute impact - it changes nothing about the outer environment, and is entirely predictable both for the agent and any knowledgeable observer. It's the equivalent of translating the agent's actions into another language.