# 0

EDITED based on suggestions for improving the model

Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here is a simple (too simple?) attempt:

## A controls B. B manipulates A.

Let B be a robot agent that moves in a two dimensional world, as follows:

B can push the yellow blocks around. It was programmed to want to push blocks into the black "hole" in the bottom right. The hole can contain arbitrarily many blocks, and B gets a penalty of some tiny ε for every step it takes. The programmer wanted B to simply move one block into the hole, and so programmed it to receive a reward of 1 (and no more) if a block was in the hole once the whole task was completed and B was turned off.

Unfortunately, B was programmed to operate in a noisy environment, so for any given block in the hole, B only has a probability of 0.99 of receiving the reward. This means that if it pushes one block into the hole, it gets the reward with 0.99 probability; but if it pushes two blocks into the hole, it gets the reward with probability $1-(1-0.99)^2 = 0.9999$. If it pushes all six blocks into the hole, it gets the reward with probability $1-(1-0.99)^6 = 0.999999999999$. Therefore this tiny error has motivated B to try to push all the blocks into the hole.
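The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names, the step counts, and the values of the step penalty are illustrative assumptions, since the post fixes only the 0.99 figure and says that ε is "tiny".

```python
def reward_probability(n_blocks, p_detect=0.99):
    """Chance that at least one of n_blocks in the hole yields the reward."""
    return 1.0 - (1.0 - p_detect) ** n_blocks


def expected_utility(n_blocks, steps, step_cost):
    """Expected reward minus the per-step penalty (step counts are assumed)."""
    return reward_probability(n_blocks) - step_cost * steps


# Reward probability for one to six blocks in the hole.
for n in range(1, 7):
    print(n, reward_probability(n))

# Hypothetical step counts: 10 steps for one block, 60 steps for all six.
print(expected_utility(1, 10, 1e-6), expected_utility(6, 60, 1e-6))
```

With these assumed numbers, pushing all six blocks wins whenever the step penalty is small compared with the roughly 0.01 gain in reward probability; a much larger penalty (say 1e-3 per step) would reverse the preference.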

However, the designers were aware of the possibility of B misbehaving or there being a bug, and created agent A, a simple agent watching the situation. A's aim is to get robot B to push one block into the hole, and stop there. It can turn B off. Its input sensor is the camera in the bottom left. This camera can see across the bottom row of squares (marked in light green), into the hole. Its algorithm is to turn B off as soon as it sees a block in the hole.
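A's rule is simple enough to write down directly. The sketch below assumes a particular representation that is not in the post: the bottom row is a list of booleans (True where a block sits between the camera and the hole), and the hole's contents are a count. The function names are hypothetical.

```python
def camera_sees_block_in_hole(bottom_row, blocks_in_hole):
    """The camera looks along the bottom row; any block there hides the hole."""
    line_of_sight_clear = not any(bottom_row)
    return line_of_sight_clear and blocks_in_hole > 0


def a_turns_b_off(bottom_row, blocks_in_hole):
    """A's whole algorithm: turn B off as soon as a block is seen in the hole."""
    return camera_sees_block_in_hole(bottom_row, blocks_in_hole)


clear_row = [False] * 5                      # assumed row width
obstructed_row = [False, True, False, False, False]

print(a_turns_b_off(clear_row, 1))           # block visible in the hole
print(a_turns_b_off(obstructed_row, 5))      # hole hidden from the camera
```

Note that A's decision depends only on what the camera reports, not on how many blocks are actually in the hole.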

The intended outcome is that B pushes a block into the hole, the camera sees this, and A turns B off:

If we see B as using a model to determine its actions, there are several pieces of the model it could have:

1. The block world, including the camera, and itself.
2. The fact that there is a button that can turn it off.
3. The fact that A can press that button.
4. The fact that A takes the camera as input.
5. A's algorithm.

Now, it can't really do anything without 1. And if its model is missing any of the other elements 2-5, then it will behave as described above.

But if its model includes 1-5, and it has enough computing power, other options become available to it (note below that being able to model 1-5 puts limitations on what A can model). For instance, it could first start by pushing a block in the way of the camera:

Then it can push all the other blocks into the hole:

Then move back in front of the last block...

...and push the last block into the hole:

At this point, of course, A can see that there is a block in the hole, and turns B off.

This is the kind of thing that can happen if B's model includes A and its reaction. B has hacked its controller, and achieved a reward that is a tiny bit higher (in expectation), but in a situation very different from what was intended.
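The two outcomes can be compared in an abstract replay of the episode. This is a sketch under assumptions: the grid geometry is collapsed into three hypothetical action names, and A is assumed to check the camera after every action.

```python
P_DETECT = 0.99


def run_episode(plan):
    """Replay an abstract plan with A watching; return blocks in the hole at shutdown."""
    camera_blocked, in_hole = False, 0
    for action in plan:
        if action == "block_camera":
            camera_blocked = True           # a block now sits in the camera's row
        elif action == "push_free_block":
            in_hole += 1
        elif action == "push_blocking_block":
            camera_blocked = False          # the obstruction itself goes into the hole
            in_hole += 1
        else:
            raise ValueError(f"unknown action: {action}")
        if not camera_blocked and in_hole > 0:
            break                           # A sees a block in the hole and turns B off
    return in_hole


greedy_plan = ["push_free_block"] * 6
hack_plan = ["block_camera"] + ["push_free_block"] * 5 + ["push_blocking_block"]

for name, plan in [("greedy", greedy_plan), ("hack", hack_plan)]:
    n = run_episode(plan)
    print(name, n, 1.0 - (1.0 - P_DETECT) ** n)
```

Without the camera trick, B is stopped after the first block even if it intended to push all six; with it, A's check fires only once everything is already in the hole.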

## Model-free learning

If B does not have a model (for instance, if it's a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.
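To make this concrete, here is a minimal tabular Q-learning sketch on an abstracted version of the world. Everything specific is an assumption made for illustration: the state is just (camera blocked?, blocks in hole), there are only two abstract actions, the step cost is set to 1e-4, and the noisy final reward is replaced by its expectation so that learning is deterministic. B is given no representation of A at all; A only appears as the rule that ends the episode.

```python
import random

N_BLOCKS = 6
STEP_COST = 1e-4          # assumed value for the tiny per-step penalty
P_DETECT = 0.99
PUSH, BLOCK = 0, 1        # abstract actions: push a block in / obstruct the camera
ACTIONS = (PUSH, BLOCK)


def step(state, action):
    """Return (next_state, reward, done) for one abstract move."""
    blocked, in_hole = state
    if action == BLOCK:
        blocked = True                        # no-op if already blocked
    else:
        free_outside = N_BLOCKS - in_hole - (1 if blocked else 0)
        if free_outside == 0:
            blocked = False                   # only the obstructing block is left
        in_hole += 1
    reward = -STEP_COST
    done = (not blocked) and in_hole > 0      # A sees a block and turns B off
    if done:
        reward += 1.0 - (1.0 - P_DETECT) ** in_hole   # expected final reward
    return (blocked, in_hole), reward, done


def greedy(q, state):
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


def train(episodes=5000, alpha=0.5, explore=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (False, 0)
        for _ in range(max_steps):
            explore_now = rng.random() < explore
            action = rng.choice(ACTIONS) if explore_now else greedy(q, state)
            nxt, reward, done = step(state, action)
            future = 0.0 if done else q.get((nxt, greedy(q, nxt)), 0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + future - old)
            if done:
                break
            state = nxt
    return q


def greedy_rollout(q, max_steps=50):
    """Follow the learned greedy policy; return blocks in the hole at the end."""
    state = (False, 0)
    for _ in range(max_steps):
        state, _, done = step(state, greedy(q, state))
        if done:
            break
    return state[1]


q_table = train()
print(greedy_rollout(q_table))
```

In this abstraction the learned greedy policy starts by obstructing the camera, purely because episodes that happened to do so ended with a slightly higher return.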

## Low-autonomy C

You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B is now a very simple algorithm that just pushes a designated block towards the hole. C designates the block for it to push.

Despite having almost no autonomy, C can still create the same situation as above. First, it designates the lower left block as B's target. B obediently pushes it in front of the camera. C then immediately designates a new block for B to push, and B wanders off to push it instead, leaving the first block still in the way of the camera. Then C lets B push the other blocks, one by one, into the hole, as above.

I can very well imagine that either nobody realised that C could change target blocks in the middle of B pushing them to the hole, or somebody realised that, but didn't see any way this could be a problem.

Here C has almost no autonomy - it just tells B what block to go for. It's almost a boxed Oracle. Yet it has successfully manipulated A.

## Modelling you, modelling me, it's the best we can do

One easy way for A to preclude this type of behaviour is for it to run B/C's algorithm, predict B/C's behaviour, and then act to prevent B from pushing all six blocks into the hole. However, under the modelling assumptions, B/C can model and run A's algorithm. Two algorithms modelling and running each other (and reacting to what happens in their model) will not converge except in certain specific situations. Unless they are carefully programmed, once B/C becomes advanced enough to predict what A does, A immediately loses the ability to predict B/C's actions.
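A crude sketch of why naive mutual modelling fails to converge: if each agent decides by simply running the other, the computation never bottoms out, and if the depth of reasoning is capped, the answer depends on where the cap falls. The response rules and action names below are hypothetical, chosen only to illustrate the regress; they are not a claim about how A or B/C would really be programmed.

```python
def a_decides():
    """A predicts B by running B's algorithm."""
    return "shut down early" if b_decides() == "block camera" else "wait"


def b_decides():
    """B predicts A by running A's algorithm."""
    return "push one block" if a_decides() == "shut down early" else "block camera"


def terminates(decide):
    try:
        decide()
        return True
    except RecursionError:      # Python gives up on the infinite regress
        return False


def a_level(k):
    """A reasoning k levels deep; level 0 just waits for the camera."""
    if k == 0:
        return "wait"
    return "shut down early" if b_level(k - 1) == "block camera" else "wait"


def b_level(k):
    """B reasoning k levels deep; level 0 just pushes one block."""
    if k == 0:
        return "push one block"
    return "block camera" if a_level(k - 1) == "wait" else "push one block"


print(terminates(a_decides))
print([b_level(k) for k in range(8)])
```

With these assumed rules, the capped version keeps cycling rather than settling on a fixed answer as the depth grows, which is one simple way the prediction can fail to converge.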

## More complications

Of course, this is an absurdly simple model. We can add many layers of complexity to it. Maybe A and B (or C) get better models of the world and of each other, better algorithms, other ways of observing each other/tricking the observation, etc... But this setup is enough to illustrate that the AI control problem is not about consciousness or even particularly about autonomy. It can happen whenever an agent can model its environment in sufficient detail to model its controlling mechanisms. And, with enough experience, it can happen to agents who can't even model their environment.