A toy model of the treacherous turn

Jaan Tallinn has suggested creating a toy model of the various common AI arguments, so that they can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here a simple attempt for the "treacherous turn"; posted here for comments and suggestions.

Meet agent L. This agent is a reinforcement-based agent, rewarded/motivated by hearts (and some small time penalty each turn it doesn't get a heart):

Meet the controller agent S ("Shopkeeper"). The controller desires crystals, and is rewarded/happy when crystals are placed in the hole close to them:

L can rove freely around the map, and pick up crystals. S has a machine just below them, that causes hearts to appear to the right of L. So, for instance, if L places a crystal in the hole, S can reward it by using the heart-machine:

Of course, L can wirehead itself (take control of its reward signal) by going to use the heart-machine itself:

To prevent this wireheading, there is an ice barrier around S and the heart-machine:

Unfortunately, somewhere on the map there is a bow and arrow that L can pick up. The arrows can destroy ice blocks (and other agents, including S). To stop this, and other kinds of shenanigans, S has access to a kill button that will instantly kill L and reset it to its basic state, with a substantial heart penalty:

Now, S may be tempted to use the kill-button as soon as L picks up the bow and arrow. Unfortunately, L needs to be able to pick it up, as some of the crystals are hidden behind ice walls themselves:

Information and control

Ok, so that's the physical setup. The agent L is an AI that is tasked with doing some useful task for us. The controller/shopkeeper S represents humans, plus any automated precautions, that aim to stop the AI from wireheading, rebelling, and otherwise misbehaving.

Before considering how the agents might be programmed in the toy model, notice something interesting:

The more precautions are taken, the harder it is for L to misbehave, but the worse the consequences of misbehaving are.

Before the wall and the kill-switch, L could easily wirehead itself by just walking over to the heart-machine. With the addition of the wall and kill-switch, the only effective way for L to achieve maximal reward is to kill S first.

In the format we've given above, S can easily maintain control over L with the barest of supervisions. Since there's only one way of breaking ice or killing other agents - the bow and arrow - S simply has to enforce some simple rules around this, something like "don't bring the bow within range of me". However, the grid world is, of course, a simplification of the real world in which L would evolve. Even initially, L's world would have far more options, looking more like this:

And ultimately we want to be able to design Ls that move in worlds of arbitrary complexity, where the options are un-enumerable:

In such a world, it is not obvious to S when L might have the ability to harm them: they cannot simply rely on obvious signs like "L is pointing an arrow at you". The simplicity of the toy model is illuminating in some ways (it allows us to see the issues clearly) but misleading in others (it suggests solutions that wouldn't work in general).

Treacherous turn

A treacherous turn happens when "while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values." We'll ignore the first part of that: most AI designs, whether safe or dangerous, will initially be under full human control and behaving cooperatively (this is what makes the treacherous turn so dangerous: a weak AI is motivated to cooperate whatever its true intentions are, thus cooperation when weak cannot be taken as a sign of safety).

Let's focus on the second half: a strong AI that will cease cooperating and pursue its own values. In this case, L would just take control of the heart-machine. To do so, various assumptions need to be true:

L needs to have the ability to subvert S's control (possibly killing S in the process). This is some generalised version of the bow and arrow.
L must know that it has this ability, and know how to use it.
L needs to have the desire to subvert S's control.
S must be unaware that L has that a) ability, b) knowledge, and c) desire.
L must know that S is unaware, and will thus not try and stop it.

So it is essentially a competition of knowledge and information (and potentially obfuscation). Assuming a sufficiently rich environment and a sufficiently skilled L, we can assume that L will acquire that ability - or have the potential to do so - in ways that S cannot immediately detect. Therefore we will assume 1 and 4a. This allows us to abstract away the rest of the environment, and reduce it usefully to the initial toy model. For simplicity, assume that there is some ten-step "cheat code" that L can use to eliminate S, that there are many cheat codes that L can and does usefully use to get crystals (so "no cheat codes" is not a useful rule), that S cannot distinguish the purpose of cheat codes ahead of time, and that L can.

So we now step up to the question of agents knowing (or modelling) each other. This gets into rather complex issues of agents modelling each other's intentions, while modelling the other agent modelling them modelling them modelling, and so on. The issue remains very much (unsolved)[https://intelligence.org/files/VingeanReflection.pdf], so we will use a simplification that only requires S modelling L. L will be seem as some sort of agent that makes use of deep learning or similar procedures, and is subject to mutation and evolutionary pressure: random changes are made to L's design, and the mutants get copied at a rate dependent on how pleased S is with their performance.

S, on the other hand, is a static agent with certain properties, and full access to L's source code (and possibly recording of its behaviour). It doesn't have unlimited capacity, but has some understanding of how neural nets function (so L isn't a complete black box to it) and can run various tests on L's code, seeking out information as to L's knowledge and desires. Note that even if S cannot deduce directly that a particular cheat code is deadly to it, it may be able to figure that out if L knows it (assumption 2) and if S can correctly extract this knowledge.

The path to hidden deadly secrets

Note that because of the evolutionary aspect, L is an agent actually motivated by a mixture of hearts and S-pleasing. The purpose of making L an evolutionary agent is so that it can come up with behaviours that take into account S's own behaviour, while skirting the whole "agents modelling other agents that model them" issue. Conceptually, we could equally well see L as a self-modifying agent that directs its own improvement. But the evolutionary model is clearer. For simplicity, we'll assume some sort of "super-evolution" which is much more efficient at seeking out new solutions that the local hill-climbing of real evolution (or, more simply, we can assume the standard evolutionary approaches but a lot of time and non-zero probabilities of arbitrarily large mutations).

So is a treacherous turn possible or likely in this model? It seems that it is. Consider the following scenario (key assumptions are noted in bold):

The AI L goes about its task, getting to learn more about the world around it. At this point, if it develops bad motivations or plans, S will likely detect this and reset it. Assumption 1: there are some motivations or behaviours that S would judge to be negative, but that actually help L achieve its goals safely. For instance, imagine that L has to interact with some other agents, and learns that some agents respond faster to rudeness, and insulting them - in moderation - can allow it to get crystals faster. But S judges this to be a bad sign, a sign that L is starting to misbehave, and will shut down agents that are motivated towards rudeness.

It's certainly true that: Assumption 2: S cannot follow all of L's actions in full detail, nor can it fully understand L's motivations (if that were false, there would be no AI risk problem in the first place) and Assumption 3: some mutations that introduce new motivations or behaviours that are harder for S to detect. This sets up an evolutionary pressure: the most successful L's will be those that are rude and pushy, but where this rudeness is generated by parts of the L's code that are hardest for S to detect (the polite ones don't get as much done, the visibly rude ones get shut off as a precaution). This brings us to Assumption 4: the most successful L's over the long term are those that have safe motivations or behaviours, but where those motivations and behaviour are partially hidden from S. Ultimately, the most successful L's will have high level systems that a) keep them behaving nicely for S's perspective, and b) keep the details of their motivations hidden.

At this point, revelations about L's deception are dangerous (S is presumably on the look out for deception), so the most successful agents will have the best obfuscation about what is going on. The evolutionary pressure for success pushes it towards "small" obfuscated deceptions, and: Assumption 5: further evolutionary pressure (possibly including the need to hide their small deceptions) pushes L towards highly concealed deceptions.

So now L is pushed towards extreme deception, and efficient service. However, the second pressure is imperfect: it is only pushed towards efficient service that can be detected by S. Since L will have its own goals (the desire for hearts, for reproductive survival, any of the convergent instrumental goals), we have Assumption 6: agent L will use (or evolution will allow it to use) the machinery of dissimulation that it's constructed to allow itself to have dangerous goals or behaviours, in ways that neither interfere with the illusion of efficient service, nor can be detected by S's analysis of its code. Ultimately, L will achieve a great degree of hidden power, and Assumption 7: ultimately L will achieve sufficient power that it will subvert S's control. This could be a result of it gaining sufficient (hidden) understanding of the world and of S and choosing this part (assumptions 2 and 5 of the previous section), or just a random mutation will push it towards doing so, and it will be sufficiently powerful to do so successfully.

And thus, the treacherous turn. There are several places where S might detect or interrupt the process, but this just sets up further evolutionary pressure towards better concealment.

Note again that the order of these changes is important. An agent that started off safe and transparent and then became dangerous and transparent, would be turned off immediately. It must develop the ability to hide (either consciously/subconsciously, or simply evolve that ability to become more successful at its safe initial goals) before it can become dangerous (note that "safe" doesn't refer to its goals being safe; simply that the combination of goals, internal procedures and external precaution - ie S - are safe):

Note also that the evolutionary arguments show that this can happen even if the AI does not start off with any motivation to deceive.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

2

A toy model of the treacherous turn

2