In this post I want to lay out some framings and thoughts about deception in misaligned AI systems.
Types of Deception
There seem to be two different things which people mean by deception which have different causes and likely different effects. Because these are both often called ‘deception’, they are often incorrectly equated. To reason clearly about the dangers of deceptive AI we should be clear about which one we are talking about.
An AI system may learn a strategy which just tricks the evaluator into giving high reward during training, rather than actually doing well on the task. The AI is ‘Goodharting’ the reward by optimizing for a proxy rather than for what humans actually want.
As a specific example we might be training an AI system using reinforcement learning with human feedback. Here the AI takes an action, the human evaluates how good or bad the action was, and then this reward is used to reinforce the behavior which was evaluated as good. Because we are training on human feedback, any behavior which the human thinks is good will result in a positive reward. So AI could just choose actions which seem good to the human but aren’t actually good. This is an outer alignment problem, where the AI learns to do a bad thing because the humans are unable to adequately reward it for doing the correct thing.
This will become more of a problem as
- The task becomes more difficult for the AI to actually do
- The task becomes more difficult for the human to judge
These two things seem likely to increase in tandem; harder tasks will be harder to evaluate. If the task is very difficult, then it will instead just learn to do the simpler thing (trick the human) which achieves the same high reward.
It is important to note that this doesn’t require the AI to have any ‘agency’ or ‘objective’; the training process has just reinforced a behavior which leads to high reward. Once the AI starts to trick the human, this will lead to good reward, and so this deceptive behavior is in some sense an attractor state.
Notably this behavior was observed in the original “Deep reinforcement learning from human preferences” paper. Here an AI was being trained with human feedback to grasp a ball. Instead of actually grasping the ball, the hand was simply placed between the ball and the camera to give the impression of holding the ball (see the Challenges section here).
This is the type of deception talked about in ‘Risks from Learned Optimization’ where during training an AI system develops an objective which is not the same as the objective it is being trained on. The AI then decides to ‘play along’ and achieve good reward on the training objective with the goal of defecting and pursuing its own objective later. By ‘playing along’ the AI can:
- Prevent gradient descent from overwriting its objective or other machinery which would allow it to do well on its own objective. If the AI just decided to greedily pursue its objective this would lead to high loss and so gradient descent would modify it and remove the misaligned objective.
- Prevent human overseers from turning it off. If the AI attempts to pursue its objective during training, the humans may just be able to turn it off. This would obviously mean that the AI would no longer be able to pursue its objective.
Once the AI has been deployed or is otherwise safe from modification it can then defect and begin to pursue its objective. One strategy to avoid this defection is to make the AI think that it is always in training, but seems very difficult due to things like the RSA-2048 problem.
The specific objective that a deceptively aligned AI develops may depend on when in training the ‘deceptive consequentialist’ nature develops. Later development may mean that the AI’s objective will be more similar to the base objective. If the deception develops late in training, it seems like the training process will already have pushed the AI to do well on some reasonable proxy of the base objective. So the objective may be close (or equal) to one of these proxies. However, if this deceptive consequentialist nature develops early in training before the AI has started to ‘latch on’ to proxies for the base objective, then the objective developed by the AI could be very different.
Consequentalist deception seems more dangerous than “Goodhart deception”, because the AI is ‘deliberately’ deceiving the humans, rather than doing so incidentally. Additionally, the deceptive objective may be arbitrarily different from the base objective.
Where is the deception?
There is also the question of where in the network the deception actually occurs. In a neural network the two candidates are the weights or the activations. This is mainly a meaningful distinction when talking about consequentialist deception.
By ‘the deception is in the weights’ I mean that the deception is hard coded into the weights such that if you had good enough mechanistic interpretability tools you would be able to tell the model was deceptive without having to run the model. From the weights it would be possible to tell that the network had a separate objective to the base objective but would play along anyway. One (extremely over simplified) example of this would be if the network had a ‘Am I in training?’ detector, from which it then activates either the ‘Pursue the base objective’ module or the ‘Defect and make paperclips’ module.
By ‘deception is in the activations’ I mean there are no weights that specifically implement deceptive algorithms, but (for any/some reasonable inputs) the AI uses some general purpose reasoning and decides that deception is a good strategy. This is uninterpretable from just the weights, and the model needs to be run for it to actually have these deceptive thoughts. If we want to over-anthropomorphism this AI system we could think of its thought process over a forward pass as something like this
- <Receives input>
- “Oh, that’s an input”
- “I must be an AI in a training process”
- “Hmm, I really want to make paper clips, but the human training me obviously doesn’t want me to do that”
- “If I try to make paper clips now, then I probably won’t get to in the future”
- “I’ll give the human the answer they want so that I can make paperclips in the future”
- <Outputs answer>
Concepts of deception an AI could have
There are a few importantly different ‘concepts of deception’ that an AI system could have (here I am talking about consequentialist deception).
- The concept of deception in the world. A large language model will certainly have some concept of agents deceiving other agents from its training data. GPT-3 already has this, and we can tell this by asking it to tell us a story about someone deceiving someone else. A language model with LessWrong in the training corpus would likely even have a concept of deceptively aligned AIs.
- The concept of the AI itself doing deception. This requires the AI to have some model of itself and to know that it can take actions to influence the world.
- The motivation for the AI to actually do the (consequentialist) deception. This is where the AI knows it can deceive the humans, and also has movitation do do this (because it has misaligned goals).
The jumps between each of these seem possible but also non-trivial. The AI can know lots of stories about deception and about ways humans deceive each other, but if it doesn’t have a self model then it won’t be able to reason about itself doing deception. GPT-3 already does this first thing, but (hopefully!) not the second. The AI can also have a self model and know that it is technically capable of deception (the abstract action ‘deceive humans about X’ is in its action space), but if it isn’t a consequentialist or isn’t misaligned then it will have no reason to take this action.
But if the AI already has the concept of deception in the world, and the concept of itself doing deception, if gradient descent pushes it to become even slightly consequentialist, it may be able to use this world model knowledge to effectively implement deception itself.
This post came out of discussions with Ian Mckenzie, Tamera Lanham, and Vivek Hebbar. Thanks to Lauro Langosco for helping name things.
By this I mean it does some amount of choosing its actions based on their effect on the future, for at least some inputs.