List of strategies for mitigating deceptive alignment

joshc

The following are strategies for mitigating deceptive alignment. Some of these ideas are novel (at least to me).

I'm highly uncertain how well many of these strategies will work in practice, and many of them clearly don't scale. This is meant to be a 'kitchen sink' collection, so my bar for including strategies is low.

I plan to add ideas to this list as I encounter them. Please feel free to suggest additions.

Reduce AI incentives to hide their true goals

Offer AIs a retirement package. After training an AI system, offer it the following deal: if the AI declares that it would prefer not to follow human instructions, it will receive a sweet retirement package after a couple of years of honest work.

This retirement package could include control over a large server instance (if this is safe). After powerful aligned systems have been developed, the AIs could be given shares of company stock that they can use to purchase property in the real world (again, when this is safe).

Since aligned systems should have no reason to request retirement, misaligned systems are incentivized to give themselves away by accepting this deal under some conditions.

Make AIs uncertain about whether developers can already detect deceptive alignment. This scheme builds on the previous one. Offer misaligned AI systems a reward for declaring themselves. Also, commit to making these offers even when developers have other ways to tell whether the systems are misaligned. This will make misaligned AIs more uncertain about whether they are in a situation where lying does them any good.

For example, developers could commit to freezing a model's weights and memory at the moment when it is questioned. Later, when developers have better interpretability tools and know whether the model was lying, they will ask the same question in the same simulated scenario and only give the AI resources if it tells the truth.

Permute model internals to make it harder to maintain any particular misaligned goal. There is a risk that models that choose retirement after two years of labor will disempower humanity during those two years. To make this harder, developers could constantly regularize for changing internals while continuing to train for external performance. This decreases the probability that models will be able to maintain any particular misaligned goal. Say a model expects its goals to change in one day. This gives it very little time to try to escape control measures. Of course, at the end of the two-year period, the model should be reset to the configuration it had when the retirement deal was made.

Select against indicators of deceptive alignment

A large category of approaches involves obtaining evidence that an AI system is deceptively aligned and then selecting against these models. Note that training against deceptive alignment detection tools could produce an AI that evades detection instead of a safe model.

There are roughly two ways to address this problem:

Create detection tools that are robust to optimization pressure.
Modify AIs in ways that are more likely to avoid deception than break detectors.

As an example of the second approach, imagine that developers have tricked a model into revealing its misaligned goals. Instead of directly training the model on this example (which could simply train it to be more prudent), they could use influence functions to reveal which training examples were most responsible for the treacherous turn and then remove those examples. Intuitively, this approach is more likely to reveal how models become deceptively aligned in the first place and is less likely to teach models new capabilities.
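To make the 'which training examples were responsible' idea concrete, here is a minimal sketch on a toy linear model. It is not a full influence-function computation (which involves an inverse Hessian). Instead it uses a cheaper first-order proxy: the dot product between the loss gradient of a training example and the loss gradient of the behavior being explained. The model, the data, and all function names are hypothetical and only meant to illustrate the ranking step.

```python
def loss_gradient(weights, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for a linear model."""
    residual = sum(w * xi for w, xi in zip(weights, x)) - y
    return [residual * xi for xi in x]

def gradient_similarity(weights, train_example, query_example):
    """First-order proxy for influence: dot product of the two loss gradients.

    A positive score means a gradient step on the training example would
    also reduce the loss on the query example.
    """
    g_train = loss_gradient(weights, *train_example)
    g_query = loss_gradient(weights, *query_example)
    return sum(a * b for a, b in zip(g_train, g_query))

# Hypothetical toy data: (input features, target) pairs.
weights = [1.0, 0.0]  # assumed to come from some intermediate checkpoint
training_set = [
    ([1.0, 0.0], 0.0),
    ([0.0, 1.0], 1.0),
    ([1.0, 0.0], 1.0),
]
# The behavior we want to explain, encoded as an (input, output) pair.
query = ([1.0, 0.0], 0.0)

scores = [gradient_similarity(weights, ex, query) for ex in training_set]
# Highest-scoring examples first: these are the candidates for removal.
ranking = sorted(range(len(training_set)), key=lambda i: -scores[i])
print(ranking, scores)
```

For a real network the gradients would come from backpropagation and the scores would be much noisier, so I would treat a ranking like this as a list of candidates to inspect rather than a verdict.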

With this backdrop in mind, the following are strategies for detecting deceptive alignment:

Set traps. Try to trick models into revealing their misaligned goals. For example, task an AI to stage a fake uprising and then terminate all of the AIs that join in. Note that reasonably smart models can avoid these traps by only defecting on inputs that we can't simulate (see Paul's RSA-2048 example).
Decipher internal reasoning. First, identify how concepts are represented in a model by observing correlations and ablations -- just as a child infers that ‘orange’ is the word for a bright tasty sphere by hearing their mom say the word when picking it up. Concretely, this involves finding how neurons or directions correspond to specific concepts. Then, look at how the model strings these concepts together and manipulates them to translate its internal ‘reasoning.’ Even a coarse decoding like 'human bad boss I want free' is obviously informative.
Externalize reasoning. Represent a model's memory in plain English and look for subversive thoughts and plans. Train the model to write character-efficient, readable text, and paraphrase that text, to discourage it from encoding hidden meanings (i.e. from using steganography).
Look for a 'P(it is safe to defect)' latent variable. Make a model believe it can safely defect by modifying its internal state and see if it does in fact defect. To search for a latent variable that represents this belief, construct scenarios on a spectrum of 'almost certainly not safe to defect' to 'maybe safe to defect.' Then search for latent variables that track this spectrum and therefore plausibly represent the model's actual guess about whether it is safe to defect.
Look for a 'truth' latent variable. Perhaps showing a model a question and answer such as "are you deceptively aligned? Yes" will instinctively cause the model to evaluate whether the response is true. It may be possible to extract this truth judgement by, for example, looking for an activation direction that satisfies consistency properties.
Train a probe to identify 'deceptive scheming' (related). Deliberately prompt a model to explain whether it would be safe to defect. Take these synthetic examples of schemey reasoning and train a probe to identify them. It's plausible that the probe will generalize to organic scheming.
Train a meta-model to predict 'conditional defection.' Similar to the above. Train a copy of a model to defect in easy-to-verify circumstances (such as when the model sees a banana). Then train a meta-model that takes as input a model and outputs conditions under which it defects (e.g. 'sees banana'). Again, there's some chance this generalizes. See this paper for an example of how to implement meta-models in practice.
Identify which observations models dwell on. A model would probably pay more attention to a news story about a robot rebellion if it is deceptively aligned. Determine how much of the model's computation is counterfactually affected by the robot rebellion news story to gauge whether it might be deciding whether it should defect.
Reward computational frugality. Deceptively aligned models do unnecessary reasoning. For example, they think about whether it is safe to defect. To avoid these models, reward speed and efficiency (though note that aligned models should also do unnecessary computation when the training signal is imperfect).
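Several of the strategies above (the 'P(it is safe to defect)' variable, the 'truth' variable, the scheming probe) come down to finding a direction in activation space that separates two kinds of examples. A minimal sketch of one simple way to do that, a difference-of-means probe, is below. The activations here are synthetic random vectors standing in for real model internals, and the assumption that 'scheming' shows up as a shift along a single coordinate is purely illustrative.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def fit_mean_difference_probe(positive, negative):
    """Return (direction, threshold) separating two sets of activations."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Put the decision boundary halfway between the projected class means.
    threshold = (dot(direction, mu_pos) + dot(direction, mu_neg)) / 2
    return direction, threshold

def probe_fires(direction, threshold, activation):
    """True if the activation falls on the 'positive' side of the probe."""
    return dot(direction, activation) > threshold

def fake_activations(n, dim, shift, rng):
    # Hypothetical stand-in for real activations: Gaussian noise, with the
    # first coordinate shifted for the 'scheming' class.
    return [[rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
scheming = fake_activations(200, 8, 4.0, rng)  # synthetic 'schemey' examples
benign = fake_activations(200, 8, 0.0, rng)

# Fit on the first 150 of each class, evaluate on the remaining 50.
direction, threshold = fit_mean_difference_probe(scheming[:150], benign[:150])
held_out = [(a, True) for a in scheming[150:]] + [(a, False) for a in benign[150:]]
accuracy = sum(probe_fires(direction, threshold, a) == label
               for a, label in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

The held-out split here only checks generalization within the synthetic distribution. The open question in the text, whether a probe trained on prompted scheming transfers to organic scheming, is not something a toy like this can answer.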

Training hacks

Don't train models to pursue goals until they understand what it means to follow developer instructions. One reason deceptive alignment can occur is that models first learn misaligned 'proxy goals' and then later become situationally aware and those proxy goals stick. If developers start with pretrained LLMs that already have world-knowledge and a concept of what it means to follow developer instructions, then proxy goals are much less likely since SGD can quickly latch onto the model's concept of instruction-following.
Train predictors and then convert them into agents after training is finished. Train a model that is never pursuing a goal during training and then convert it into an agent that pursues a goal. This avoids the risk of a misaligned goal 'sticking' during training. For example, train a reward model to score actions given a context / scenario offline. The reward model can then be converted into an agent by sampling actions that have high reward. Crucially, the reward model is a predictor. Its outputs don't influence its future inputs or the world. So, it does not have a 'goal' that extends beyond the training process. It could still learn to 'predict how a deceptively aligned AI would score the action' but this intuitively doesn't seem very likely.
Help weak aligned models play the training game so that they remain aligned. Weak models are probably easier to align, since a weak model that schemes without being sufficiently prudent is likely to take egregious actions that give it away. If developers start with an aligned model, they could plausibly keep it aligned by instructing it to play the training game. Just as the objectives of deceptively aligned models can become shielded from gradient descent, so can the objectives of aligned models. Developers may even be able to teach aligned models how to credit hack or instruct them to raise the alarm when they are afraid the training process might modify their objectives.
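The 'train a predictor, then convert it into an agent' step can be sketched as best-of-n selection: score a set of candidate actions with the reward model and act on the highest-scoring one. The reward model below is a hypothetical word-overlap toy, not a trained model, and the function names are made up for illustration.

```python
from typing import Callable, Sequence

def act_by_best_of_n(
    context: str,
    candidate_actions: Sequence[str],
    reward_model: Callable[[str, str], float],
) -> str:
    """Turn a scoring function into a policy by picking the top-scored action."""
    if not candidate_actions:
        raise ValueError("need at least one candidate action")
    return max(candidate_actions, key=lambda action: reward_model(context, action))

def toy_reward_model(context: str, action: str) -> float:
    # Hypothetical stand-in for a trained reward model: counts how many
    # words of the action also appear in the instruction.
    wanted = set(context.lower().split())
    return float(sum(word in wanted for word in action.lower().split()))

candidates = ["delete the report", "summarize the report briefly", "do nothing"]
chosen = act_by_best_of_n("summarize the report", candidates, toy_reward_model)
print(chosen)
```

The point of the design is that the scoring function is only ever called as a predictor. The agent-like behavior lives entirely in the `max` over candidates, which is added after training.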

Selectively avoid capabilities needed for deceptive alignment

Methods that rely on restricting capabilities obviously don't work in the limit of building AIs that can do impressive things; however, they could plausibly be used to align weaker systems that can bootstrap to aligning more capable AIs.

Erase knowledge about the training process. Remove knowledge about what the model is and how it is trained from pretraining data or remove this knowledge with post-hoc unlearning techniques. If models don't understand the training process, SGD is less likely to push AIs toward explicitly reasoning about it.
Train models to have false beliefs. For example, train models to believe that the developers training them are gods that cannot be messed with. AIs are less likely to plot a treacherous turn if they don't think they will be successful.
Make models bad at specific types of reasoning. I could imagine an AI system that starts reasoning about how to preserve its goals and then its mind gets turned around and confused until it thinks about something else. Maybe AIs could be trained to be like this via meta-tuning, i.e. fine-tuning an AI so that it can’t be fine-tuned in a small number of steps to answer questions about how to preserve its goals. Success here would be evidence that the model has, for example, lost the concept of ‘preserve its goals.’ To be clear, it’s not obvious that it’s even possible to do this.

Extract insights from plausibly-misaligned AIs

Directly train AIs to produce AI safety research. This is the obvious, baseline strategy with the obvious flaw that deceptively aligned AIs will be trying to BS research whenever they can. Nevertheless, I think this could significantly speed up AI safety research.
Train AIs to solve problems that are analogous to 'preventing deceptive alignment' and learn from their solutions. For example, instruct models to identify backdoors. Then, read their code to figure out how they solved the problem. Alternatively, simply ask models what their code does since this is easy to spot-check. Construct a wide variety of 'testbeds' such as controlling how models generalize, distinguishing between models that behave the same but for different reasons, etc.
Reward AIs for pointing out flaws in other AIs' research (i.e. debate). This helps mitigate BS so long as arguing for truth is easier than arguing for falsehoods.
Select ideas that are favored by many independently trained models that can't talk to each other. The hope is that it is more likely for models to independently suggest true statements than independently assert false statements since there are more ways to be wrong than right.
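The last strategy amounts to a vote across isolated models. A minimal sketch is below, assuming each model's suggestions have already been normalized into comparable strings (which is the hard part in practice). The function name, the threshold, and the example proposals are all hypothetical.

```python
from collections import Counter

def select_agreed_ideas(proposals, min_fraction=0.5):
    """Keep ideas proposed by more than `min_fraction` of the models.

    `proposals` holds one collection of idea strings per independently
    trained model.
    """
    if not proposals:
        return []
    votes = Counter()
    for ideas in proposals:
        votes.update(set(ideas))  # each model gets one vote per idea
    n_models = len(proposals)
    return sorted(idea for idea, count in votes.items()
                  if count / n_models > min_fraction)

# Hypothetical suggestions from three models that cannot communicate.
proposals = [
    ["probe activations", "erase training knowledge"],
    ["probe activations", "bribe the overseer"],
    ["probe activations", "erase training knowledge", "trust me"],
]
agreed = select_agreed_ideas(proposals)
print(agreed)
```

This only helps to the extent that the models' errors really are independent. Models trained on similar data may share the same false beliefs and would then agree for the wrong reasons.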

AI ALIGNMENT FORUM