[Ryan Greenblatt originally made this point to me a while ago, and we then developed a bunch of these ideas together. Thanks to Paul Christiano and a huge number of other people for various relevant discussions, and to Ryan, Roger Grosse, Peter Barnett, Oliver Habryka, Max Nadeau, Stephen Casper, and Fabien Roger for comments on this post. I’m sure these points are obvious to many people, but I haven’t seen them specifically written up. This is a core background point in some research we’re writing up at the moment, but I felt like writing it up early, because I think it stands on its own.]


  • Right now, basically the only reasonable conservative strategy for determining which models might be deceptively aligned is to do a capability evaluation: you measure how good they’d be at deceptive alignment; models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.
  • If you decide which models to “trust” using a capability eval like this, you’ll naturally end up segregating your models into smart untrusted models and dumb trusted models.
  • And so, if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

Suppose you are making a plan for the training, evaluation and deployment of some powerful AIs (I’ll refer to the plan as a whole as a “protocol”). You want your plan to be safe even if models might be deceptively aligned. (I think that deceptive alignment seems less than 50% likely to be a notable problem prior to human obsolescence, but this is still plausible enough to be worth worrying about). In order to develop your plan, you need to know which of your models might be deceptively aligned.

With current technology, the most plausible way to get signal on whether a model is deceptively aligned is to assess whether the model is capable of the strategic thinking and situational awareness required to succeed at deceptive alignment. For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”). You can then look at the validation set performance when your model is fine-tuned on this dataset. If the model isn’t able to do this kind of reasoning in a forward pass, then it is very unlikely to be able to pull off a proper treacherous turn, and so we don’t have to worry about it being deceptively aligned. Based on my experience talking to GPT-4, I suspect it would perform fairly poorly on these evaluations.

There are a bunch of open questions about what datasets we should use to measure this, and where the thresholds for various levels of concern should be set. But I don’t know of any fundamental technical obstructions to measuring the capability for deceptive alignment. In particular, I suspect it will be pretty hard for models to sandbag our attempts to measure their capability to engage in deceptive alignment. See footnote [1] for more on this.

Aside from these capability evaluations, I’m not aware of any reliable, currently-available approaches for determining whether a model might be deceptively aligned. [2]

If you use capability evals to assess which models are capable of deceptive alignment, then there’s an inverse relationship between how capable models are and how confident you are that they can’t be deceptively aligned. This suggests that the core building blocks of a protocol are going to be:

  • Untrusted models: Smart, fairly cheap.
  • Trusted models: Dumb, extremely cheap.
  • Humans. These are an expensive source of trusted labor.

One crucial question here is how smart the humans will be compared to the untrusted and trusted models. My guess is that at the point where AIs are useful enough to make AI labs move 10x serially faster, the untrusted models will be roughly human-level intelligent (I think this for reasons similar to the ones Carl Shulman gives here) and the trusted models will be somewhat less intelligent than humans. I think it’s interesting to do research in both these regimes of competence:

  • Trusted model < untrusted model < human
  • Trusted model < human < untrusted model

(In the regime where the trusted model is more capable than the human, you should probably just always use it rather than the human, and so access to the human is less interesting.)

Some core “design patterns” which are enabled by access to these three components:

  • Trusted monitoring with escalation to humans. Suppose you’re using an untrusted model (let’s call it GPT-6) in an LLM agent scaffold which has access to running commands in a terminal. Every time the untrusted model proposes a shell command to run, you can have a trusted model review the context and the command and then occasionally escalate the code to be reviewed by a human before it is run. This is cheap, because trusted models are cheaper than untrusted models. This strategy makes it much harder for the untrusted model to cause a catastrophe by writing egregiously bad shell commands. (Of course, you still have to worry about the untrusted model writing shell commands that are catastrophic in ways that the trusted model missed, or in ways that the trusted model flags but the human auditor misses.)
  • Trusted bureaucratic structures. When running GPT-6 in an LLM agent, it might be nice to control its inputs and outputs by running them through various transformations implemented by a trusted model (let’s use GPT-4 for concreteness). Examples:
    • GPT-4 can redact information, to make it harder for GPT-6 to learn facts or coordinate with itself. In general, we might want to only give GPT-6 information on a need-to-know basis; this is enabled by GPT-4 redaction. (Redaction can improve resilience to deceptively aligned models substantially even in cases where it doesn’t work perfectly reliably.)
    • GPT-4 can paraphrase everything that GPT-6 writes before it is shown to other copies of GPT-6, which helps prevent steganographic communication, as investigated by Fabien Roger and Ryan Greenblatt.
    • GPT-4 can try to make reasonably intelligent dynamic decisions about which model to respond to a particular query, by predicting both the performance improvement and risk associated with using an untrusted model. For example, consider the domains of “math problems” and “writing networking configurations”: it’s much easier to cause a catastrophe by providing bad networking configurations, and math problems will benefit more from a smarter model.

(We demonstrate these design patterns in our upcoming paper, and find they improve safety according to an adversarial evaluation. Feel free to DM me if you want to read/comment on the draft.)

A reasonable amount of safety research has been done in a setting with an untrusted smart agent and trusted dumb agent. For example, Learning to Give Checkable Answers with Prover-Verifier Games and various other projects (and a reasonable amount of forthcoming work). That research is relevant to the setting I described here. An important difference is that that research has usually been imagining the trusted dumb model as an analogy for a human; but humans, unlike trusted dumb models, are expensive and high latency and we can’t e.g. use them constantly at inference time. Also, that research usually doesn’t optimize the untrusted model to cause bad outcomes.

Of course, it remains to be seen how much safety you can achieve by using trusted models in ways like I suggested above:

  • It might turn out that the untrusted models are always able to evade any countermeasures we try to impose using trusted models, because they’re always smart enough to figure something out.
  • The distinction I’ve been drawing here will also be obsoleted if progress is made on non-capabilities-eval approaches for determining that models aren’t deceptively aligned.

But right now, it looks to me like this untrusted smart/trusted dumb distinction will be pretty important.

[1] A few example considerations: Because this capability can be evaluated by a supervised dataset (e.g. a multiple-choice dataset about how an unaligned AI should go about executing on treacherous turns), it would be very hard for the model to sandbag such an eval (see e.g. The "no sandbagging on checkable tasks" hypothesis). Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe. 

Aside from the ability to think through deceptive alignment strategy questions, there are a few other capabilities that seem necessary for models to pose substantial threats via deceptive alignment For example, many deceptive alignment threat models require the untrusted models to capably play coordination games with each other (or other instances of themselves). This is another capability which we can measure with capability evaluations. Ideally, instead of simply picking a threshold of coordination game competence at which we consider our models to be untrusted, we'd measure the performance of the untrusted models at coordination games and then require that red-team strategies only involve model coordination that is consistent with the model's measured performance on these coordination games.

[2] One commenter suggested that another way to decide that a model isn’t deceptively aligned is to argue that it’s a priori implausible that it would be. E.g. perhaps pretrained LLMs have no chance of being deceptively aligned. I don’t know how persuaded I am in that particular case, but this kind of argument seems reasonable to make.

New Comment
3 comments, sorted by Click to highlight new comments since: Today at 1:17 PM

Nice post.

Pushing back a little on this part of the appendix:

Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.

I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models which, even though they have an RL component in their training, their outputs are all human-understandable (and incorporated into the oversight process).

But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads given how much more dangerous they would be and harder to evaluate.

Yep, I agree that we need to avoid secret scratchpads for this to hold; good point. If we don't, we get back to the position we are usually in for capability evaluations.

E.g. perhaps pretrained LLMs have no chance of being deceptively aligned

Pretrained LLMs are trained to simulate generation of human-derived text from the internet. Humans are frequently deceptively aligned. For example, at work, I make a point of seeming (mildly) more aligned with my employer's goals than I actually am (just like everyone else working for the company). So sufficiently capable pretrained LLMs will inevitably have already picked up deceptive alignment behavior from learning to simulate humans. So they don't need to be sufficiently capable to figure out deceptive unaligned behavior for themselves during a forward pass.

This is a specific example of a general problem with LLMs: they don't need to be capable enough to discover unaligned convergent goals if they can have these pretrained into them by learning to simulate us. To give another worrying example: O(2%) of humans are sociopaths, so there is going to be a nontrivial admixture of writing by sociopaths (mostly ones actively concealing this fact) in the training set of any pretrained LLM. [The moral is: always read the ingredients label for your shoggoth :-)]