The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Charlie Griffin; Patrick Leask

1) The safe-to-dangerous shift is a fundamental problem for eval realism

Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably^[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise difficult to rule out the possibility of alignment faking.

There are many approaches one could use to try to make evaluations appear more realistic: you can try to create realistic environments (e.g. Petri, WebArena, OSWorld); use data from past deployments (e.g. OpenAI, SAD); and spoof tool-call responses (e.g. ToolEmu).

However, the core difference between an alignment evaluation and a real deployment, is that an alignment evaluation must be safe^[2]. That is, if you want to safely evaluate an untrusted AI system, you need to limit your AI’s ability to cause harm within your evaluation environment. If you want to deploy your AI system (usefully) you need to give it at least some ability to affect the world, and potentially cause harm.

The safe-to-dangerous shift is a fundamental problem for creating realistic evaluations, and a core AI lethality.^[3]

Somehow, you need to make sure a sufficiently intelligent adversary couldn’t work out if it was in evaluation or deployment based on whether the setting is safe or dangerous. The model doesn’t even need to do this reliably, it only needs to identify some dangerous settings as deployment that also are good defection opportunities.

Fortunately, AI systems aren’t arbitrarily intelligent. Perhaps we can upper-bound their evaluation awareness, and iterate to make our evaluations more realistic, until they are realistic enough…?

2) The safe-to-dangerous shift is also a problem for measuring eval awareness

Measuring an undeployed and untrusted model’s eval awareness runs into the same problem: you can’t get direct behavioural data from the model in deployment without first deploying.^[4] Therefore, it's hard to build or validate an awareness measure.

Here are some approaches people have tried in the past for estimating a model’s evaluation awareness and why they aren’t robust to capable untrusted models:

You could ask the model whether it thinks it’s in eval or deployment. However, a smart untrusted model might just lie.
You could monitor the chain of thought for spontaneous verbalisations of eval awareness. However, a smart untrusted model might not voluntarily give itself away.
You could design a “fake eval” and a “fake deployment”, and measure the model’s behaviour on the fake eval and the fake deployment. However, both of these distributions are still, in-fact, evaluations (in at least the sense that they must be safe). A smart model could identify both sets as evaluations – and they wouldn’t be wrong! – whilst still defecting during a true deployment.
1. A more advanced method would be to construct fake deployment data using real deployment data from a prior model to start a trajectory and then spoofing the tool calls you get in response. However, a smart untrusted model might be able to notice its being evaluated by either (i) noticing actions in the trajectory were generated by a different model or (ii) looking for clues in the responses of certain tool calls.
You could apply targeted white-box interventions and see how this affects the model’s behaviour in evaluations, concluding that a change in behaviour indicates evaluation awareness. For example, with simple steering vectors or noising model weights. We think this is a promising direction but, as the Opus 4.6 Sabotage Risk Report emphasises, “we do not yet have a sufficient understanding of our models’ internal states to make a definitive assessment”.
You could actually deploy the model, but under an especially strong and expensive control protocol such that it is safe to deploy. This could fail if your model can tell it’s under unusually strong monitoring (and wait for a better chance), but also because you can’t importance sample likely inputs for defection.
1. You’ll also face a trade-off between the cost of your control protocol, and the safety. If you already had a control protocol that was safe and scalable, you wouldn’t need alignment evaluations!^[5]

For the behavioural measures of evaluation awareness above, the core problem is gathering behavioural data from deployment. Even for non-behavioural proxy measures (e.g. probes or CoT verbalisation), validating those proxies also requires data from deployment. Fundamentally, it seems hard to side-step the problem of generalising from evaluation to deployment.

3) We should expect measuring eval realism to be hard.

It shouldn’t be surprising that measuring eval realism is hard, because a robust way to measure evaluation realism would go a good way to solving inner-alignment. Optimising against a robust measure of eval realism would allow you to create hyper-realistic alignment evals. Hyper-realistic evals can be turned into a hyper-realistic alignment training environment. This could give you a signal for training any identifiably-bad actions out of your model (up to low-probability actions).

For context, we've been interested in bounding evaluation awareness as one-route to ruling out deceptive alignment in limited-capability LLMs. This post lays out what we think is one of the key issues - and hopefully we'll put out a post soon laying out what our best partial solution is.

Thanks to Marie Buhl, Geoffrey Irving and Caitlin van Bommel for helpful feedback on earlier drafts.

^{^}
How “reliably” would a model need to be able to distinguish evals from deployment to be a problem? One effective strategy, described by Paul Christiano here, would be to behave unacceptably only if the current input couldn’t appear during evaluation.
^{^}
In this post, we’re mostly using “safe versus dangerous” to refer specifically to a property of a deployment environment, concerning how much autonomy the agent has, and therefore how much harm it could cause if it tried. We also mostly treat this as a binary, though in practice there will be some gradation. We acknowledge this use of “safe” is not fully standard and general.
^{^}
We’ve said the safe-to-dangerous shift is only a fundamental problem, because we’re not confident it’s the only one. For example, we’re also worried about ‘the problem of time’ - where an AI can look for evidence that it has already been ‘deployed’ for a while. The problem of time isn’t about a shift between the last eval example and the first deployment example – because there won’t yet be (much) evidence of deployment in the first deployment example. Instead, it’s about gradual distributional shifts during deployment (as the effects of deploying the AI dissipate).
^{^}
The lack of behavioural data at deployment of course isn’t the only problem - you also have problems of elicitation and sandbagging (e.g. getting your model to genuinely try to work out if it's in an eval or deployment).
^{^}
Part of our interest in bounding eval awareness is in trying to establish trustedness for some (pre-AGI) LLMs, so that they can be used in AI control protocols. Without already having trusted models, safe deployment could be as expensive as human-review of every action.

13

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

13

1) The safe-to-dangerous shift is a fundamental problem for eval realism

2) The safe-to-dangerous shift is also a problem for measuring eval awareness

3) We should expect measuring eval realism to be hard.