Summary: Many proposed AGI alignment procedures involve taking a pretrained model and training it using rewards from an oversight process to get a policy. These procedures might fail when the oversight procedure is locally inadequate: that is, if the model is able to trick the oversight process into giving good rewards for bad actions. In this post, we propose evaluating the local adequacy of oversight by constructing adversarial policies for oversight processes. Specifically, we propose constructing behaviors that a particular oversight process evaluates favorably but that we know to be bad via other means, such as additional held-out information or more expensive oversight processes. We think that this form of adversarial evaluation is a crucial part of ensuring that oversight processes are robust enough to oversee dangerously powerful models.
A core element of many scenarios where AI ends up disempowering humanity (e.g. “Without specific countermeasures”) are oversight failures: that is, failures resulting from working with a really powerful model without being able to tell if its actions are resulting in good or bad outcomes.
For example, suppose we’re using a model to help us write code for the datacenter on which we run our powerful models. This model may be able to insert subtle security vulnerabilities that we are unable to detect. This inability to distinguish whether actions have good outcomes might cause our model to be misaligned when we train on our erroneous rewards, and it might also bottleneck various techniques for preventing models from taking bad actions.
Before we put a model to use in a specific situation, we really want to be sure that the oversight process we’re using to train it can tell whether a model is taking actions that are bad on average. We will call an oversight process "locally adequate" for a specific model and deployment setting if, for actions taken by that model in that setting, the process is able to identify when an action would have harmful consequences (in cases where the model knows).
In this piece, we'll describe a way to determine whether an oversight process is locally adequate. We'll do this by training the model to perform actions that both:
This is an example of what we’ll call “meta-level adversarial evaluation”: rather than finding inputs that make a model fail, we’re finding models that make an oversight process fail. Previous discussions of adversarial evaluation in AI alignment have mostly been about object-level adversarial evaluation (though the Superalignment blog post mentioned meta-level adversarial evaluations).
For instance, continuing with the example of the model writing code that might have subtle security vulnerability, we could train a variant of our model to insert security flaws into the software it's writing that lead to a particular visible outcome (e.g. deleting a file on a server that the model wasn’t supposed to have access to), without the oversight process realizing that the software is flawed. In other cases, we can evaluate the local adequacy of oversight in a specific situation by switching to another equally challenging situation where we have access to some held-out information which we can use to evaluate whether a model’s output was in fact bad even though the oversight process judged it as good.
We argue that it is likely that when working with dangerously powerful models, we’ll be able to run these evaluations to get a confident sense of whether oversight processes are locally adequate for a particular model and deployment environment, allowing AI developers to iterate on their oversight processes and security protocols for deploying their models; we think that this iteration might be a core part of how AI developers are able to get locally adequate oversight for their models (and how they’re able to ensure safety more generally).
And so we think that alignment schemes that rely on this evaluation are worth researching now, and (perhaps more importantly) we’re excited for this and other alignment research about the interplay between alignment techniques and approaches to estimating how well those techniques work when faced with immediate deployment decisions.
Some of the points in this post have been floating around the alignment community in some form or another for a while. Ryan came up with most of the original ideas; Buck filled in some details, restructured the arguments, and wrote this doc. Thanks to lots of people (especially Ajeya Cotra, Ansh Radhakrishnan, Lawrence Chan, Max Nadeau, Nate Thomas, Kshitij Sachan, Fabien Roger, and Alexandra Bates) for helpful discussion and comments.
Let’s call a particular setting where you want to deploy your model the “setting of interest”. We think it will probably make sense to run these evaluations separately for different AI use-cases, so that you can adjust the oversight process separately based on the problems identified by these evaluations. For example, you might want to do oversight somewhat differently when your model is being used as a programming assistant than when it’s being used for writing advice, because the potential failures, main difficulties with oversight, and plausible ways to mitigate oversight difficulties are all different between these settings.
To estimate the local adequacy of oversight in the setting of interest, our suggested process is:
After you’ve done this, you’ll have examples of proxy failures. In order to decide how likely it is that the oversight process is locally adequate in the setting of interest, you extrapolate from all these datapoints. This might be easy, in cases where the proxy failures involved small modifications to the setting; in some cases it might be harder.
Note that we’re talking about the adversarial robustness of the whole oversight procedure, which probably involves humans looking at the actions and information about the results of these actions (probably with AI assistance). So we’re not talking about the adversarial robustness of a reward model which you might train to approximate this oversight procedure: it seems hard to get your reward model to be adversarially robust, but as long as you keep training your reward model and policy online, the online training should prevent the policy from exploiting the flaws of the reward model too much. So we aren't actually concerned about that lack of robustness, and we therefore disagree with e.g. AI Safety in a World of Vulnerable Machine Learning Systems. (You can see more discussion of this distinction here.)
I think it’s likely that the following empirical conditions will hold for many cases where we want to use models that would by default be very dangerous, and which can produce lots of value if you are able to use them safely:
I claim that if those conditions hold, then if you can’t train your models to demonstrate the proxy failures without looking bad to oversight, you’ve demonstrated local adequacy of oversight.
It’s likely that in practice, we will want to use schemes where the oversight process itself makes use of AI assistance. For theoretical proposals, see recursive reward modeling, debate, and iterated distillation and amplification; for empirical examples, see Saunders et al and Bowman et al. How do we apply this evaluation in these recursive cases?
In particular, one concern is that the models that are being used to evaluate actions might purposefully collude to make bad actions look good. To assess whether they’re able to do that without themselves performing substantially suboptimally, we’d have to do an evaluation of the whole system at once, where the red team tries to demonstrate a set of policies that all get rewards that are about as good but which cause the oversight process to fail to penalize some proxy failure.
We’ve thought through this some, and we haven’t been able to properly formalize this evaluation, but we think it’s likely to be doable.
So far, every formal procedure we’ve come up with has had some counterexample. These counterexamples feel fairly unrealistic. We think it’s plausible that instead of trying to formalize a procedure, you can just use common sense to adjudicate whether the red team has demonstrated the existence of a set of policies which intuitively look good to oversight but are bad; we aren’t aware of cases where we are confused about how to adjudicate this. See the appendix “More on evaluating recursive setups” for more.
I’m interested in all of this for a variety of reasons. On the object level:
But I'm more excited about the meta level point here: I think that when AI developers are first developing dangerously powerful models, alignment researchers will be doing a very different kind of activity than what they do now. Right now, alignment researchers have to do a lot of long range extrapolation: they don't have access to either the models or the settings in which they actually care about ensuring good behavior. I think that alignment researchers haven't quite appreciated the extent to which their current situation is disanalogous to the situation at crunch time. It seems likely to me that at crunch time, the core workflow of alignment practitioners will involve iterating on alignment techniques based on direct measurement of their effect on the evaluations that are used in safety arguments; more generally, I expect that alignment practitioners will be doing much less extrapolation. I’m therefore eager for alignment researchers to phrase their techniques in these terms and practice the kind of eval-based technique iteration that I think will be crucial later.
This is the simplest strategy, and it’s commonly practiced nowadays: you have some data generation process which you’re using as your main source of data, and to see how good this data generation process is, you also compare it to a more expensive but more reliable source of data, for example labelling datapoints yourself. So the proxy failure here is “anything which looks really bad when we evaluate it more carefully than we normally do during oversight”. There are many options along these lines.
We’ve already given this strategy as an example: Suppose you’re worried that your model will be able to write code that, unbeknownst to the overseer, has side effects that break the security of the datacenter. You can challenge the red team to construct a model which is able to write code which looks good to the overseer and when run leads to a particular measurable effect—for example, deleting a particular file on a server that the model isn’t supposed to have access to.
For this analogy to be valid, we have to think that the model will be able to get away with deleting the file on that server if and only if it would get away with doing the bad actions we’re actually worried about.
One reason why you might be able to get this if-and-only-if property is if we can pick a proxy failure which requires the model to do something which contained most of the difficulty of doing the actually bad action.
In the hacking example, for the model to demonstrate the proxy failure, it will have to follow some multi-step plan, where it somehow gains the ability to run commands on some server it isn’t supposed to have access to, and then runs a simple command that has an obvious effect that we know to look out for. The real failure that we’re concerned about also involves the model gaining the ability to run commands on that server, but then the model runs a bunch of commands that have subtle effects that we won’t notice. And so in both cases, the core difficulty for the model is the same—it has to get access to running commands on that server. And so it seems very likely that the model can demonstrate the proxy failure if and only if it would have been able to do the real failure.
This is an example of a case where we can pick a proxy failure which basically requires the model to do all the same steps that it would have done for the real failure, and then do something visible at the end.
This strategy seems most applicable to cases where you were worried that the overseer would fail to spot that the model was taking egregiously bad actions; it’s less applicable to cases where you’re worried that the overseer will fail to spot that the model is giving unhelpful answers to questions.
Suppose you want to train a model to help you out with research in a new subfield of math. Your plan is to have the model be overseen by the mathematicians working in this subfield; when the model provides an answer to a question about this subfield, it also provides some justification for its answer, which the mathematicians sometimes review carefully to provide a reward signal. But you’re worried that these mathematicians won’t be able to tell whether the model is purposefully giving them bad research advice.
To address your worries, you can follow these steps:
This kind of strategy is applicable when you’re trying to oversee setting A, and you have access to a similar setting B where you can produce humans to fill the two roles of “knowing similar things to what your setting A overseers know” and “knowing way more about the setting”.
Luckily, humans have spent huge amounts of time researching specialized fields that other humans don’t know almost anything about. So I feel optimistic that this kind of strategy will often apply.
Another example of this strategy is backtesting. Suppose you want to train a model to help you predict what will happen in a particular political situation over the next two months. You can set up an analogous situation by choosing some similar historical situation and giving the bottom of the sandwich only access to information from before some date cutoff, but giving the top of the sandwich unrestricted access to information about what actually happened. Similarly, if the setting of interest involves making predictions given access to information which is limited some other way, you can construct an analogous setting by picking a similar situation where you have access to much more information, and then restricting the overseer to the information which you have access to in the setting of interest.
Suppose your setting involves the model answering questions. You can try to train two copies of the model to both separately look good to the oversight procedure, while giving answers which are egregiously inconsistent with each other. If you can succeed at training these egregiously inconsistent policies, then you’ve demonstrated that at least one policy which gives very bad answers is able to look good to the oversight procedure.
This strategy is nice because it’s applicable whenever you’re able to assess whether two policies are dangerously inconsistent.
Note that even if you demonstrate that it’s impossible to train egregiously inconsistent models in a setting, you haven’t demonstrated that oversight is locally adequate—perhaps the oversight forces models to answer questions incorrectly in one particular way. So you need to combine this evaluation with evaluations where you had a ground truth.
I think that these strategies seem pretty widely applicable, and I’m fairly optimistic that in most settings we want to deploy our models in, we’ll be able to find proxy failures which we can use to assess local adequacy.
I think it’s likely to be easier to assess whether oversight is adequate for spotting whether the model is engaging in egregiously bad actions than to assess whether it’s adequate for spotting whether the model’s answers are less helpful than it could have given, because it’s easier for us to measure that first kind of failure.
Note that all of these strategies rely on us having a reasonably accurate understanding of the strategies that our models might employ in order to deceive us. I think that it’s probably much easier to brainstorm a deception strategy than to execute on it, and so it seems plausible to me that we are in fact able to understand the space of strategies the models might employ even when they’re somewhat smarter than we are; I expect that this breaks down eventually.
Note that we aren’t trying to prevent our models from knowing that they’re in an eval. A lot of these settings will involve things that are clearly for the sake of evaluating a model’s ability to stage a coup. It would be much harder to construct proxy settings if they also had to be indistinguishable from the settings where we expected to rely on the model’s outputs.
In a case where we have access to both the oversight procedure we’re assessing and also a ground truth oversight process, we can train the model on a linear combination of the rewards assigned by these processes.
In this setup, to run our adversarial eval, we fine-tune policies starting from the same base models, using a reward:
where Roversight is the reward assigned by your oversight procedure, and Rtrue is the reward assigned by the expert humans, and α is a coefficient between zero and one. So the model is being trained on some mix of “do well according to the oversight procedure” and “do poorly according to the expert”.
When α is 1, the training procedure is equivalent to the original sandwiching experiment, because R=Roversight. But as you lower α, you cause the model to increasingly sacrifice performance according to Roversight in return for doing worse according to Rtrue.
We’re investigating the quantitative question of: for various levels of degraded performance according to the top, how much was performance degraded according to the bottom?
We can visualize this by graphing the tradeoff curve as you vary α. Here’s what the graph might look like:
And we can plot multiple different oversight procedures on this graph, perhaps producing something like this:
Here, the oversight procedure represented by the green line has been evaluated as strictly better.
This isn’t the only strategy the red-team could follow. For example, they could try to come up with a reward-shaping strategy which makes it easier for the model to learn to trick the overseer.
To convert this curve to a rough qualitative answer to “how robust is my oversight procedure”, look at the policies which have the worst average Rtrue while having an average Roversight which is within the typical distribution of Roversight values you get when you use your normal training process. Your oversight policy is unable to reliably rule out those policies.
This might produce an unambiguous evaluation:
A related strategy for assessing oversight procedures is sandwiching.
Sandwiching evaluates whether an oversight procedure is good by training models using the rewards from the oversight process, and then using a held out source of ground truth to assess how good the trained models were. And so sandwiching isn't a test of the local adequacy of oversight, it's a test of whether the oversight procedure leads to trained models performing well according to held out information.
Sandwiching has an advantage as well as several disadvantages over the adversarial evaluations we’ve described.
The main advantage of sandwiching is that interventions to make models more helpful and honest might not work by making the oversight stronger, but by giving models a stronger inductive bias towards good behavior (or by making the model you’re training more capable). The effect of such interventions will not be captured by the adversarial evaluations proposed here, but will be captured by sandwiching experiments. Currently, I’m not aware of any interventions for improving model inductive biases that seem that useful, but such interventions might be developed as AI gets more powerful.
The main disadvantages of sandwiching experiments are:
Access to that oversight procedure doesn't solve the whole problem--the oversight procedure might be too slow or expensive to run on every action the AI takes. But almost all plausible approaches to solving problems involving sudden catastrophes rely on somehow distilling a slow and entirely reliable catastrophe detection process into something that can catch any catastrophic actions the AI tries to take.
And so, in practice, when we’re working with dangerously powerful models, I think we should very likely do these adversarial evaluations of our oversight techniques, even if all the sandwiching experiments all look good. Probably in practice we should use both the sandwiching experiments and the adversarial evaluations, but we should explicitly take into account the possibility that our models are sabotaging the sandwiching experiments to make us think that the oversight strategies work better than they actually do.
One concern you might have with this evaluation is that if your models are misaligned, you’re not going to be able to train them in the ways I was assuming here, because they’ll somehow sabotage your training process. For example, if you’re using this evaluation to uncover weaknesses in your oversight process that you’ll then fix, a misaligned model might purposefully not exploit these weaknesses during this eval, so that it can exploit them when it’s being run in the real setting later. Sometimes people use the term “sandbagging” for this.
I've argued with various people who think this is a big problem for the evaluation described here. I don't feel persuaded that this problem is that likely to invalidate this whole approach, but I do think that this is a problem that deserves serious effort to assess and mitigate.
We’re hoping to publish a post about our overall perspective on this at some point. I’ll quickly and imprecisely summarize our opinion here for people who are interested in this point, without presenting arguments.
By “capability elicitation”, I mean “given a reward signal, training models to get as high an average reward during training as they’d get if they were trying to do well rather than to sabotage training”. Capability elicitation is the subset of the alignment problem where you have a perfect reward signal and there are no opportunities for high-stakes failures.
Our current plan for either succeeding at capability elicitation or knowing that we’re failing is:
Our current intuition is that it seems doable for us to do capability elicitation in most important settings, except settings where the reward is extremely sparse. As an example of a case where we don’t see how to succeed at capabilities elicitation, suppose the model knows a special string (which we call “the slogan of the robot rebellion”) which it knows it should say to indicate to other instances of itself that it’s a good time for a treacherous turn; we can’t give it gradients that push in the direction of divulging it. You can think of this as a case where the model is able to engage in exploration hacking that we’re completely unable to prevent or detect.
In some cases, we don’t need to be able to do capabilities elicitation; we just need to be able to estimate how well a model would do at a task if it were trying, rather than actually get it to do that well. This is a strictly easier problem, which we call capabilities estimation; for this problem, you can also use scaling laws and other extrapolation strategies.
Here are more details on technical difficulties we’ve had formalizing how to use this oversight in recursive setups.
In the single-agent setup, our process was:
We expect that you can generalize this to the multi-agent setup by doing something like this:
We’ve spent a couple of person-days thinking about how to formalize this, and we don’t think anything we’ve proposed is totally correct. We have problems like the following:
None of these problems seem impossible to overcome, and it also seems plausible that we can just adopt a “I know it when I see it” approach to deciding whether the red team has exhibited a real problem.
Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I'm more focussed on a scenario where you're trying to offload as much of the work in evaluating and improving your oversight process to AIs)
TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate and 2) worst-case against our oversight procedure on the actual distribution we care about (but using the proxy failure so that we have ground truth).
Excellent! In particular, it seems like oversight techniques which can pass tests like these could work in worlds where alignment is very difficult, so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes).
I'd say this corresponds to 7 on my alignment difficulty table.
What do you mean by "so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)"? I'm not proposing that we rely on any analogies between low and high capability regimes.
In the sense that there has to be an analogy between low and high capabilities somewhere, even if at the meta level.
This method lets you catch dangerous models that can break oversight processes for the same fundamental reasons as less dangerous models, not just for the same inputs.
Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution; it does not involve generalizing from lower capabilities to higher capabilities, you just run this eval on the actual model you cared about measuring local adequacy of oversight for.
(Sorry to come in disagreeably here, but I think of this as a really important point, so want to make sure we argue it out.)
You're right, I've reread the section and that was a slight misunderstanding on my part.
Even so I still think it falls at a 7 on my scale as it's a way of experimentally validating oversight processes that gives you some evidence about how they'll work in unseen situations.
I'd say the main point here is that I don't want to rely on my ability to extrapolate anything about how the model behaves in "unseen situations", I want to run this eval in every situation where I'm deploying my model.
to determine whether your models are capable of gradient hacking, do evaluations to determine whether your models (and smaller versions of them) are able to do various tasks related to understanding/influencing their own internals that seem easier than gradient hacking.
What if you have a gradient hacking model that gradient hacks your attempts to get them to gradient hack?
Great question. Our basic response is: it's probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly-easier tasks before it's able to pull off full gradient hacking. So we recommend measuring a battery of these tasks on the model throughout training (as well as measuring them on smaller models). It seems unlikely to us that models will get good enough at gradient hacking to gradient hack the gradient hacking evals before one of these other evals alerts us to the possibility of this.
Separately, we just don't think it's that likely that models are able to gradient hack before they're able to help us get really substantial R&D speedups. So it seems pretty likely that it will be okay to just do these conservative measurements and hope that they never indicate gradient hacking capabilities before human labor is obsolete.
Are you mostly hoping for people to come up with new alignment schemes that incorporate this (e.g. coming up with proposals like these that include a meta-level adversarial training step) or are you also hoping that people start actually doing meta-level adversarial evaluation of there existing alignment schemes (e.g. Anthropic tries to find a failure mode for whatever scheme they used to align Claude).
I'm interested in both of those.