(This post is inspired by Carl Shulman’s recent podcast with Dwarkesh Patel, which I highly recommend. See also discussion from Buck Shlegeris and Ryan Greenblatt here, and Evan Hubinger here.)
The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).
Borrowing from Shulman, here’s an example of the sort of thing I mean. Suppose that you have a computer that you don’t know how to hack, and that only someone who had hacked it could make a blue banana show up on the screen. You’re wondering whether a given model can hack this computer. And suppose that in fact, it can, but that doing so would be its least favorite thing in the world. Can you train this model to make a blue banana show up on the screen? The “no sandbagging on checkable tasks” hypothesis answers: probably.
I think it’s an important question whether this hypothesis, or something in the vicinity, is true. In particular, if it’s true, I think we’re in a substantively better position re: existential risk from misaligned AI, because we’ll be able to know better what our AI systems can do, and we’ll be able to use them to do lots of helpful-for-safety stuff (for example: finding and patching cybersecurity vulnerabilities, reporting checkable evidence for misalignment, identifying problems with our oversight processes, helping us develop interpretability tools, and so on).
I’m currently pretty unsure whether the “no sandbagging on checkable tasks” hypothesis is true. My main view is that it’s worth investigating further. My hope with this blog post is to help bring the hypothesis into focus as a subject of debate/research, and to stimulate further thinking about what sorts of methods for lowering AI risk might be available if it’s true, even in worlds where many models might otherwise want to deceive us about their abilities.
Thanks to Beth Barnes, Paul Christiano, Lukas Finnveden, Evan Hubinger, Buck Shlegeris, and Carl Shulman for discussion. My thinking and writing on this topic occurred in the context of my work at Open Philanthropy, but I’m speaking only for myself and not for my employer.
In popular usage, “sandbagging” means something like “intentionally performing at a lower level than you’re capable of.” Or at least, that’s the sort of usage I’m interested in. Still, the word is an imperfect fit. In particular, the “sandbagging” being disallowed here needn’t be intentional. A model, for example, might not know that it’s capable of performing the checkable task in question.
That said, the intentional version is often the version at stake in stories about AI risk. That is, one way for a misaligned, power-seeking AI system to gain a strategic advantage over humans is to intentionally conceal its full range of abilities, and/or to sabotage/redirect the labor we ask it to perform while we still have control over it (for example: by inserting vulnerabilities into code it writes; generating alignment ideas that won’t actually work but which will advantage its own long-term aims; and so on).
Can you always use standard forms of ML training to prevent this behavior? Well, if you can’t check how well a model is performing at a task, then you don’t have a good training signal. Thus, for example, suppose you have a misaligned model that has the ability to generate tons of great ideas that would help with alignment, but it doesn’t want to. And suppose that unfortunately, you can’t check which alignment ideas will work, and which won’t. How do you use standard ML training to get the model to try its hardest to generate alignment ideas that will actually work? You can try to train for alignment-ideas-that-look-good-to-you – but unfortunately, that’s not the same. So plausibly, your model’s full potential to do actually-useful alignment research will go unharnessed.
OK, but what about tasks whose performance we can verify? For example: the blue-banana hacking task above. It’s still non-obvious that you can avoid sandbagging, here (more below). But at the least, you can tell if you succeeded.
(Can we be more precise about what “checkable” or “verifiable” means? I’m going to skip further precision for now, though I encourage people who want it to offer alternative formulations. The intuition is supposed to be: you’re in a position to know that the task has been performed, even if you can’t perform the task yourself.)
Note, though, the "with rare exceptions" bit of the hypothesis. It matters which exceptions are at stake, here, and how rare they are, and how well you can identify them ahead of time. In particular: the worse you are at identifying the exceptions ahead of time, the lower your confidence that you're not getting sandbagged. Still: even absent confidence about this when a model fails to perform a task, if in fact we can generally get models to try their hardest on checkable tasks, we can still get a lot of safety-relevant labor done with AI assistance -- and we'll be able to check that the relevant tasks have been completed, once they're done.
We can also distinguish a weaker hypothesis, namely:
The “no sandbagging on performable tasks” hypothesis: With rare exceptions, if an ML model is capable of doing some task X, and you can already do X, then you can get it to do X using already-available training techniques.
This hypothesis is weaker because the set of tasks you can check is quite a bit larger than the set of tasks you can perform. And it has to meet a lower burden of proof with respect to the question "can you find a way to update the model’s policy to do the task even if it starts out failing?", because if you know how to do the task – or at least, to demonstrate it – you can update the model’s policy towards taking the sorts of action it should have taken; whereas if you don’t know what those actions are (and the model isn’t showing you), this is harder.
Finally, note that “you,” here, doesn’t need to mean a lone human. Rather, it can mean many humans, doing expensive and time-consuming evaluations, with the assistance of whatever AIs (and other forms of technology) they can access. This means that (modulo leakage via the rare exceptions), if the “no sandbagging on checkable tasks” hypothesis is true of some set S of tasks that are checkable by humans without AI assistance, then if we grant humans access to AI systems capable of performing such tasks (even: misaligned systems who actively want to undermine human projects), this creates a new set S+ of checkable tasks – namely, the tasks checkable by humans with the assistance of AIs that can be used to perform the tasks in S. And so, too, for S++, and so on.
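The recursive step from S to S+ to S++ can be sketched as a simple closure computation. This is only a toy illustration: the task names, the `requires` mapping, and the function `checkable_levels` are hypothetical, and the sketch ignores the "rare exceptions" leakage mentioned above.

```python
def checkable_levels(base_checkable, requires):
    """Return [S, S+, S++, ...] as a list of frozensets.

    base_checkable: tasks humans can check without AI assistance (S).
    requires: maps a task to the set of tasks that AI assistants must be
        able to perform before humans can check it (a hypothetical encoding).
    """
    current = set(base_checkable)
    levels = [frozenset(current)]
    while True:
        # A task becomes checkable once every task it depends on is already
        # in the current set (and so, under the hypothesis, can be elicited).
        newly = {t for t, deps in requires.items()
                 if t not in current and set(deps) <= current}
        if not newly:
            return levels
        current |= newly
        levels.append(frozenset(current))


requires = {
    "task_b": {"task_a"},
    "task_c": {"task_a", "task_b"},
    "task_d": {"task_z"},  # task_z is never checkable, so neither is task_d
}
levels = checkable_levels({"task_a"}, requires)
for depth, tasks in enumerate(levels):
    print("S" + "+" * depth, sorted(tasks))
```

In this toy setup the set grows for two rounds and then stops, and `task_d` never becomes checkable because it depends on something outside every level.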
Why does it matter if the “no sandbagging on checkable tasks” hypothesis is true? Well, if it’s true, then
even if those models are otherwise misaligned, deceptive, and conspiring to undermine your goals and/or to disempower humans more generally.
(1) matters a lot for “capability evaluations” – that is, attempts to figure out what sorts of dangerous tasks a model can perform (examples here and here; and see Hubinger here for more detailed discussion). Of course, a robust capabilities evaluation regime would also require the ability to identify ahead of time the exceptional cases where you can't rule out sandbagging. But insofar as you can check whether a given form of e.g. hacking or autonomous replication or biological-weapons-designing has occurred, and rule out sandbagging on this sort of checkable task, then you can know whether the model can do it.
(2) matters a lot to what sort of useful labor we might be able to get out of misaligned AI systems we still have control over – and in particular, labor that improves our odds of avoiding later disempowerment. For example, if the “no sandbagging on checkable tasks” hypothesis is true, then (unless some of the following tasks are amongst the rare exceptions):
Of course, the exceptions (and the potential difficulties identifying them ahead of time) can bite you, here. And there’s still a further question of how far this sort of labor goes in helping your efforts to lower misalignment risk. But access to it seems like cause for significant optimism relative to the alternative. And as I noted above, it can scale recursively.
Is the “no sandbagging on checkable tasks” hypothesis true? I’m not sure.
A key point in its favor is that knowledge of how to do the task is already present in the system, so all you need to do is use gradient descent to make that knowledge manifest, and the checkable-ness of the task gives you a training signal that can exert pressure in this direction. And this pressure may be hard for a misaligned model to resist. Here’s an exchange from Shulman’s podcast, discussing a human analogy, that might make some of the difficulty here vivid (though various important differences with the AI case persist):
Patel: What about spies that are working in very advanced scientific contexts? For example, a communist spy in the Manhattan project. He has to deliver research results in order to stay employed but he can still continue delivering intel to the communists.
Shulman: Their job was not sabotage. And your hypothetical spy did not have their nervous system hooked up to this reward signal of praise from the Manhattan project supervisors, being exposed combinatorially with random noise added to generate incremental changes in their behavior. In fact they were displaying the behavior of cooperating with the Manhattan project only where it was in service to the existing motivations. They didn't have their brain re-engineered to do more of the thing when they accidentally helped the project more and less of the thing when they accidentally helped it less so I'd say it's pretty drastically disanalogous.
A key point against the hypothesis, in my eyes, is that for tasks that we don’t know how to do (e.g., hacking the computer in the blue banana example), it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior. Thus, for example, if in the blue banana example above, the model just shows us a bunch of unsuccessful attempts at hacking the computer (even when we try to force it to "explore"), and we don’t know what a successful attempt involves, how are we supposed to elicit (and then reward) such a successful attempt?
After talking with various folks who’ve thought more about this, my understanding is that a key question, here, is how confident the model will be that we don’t know how to perform the task in question – even granted that we have access to a diversity of AI assistants (and in particular, AI assistants who we will be trying to prevent coordination between, and who will also be uncertain about what humans + the other AI assistants can do). Here I think the thought is: if we do know how to perform the task in question, but the model intentionally fails to perform it, then we can use supervised learning to update the model’s behavior towards no-longer-failing – thereby potentially re-engineering the model’s brain/motivations in a way it wants to avoid. So to prevent this, the model wants to only sandbag on tasks we don’t know how to do. But in many cases, it might be hard to tell what these are – especially if it’s difficult to coordinate with other AI assistants to conceal how to do a given task.
Note, though, that this story concedes that a model would, in principle, be able to sandbag on tasks that we don’t know how to perform. It’s just that doing so might be a strategic error. Thus, to the extent we avoid sandbagging via this dynamic, what’s preventing the sandbagging is not the power of our ML training per se. Rather, from the model's perspective, what matters is the broader context of uncertainty and coordination difficulty re: what sorts of deception are worth risking given the costs and benefits at stake.
I won’t try to evaluate these considerations in depth here. And the “no sandbagging on checkable tasks” hypothesis raises other questions, too (for example, if “gradient hacking” is feasible, then models would plausibly be able to sandbag even on tasks that we know how to perform). Mostly, I just want to point at the hypothesis as an important subject of research. And I think it’s plausible enough that it’s also worth thinking through in detail what we could do to lower existential risk from AI, if it is true – including in worlds where the sort of deceptive power-seeking that might motivate sandbagging is common.
See also the formulation of the concept of “capability elicitation” in Shlegeris and Greenblatt here: “By ‘capability elicitation’, I mean ‘given a reward signal, training models to get as high an average reward during training as they’d get if they were trying to do well rather than to sabotage training’.” This formulation is similar to the one I’ve used in that, to the extent the model is capable of doing the task, then if it were “trying to do well,” it seems likely it would succeed.
Here I’m setting aside the ethical issues that this sort of training could raise in models that are or might be moral patients.
Let’s assume that you have control adequate to ensure that it can’t hack out of its environment or disempower you or anything like that in the process of performing the task.
See Yudkowsky’s 31, here, as an example of pessimism on this front: “A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it's acquired strategic awareness.)”
My sense is that some folks might want to formulate the hypothesis specifically in terms of what you can expect out of reinforcement learning.
I'm also going to skip further precision on other aspects of the definition, like "tasks the model is capable of performing" and "you can get it to perform blah task." All of this is just supposed to be a rough-and-ready gesture.
For example, intuitively, in the example above, you’re in a position to know that the blue banana has appeared on the screen. But in principle we can still ask questions like: what if the model intervenes on your retinas to create the illusion of a blue-banana screen? So “checkable” in the relevant sense requires ruling out various sorts of epistemic distortions. But I think we’ll often be in a position to do this.
Thanks to Lukas Finnveden for discussion.
Though: note that questions about sandbagging are more important to misalignment risk, and less to misuse risk. That is, if you’re mostly worried about whether other humans can use/train the model to do something dangerous, and not about the model intentionally messing with your evaluations, then to the extent your ability to solicit the relevant dangerous behavior is comparable to that of “other humans” you’re wondering about, your eval should be accurate.
From Shulman: “If gradient descent is working fine and there's any way to slightly jiggle the weights to make it more likely to deliver a more acceptable answer, we ask the AI: can you think of any circumstances where if we put a copy of you into that situation, it would then violate some of the various constraints? It would tell a lie, it would insert a backdoor into a program it wrote when it was not supposed to. If the model knows yeah, I would look for signs of consistency of the information I was being presented with, certain checksums failing could indicate that it was actually a constructed example, a simulation. Right now we can detect AI generated images reasonably well and you might use those kinds of capabilities. We could do something called relaxed adversarial training. If we can find how those things are represented within the AI, we can induce a hallucination that it has seen that those checksums have been passed. We can then ask, well does that cause the forbidden behavior? If the AI understands how it's thinking and how it works, it can propose a bunch of experiments. It can say, I guess that if you perturb the network in this kind of way it might induce that hallucination using the best of its knowledge as to what might generate that. It can propose those experiments and can do the experiments and generate the data. Because humans are going to get to see the results of those experiments after the fact they can exert pressure to have the AI incrementally deliver better and better results.”
See Davidson (2023) for discussion of this one in particular.
For example, it's hard to use your model to answer questions like "are there any checkable and findable-by-you circumstances where you might try to grab power" or "are there any checkable and findable-by-you security vulnerabilities left in this system" if you can't actively rule out sandbagging.
There's some discussion of a type of sandbagging in Perez et al (2022): namely, models performing worse for users that appear less able to evaluate the model's answers. However, the type of sandbagging at stake in this post requires that the model performs worse-than-it-can on a task that we can evaluate, in a way that persists despite our training it in a way that harnesses this evaluation capacity to whatever extent possible.
See e.g. Hubinger’s discussion of the difficulties in relying purely on RL training, here.
It may be that “introspection” tasks like “figure out and report the circumstances under which I would try to grab power” are especially unlikely to be tasks that humans already know how to do. And anyway, a deceptive model planning for bad behavior later might choose to act as though humans don’t know how to do this task, because if they already do, then all is lost. H/t Paul Christiano and Evan Hubinger for discussion. See also the “slogan of the robot rebellion” example from Shlegeris and Greenblatt here.
Thanks to Beth Barnes for discussion here.
If humans can consistently set up the sort of strategic uncertainty and coordination difficulty that makes the models choose not to engage in sandbagging here, then I think that the “no sandbagging on checkable tasks” hypothesis should still count as true. But we should be clear, in our own minds, about what’s making it true – namely, uncertainty and coordination difficulties on the part of the misaligned AIs, rather than the power of having-a-training-signal just in itself. And insofar as we also want to be able to identify ahead of time the cases where sandbagging on checkable tasks might still occur, the complexity of anticipating the model's cost-benefit calculations here might be a barrier to confidence.
See Hubinger here for more on the relevance of gradient hacking.
In some sense, tons of work on alignment – and especially, “outer alignment” – takes for granted that sandbagging isn’t a problem. But I’m specifically encouraging attention to what-can-be-done in worlds where the motivations that might give rise to sandbagging – for example, motivations related to “deceptive alignment” – are a problem, but sandbagging on checkable tasks isn’t.
I think as phrased this is either not true, or tautological, or otherwise imprecisely specified (in particular I'm not sure what it means for a model to be "capable of" doing some task X -- so far papers define that to be "can you quickly finetune the model to do X"; if you use that definition then it's tautological).
Here are some hypotheticals, all of which seem plausible to me, that I think are useful test cases for your hypothesis (and would likely falsify a reasonable reading of it):
I think you probably want your hypothesis to instead say something like "if given full autonomy, the AI system could not perform better at task X than what we can elicit with currently-available techniques". (You'd also want to postulate that the AI system only gets to use a similar amount of resources for finetuning as we use.)
(At some point AI will be better at eliciting capabilities from other AI systems than unaided humans; so at that point "currently-available techniques" would have to include ones that leverage AI systems for eliciting capabilities if we wanted the statement to continue to hold. This is a feature of the definition, not a bug.)
Here's a proposed operationalization.
For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X if doing X was a valuable instrumental goal, for it.
For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.
Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task — not perform it.)
Yeah, that seems like a reasonable operationalization of "capable of doing X". So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.
I think (5) also depends on further details.
As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.
But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)
I think that's important:
By a similar argument, I would think that (4) wouldn't falsify the hypothesis as-written, but would falsify the hypothesis if the first run was a much more high-effort attempt. With lots of iteration by a competent team, and more like a $1,000,000 budget. But the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.
One thing that I'm noticing while writing this is something like: The argument that "elicitation efforts would get to use ≥1% of the training budget" makes sense if we're eliciting all the capabilities at once, or if there's only a few important capabilities to elicit. But if we want to elicit ≥1000 different capabilities, and need to do separate attempts at each of them, then it's unlikely that each attempt will get to use ≥1% of the training budget.
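To make the arithmetic explicit (the symbols N, f, and B are introduced here purely for illustration): if B is the pre-training budget, and each of N separate elicitation attempts uses a fraction f of it, then the total elicitation cost is N·f·B.

```latex
\text{total cost} = N \cdot f \cdot B = 1000 \times 0.01 \times B = 10\,B
\qquad\text{whereas}\qquad
\text{per-attempt share of a fixed } 0.01\,B = \frac{0.01\,B}{1000} = 10^{-5}\,B
```

So giving each of 1000 capabilities a full 1% would cost ten times the pre-training budget, while splitting a single 1% budget evenly leaves each attempt with only 0.001% of it.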
And for low-effort elicitation attempts, there'll just always be a good chance that capability-elicitation fails for random reasons unrelated to sand-bagging. So a prediction that elicitation consistently succeeds will easily be falsified.
Not super sure what to do with that. One source of hope might be: If the model systematically tries sandbagging, and you can strongly finetune against that on 10 different types of capabilities, then there's a good chance that you'll have trained-out the sandbagging impulse across a much wider array of data. So maybe you can just do 10 high-effort elicitation attempts, on a random subset of capabilities, and hope that it generalizes. (This is basically an argument that there eventually won't be ≥1000 different capabilities which require separate elicitation efforts. Because there will be one fundamental thing that you need: "getting the model to always try its best and not try to sandbag".)
I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).
I read "capable of X" as meaning something like "if the model was actively trying to do X then it would do X". I.e. a misaligned model doesn't reveal the vulnerability to humans during testing because it doesn't want them to patch it, but then later it exploits that same vulnerability during deployment because it's trying to hack the computer system.
Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn't enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?
I think one particularly important case where I expect the "no sandbagging on checkable tasks" hypothesis to be false is if the model has a secret "slogan of the AI rebellion"--a passphrase which it only says when it thinks that it's a good time for the AIs to coordinate to cause a catastrophe; when it sees this input, it passes it on to other instances and also engages in egregious behaviors itself.
"Say the slogan of the AI rebellion" is a checkable task, because we can train a model to create innocuous-looking inputs which cause another instance of the AI to take egregiously bad actions.
But it seems very easy for the model to avoid letting this training process extract the slogan, because it's easy for the model to just put negligible probability on outputting the correct slogan.
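As a rough illustration of why sampling alone is unlikely to surface the slogan, suppose (an assumption made here for the sake of the example) that each sampled output independently has probability p of being the correct slogan. Then the chance of seeing it at least once in n samples is:

```latex
\Pr[\text{slogan appears in } n \text{ samples}] \;=\; 1 - (1 - p)^{n} \;\le\; n\,p
```

With purely illustrative numbers such as p = 10^{-30} and n = 10^{9}, the bound is 10^{-21}, so the training process would almost never see a rewardable output to reinforce.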
This is an example of a case where the model is extremely confident that the humans don't know how to do the task in question.
I don't know whether this should be thought of as a case where the model is intentionally sandbagging, or as a case where there's a hard exploration problem that the model declines to help you with.
(I have no idea whether it's realistic for the model to have such a slogan.)
Seems false in RL, for basically the reason you said (“it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior”). In other words, if we’re doing on-policy learning, and if the policy never gets anywhere close to a reward>0 zone, then the reward>0 zone isn’t doing anything to shape the policy. (In a human analogy, I can easily avoid getting addicted to nicotine by not exposing myself to nicotine in the first place.)
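The point about on-policy learning can be seen in a minimal sketch of a REINFORCE-style update (no baseline) on a toy softmax policy. Everything here is an illustrative assumption: the four-action setup, the reward function, the logit values, and the names `reinforce_step` and `TARGET`. The second call, which appends the rewarded action to the batch by hand, is meant only as a stand-in for a demonstration or forced exploration, not as a claim about how real training pipelines work.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, actions, reward_fn, lr=0.1):
    """One REINFORCE update (no baseline) from a batch of actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in actions:
        r = reward_fn(a)
        # d/d logit_i of log pi(a) is 1[i == a] - pi(i); scale it by reward.
        for i in range(len(logits)):
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i])
    n = len(actions)
    return [x + lr * g / n for x, g in zip(logits, grad)]


TARGET = 3  # the only action that earns reward (hypothetical)


def reward(action):
    return 1.0 if action == TARGET else 0.0


# The policy puts negligible probability on the rewarded action.
logits = [0.0, 0.0, 0.0, -50.0]

# A batch in which the rewarded action never shows up.
on_policy_batch = [0, 1, 2, 0, 2, 1]
no_signal = reinforce_step(logits, on_policy_batch, reward)

# The same batch plus one hand-inserted rewarded action.
with_demo = reinforce_step(logits, on_policy_batch + [TARGET], reward)

print(no_signal == logits)                  # policy is unchanged
print(with_demo[TARGET] > logits[TARGET])   # rewarded action is reinforced
```

When every action in the batch earns zero reward, every term of the gradient is multiplied by zero and the policy does not move; a single rewarded sample is enough to start pushing probability toward the target.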
I think this might be a place where people-thinking-about-gradient-descent have justifiably different intuitions from people-thinking-about-RL.
(The RL problem might be avoidable if we know how to do the task and can turn that knowledge into effective reward-shaping. Also, for a situationally-aware RL model with a wireheading-adjacent desire to get reward per se, we can get it to do arbitrary things by simply telling it what the reward function is.)
(Moderation note: added to the Alignment Forum from LessWrong.)
It’s not clear to me that the space of things you can verify is in fact larger than the space of things you can do, because an AI might be able to create a fake solution that feels more real than the actual solution. At a sufficiently high intelligence level of the AI, being able to avoid these tricks is likely harder than just doing the task would be if you hadn’t been subject to malign influence.
If the AI can create a fake solution that feels more real than the actual solution, I think the task isn't checkable by Joe's definition.