CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Do we think there is a non-negligible chance that an AI that doesn't have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rouge deployment?
Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don't think we're either fine if the model doesn't have it or doomed if it does.
Project proposal: No-holds-barred jailbreak attacks and defenses for control
Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?
The blue team can try a variety of strategies here.
(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)
When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.
This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.
We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:
We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:
The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.
And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.
Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)
There are many potential options here.
My guess is that this project would actually be much better with more complicated programming problems than APPS.
I also asked it whether it consented to being a coauthor, and it said yes.
It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'
Those sentences are talking in the context of defer to resample, where I think it's actually reasonable to think the model won't see the resampled actions.
I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?
I'm specifically talking about not showing it actions that were resampled. You're totally right that they'll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second guess it, I agree it's wrong.
Yep! You might have to show it CoT from previous, non-resampled steps though
It would be crazy to hide that CoT from the monitor.
Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?
Yep.
If it can tell that, how come it can't guess it specifically would have produced A?
I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to.
This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic’s “Sabotage Evaluations”, Mathew et al “Hidden in Plain Text”; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow up paper that should be out soon), and AI control has been framed by (Grosse, Buhl, Balesni, Clymer) as a bedrock approach for mitigating catastrophic risk from misaligned AI.
My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried.
I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret.
I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.