Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Wiki Contributions

Comments

Sorted by

I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to.

This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic’s “Sabotage Evaluations”, Mathew et al “Hidden in Plain Text”; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow up paper that should be out soon), and AI control has been framed by (GrosseBuhlBalesniClymer) as a bedrock approach for mitigating catastrophic risk from misaligned AI. 

My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried.

I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret.

I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.

Do we think there is a non-negligible chance that an AI that doesn't have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rouge deployment?

Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don't think we're either fine if the model doesn't have it or doomed if it does.

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

  • The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
  • There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
  • Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.

Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.


This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.

The simplest such setting I can think of

We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:

  • Get a dataset of backdoor attempts.
    • For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
    • It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
    • The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
  • Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)

We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:

  • Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
  • That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.

The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.

And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.

Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)

Techniques

  • Get the untrusted model to reimplement the function, compare the implementations.
  • Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
  • Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.

There are many potential options here.

My guess is that this project would actually be much better with more complicated programming problems than APPS.

I also asked it whether it consented to being a coauthor, and it said yes.

It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

Those sentences are talking in the context of defer to resample, where I think it's actually reasonable to think the model won't see the resampled actions.

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?

I'm specifically talking about not showing it actions that were resampled. You're totally right that they'll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).

Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.

Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second guess it, I agree it's wrong.

Yep! You might have to show it CoT from previous, non-resampled steps though

It would be crazy to hide that CoT from the monitor.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?

Yep.

If it can tell that, how come it can't guess it specifically would have produced A?

I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").

Load More