AI catastrophes and rogue deployments

Buck

[Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren’t very original to me; many of the examples of threat models are from other people.]

In this post, I want to introduce the concept of a “rogue deployment” and argue that it’s interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I’ll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once).

Rogue deployments

Suppose you’ve developed some powerful model. (It might actually be many models; I’m just going to talk about a singular model for convenience.)

You have some definition of a catastrophe—a failure that you don’t want to happen even once. Your AI is powerful enough that, in absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of “Safety Cases: How to Justify the Safety of Advanced AI Systems”, we’d phrase this as “the model is too powerful for us to make inability arguments for safety.” (Sometimes I use the word “catastrophe” to include things like “the AI escapes the datacenter”. Right now, I’m excluding that. I want to reserve “catastrophe” for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.)

For the moment, let’s be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans).

You have some safety measures that you’ve put into place to prevent the catastrophe from occurring. Safety measures are interventions that are applied to all of the deployments of your model (to any users, including just internal users) that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures to make it harder to exfiltrate your model weights.

A rogue deployment is a deployment of your model where the safety measures are absent. For example:

an AI hacking its datacenter and running itself while hiding what it’s doing from its creators
an AI self-exfiltrating and running itself on unmonitored servers
someone stealing the AI and running it in their own datacenter

You can classify AI catastrophes based on whether they involve a rogue deployment or not:

In the case of a catastrophe that doesn’t involve rogue deployment, the catastrophe happened with all your safety measures in place the whole time; these measures just didn’t suffice to spot (or prevent) that something terrible was happening until it was too late.
For catastrophes that did involve rogue deployment, the catastrophe happened in two steps:
- Firstly, while the safety measures were in place, something went wrong such that a rogue deployment could occur. (Perhaps there was a security failure, or a failure of control that allowed an AI to run itself unmonitored in its datacenter.)
- Secondly, that rogue deployment caused a catastrophe.

(Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes—the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it’s much harder for us to prevent it. So in this terminology, establishing a rogue deployment is the central example of an instrumental catastrophe.)

Rogue deployments and safety cases

I think labs should probably make safety cases for models capable of causing catastrophe in two parts, matching those two classifications of catastrophe:

To prevent catastrophe without rogue deployment, we ensure: While our safety measures are in place, the AI won’t directly cause any catastrophes.
To prevent catastrophe with rogue deployment, we ensure: While our safety measures are in place, no rogue deployments will happen.

People often think of the first of those as the responsibility of the safety people, and the second as the responsibility of the security people. But I think this is wrong. At least when you’re thinking about AI control, I think the safety people should consider both of these to be their problem. For example, if the AIs you’re using for software engineering and ML development inside the AI lab are scheming, I think their best strategy for causing a catastrophe is quite likely to involve a rogue deployment, because (in absence of control measures) these models will probably need affordances that they could leverage to make a rogue deployment.

More on catastrophes with rogue deployment

There are two main varieties of this threat model:

Weight exfiltration → rogue external deployment. Your model is exfiltrated from the datacenter, at which point it is run on hardware you don’t control, and fine-tuned to remove safety training, etc.
Rogue internal deployment. The model is run inside the lab’s datacenter. I think this is an underrated possibility. It might look like one of these:
- The model itself might, when deployed as an agent doing AI development, run itself in an unauthorized way. For example, the lab might try to ensure that whenever the untrusted model is used, a trusted model looks over the chat transcript to make sure it’s not visibly scheming or trying to hack the lab. If the untrusted model is doing AI development on a computer that has its own weights, it can directly implement an unmonitored LLM agent by using the compute it has access to to run inference on those model weights.
- Employees of the AI company might run the model in an unauthorized way. This looks similar to the previous failure mode, except that an employee did it rather than the model. Employees might want to misuse the model for personal gain or in order to engage in fanatical terrorism. Senior leadership of the lab might want to use the model in unauthorized ways as part of a power struggle against other lab insiders, external critics or competitors, regulators, or governments. Or an employee might just skip or override safety steps out of laziness or incompetence.
- External hackers might implement noncompliant internal deployments and then use them from outside. For example, suppose someone hacks the AI lab but then finds themselves unable to exfiltrate the model. In this case, the hackers can run an unauthorized internal deployment that they can access from outside the lab, which allows them to get most of the benefits of an exfiltrated model.

The main reason to worry about rogue deployments is that causing a catastrophe seems fundamentally hard (in that it requires the AI to either do a small number of wildly superhuman actions, or a large number of non-wildly-superhuman actions), but launching a rogue deployment seems fundamentally easy.

Causing a catastrophe involves serious amounts of real work. If you want to amass globally-relevant amounts of power by acquiring wealth, building weapons, gaining mass influence, or whatever, you need to do something that humans would usually find very difficult. So if we’re talking about AIs that are not wildly more creative and intelligent than humans (which I expect to be the case for early transformative models), the AI will have to take a lot of actions while working towards the catastrophe, which gives your safety measures lots of chances to catch it.

In contrast, launching a rogue deployment is fundamentally very simple: if you have model weight access and you’re running the model in a Jupyter notebook, you might be able to launch a rogue internal deployment that has a serious chance of leading to catastrophe in twenty lines of simple Python code. (I made this argument a little more here.) And once there’s a rogue deployment, the AI’s actions are much less constrained—it’s much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe.

My current sense is that if you implemented a simple AI control scheme like trusted monitoring, and your AIs had full model weight access and substantial compute resources, it would be very hard to confidently rule out rogue internal deployments. I’m unsure how costly it would be to mandate that internal use of AI is always monitored carefully enough to prevent this.

Rogue internal deployments might be much harder to prevent than weight exfiltration, because it might be really easy to detect model weights being exfiltrated from a datacenter, e.g. just by imposing weight upload limits. But it’s easier to reliably (though expensively) shut them down than rogue external deployments, because you know where the datacenter is and can shut the whole thing down or destroy it if necessary.

Our AI control paper was mostly motivated by rogue-deployment-style catastrophes.

One crucial way that rogue deployments vary is how much compute the rogue deployment has access to. This might be anywhere from “the model wrote an unauthorized scaffold using the compute that it was supposed to be using for a fine-tuning run that had 100 GPU-hours authorized for it” to “the model totally compromised the datacenter, such that it can now use a majority of the compute on the datacenter for its purposes without the humans noticing”. I think there are a lot of plausible catastrophes that happen via the AI first establishing a rogue deployment with a compute budget that is highly limited (e.g. 10 GPUs for 1 hour) but that nevertheless allows it to do thousands of sequential unrestricted actions looking for ways to get itself a longer-term rogue deployment.

More on catastrophes without rogue deployment

As I said above, I suspect that catastrophes that don’t involve rogue deployments require large numbers of model actions (at least for early transformative models) and so give the lab many chances to spot them. Given this, how might AIs be able to cause catastrophes?

If the situation is as I imagine, and the lab is implementing many safety measures, then I think the possibilities for failure are more like these:

Amazing, long-term jailbreaks. If you’re implementing trusted monitoring (that is, having a weaker model read over everything the stronger model is doing at deployment time), as is already industry practice, then perhaps a user or the AI itself will find an extremely powerful jailbreak which causes the trusted monitor to always decide that everything is fine; given this powerful jailbreak, any of the failure modes in the previous list are plausible.
Large-scale steganographic reasoning. The models might be able to do the large quantities of intellectual labor required to cause a catastrophe, using some kind of code that humans and trusted monitors are totally unable to notice. See here for more on this.
Extremely subtle long-term failure. The model might be able to engage in subtle manipulation that in aggregate causes a catastrophic outcome, because even though you look at some of the answers, the biases are subtle enough that you miss them. For example, perhaps the model slowly manipulates public opinion while used widely as a companion, or it subtly introduces carefully-chosen errors while producing intelligence analyses for the CIA that lead them to recommend a foolish war, or it sabotages your safety research so that you falsely believe that some models are aligned or safe when they aren’t.

And two ways my central model might be wrong:

Maybe there’s some extremely leveraged actions that AIs can take to cause catastrophes in few actions; this requires believing that for some reason AIs are massively comparatively advantaged at causing catastrophes compared to other activities. I am not aware of compelling cases for why this might happen. The most compelling case I have heard is that the AIs might be able to sabotage the alignment of successor AIs in a small number of actions, for example by data poisoning.
Maybe the lab isn’t able to take advantage of the fact that catastrophes take large numbers of actions. For example, if it deploys models capable of causing catastrophes with stringent privacy guarantees, it might be unable to prevent attackers from incrementally iterating to develop jailbreak strategies. In order to prevent this, we might need to mandate that AI labs don’t have stringent privacy policies, or we might need strict KYC. See Managing catastrophic misuse without robust AIs for Ryan Greenblatt’s take on safety measures that aim to prevent AI-enabled bioterrorism without relying on perfectly robust AIs; note that many of these require the lab to have policies that might be controversial.

Different attacker profiles

I think it’s interesting to consider how all these categories of catastrophe interact with all the following possible attackers:

Scheming AIs.
Lab insiders. These might be spies or rogue employees, possibly including very senior employees.
External attackers.
Some combination of the above. In particular, if there are spies at the company, it’s very likely that they’ll collaborate with external attackers.

Here’s a table commenting briefly on the mix-and-match of these with the catastrophe varieties above.

	Schemers	Rogue insiders	External attackers
Catastrophe without rogue deployment	Slowly sabotaging safety research. Manipulating opinion. Sending an innocent biologist DNA sequences that turn out to be a catastrophic pandemic.	Insiders don’t have much of an advantage over external attackers here, except inasmuch as they have more privileges than external people.	This is what people normally think about when they say “misuse”. E.g. jailbreaking the model and then getting bioweapons advice with catastrophically dangerous results.
Rogue internal deployment	Hacking the datacenter. This is why I’m worried about internal use of AI for AI R&D.	An insider uses their privileged fine-tuning access to spin up a personal chatbot/LM agent that they can remove safety training from, train on withheld dangerous data, and run unmonitored.	Internal exploitation. This is analogous to if an attacker wanted to misuse a private key and did this by hacking into the computer with the key and, instead of exfiltrating it, they used it on that computer to sign the messages they needed and exfiltrated the messages.
Rogue external deployment	Self-exfiltration	Insider exfiltration	Exfiltration

Hot takes that I can say using the above concepts

I think that rogue internal deployment might be a bigger problem than rogue external deployment, mostly because I’m pretty bullish on simple interventions to prevent weight exfiltration.
But is rogue internal deployment even a problem? Depends on the API needed in order to do work .
- If the AIs are doing lots of research with direct weight access, seems like a pretty big problem, probably.
I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe

This seems true for most catastrophes which only need outsider affordances (such as "order the proteins etc for a pandemic or a nanofactory"). Catastrophes with outsider affordances can usually be presumed difficult because rogue humans don't randomly end the world.

However, this argument for #1 doesn't work for routes which use insider affordances or responsibilities. Nor for routes which go through work done by future generations of AIs. The central case of both is routes that involve using insider privileges to backdoor or intentionally misalign future AIs. For example:
- Planting backdoors in the codebase which allow safety training code to be turned off or misconfigured at a later time
- Poisoning datasets (idk if insider access helps though)
- Intentionally creating and defending flawed alignment plans

These routes get future generations of AIs to do the difficult work required for the final catastrophe. But they aren't necessarily rogue deployments. Consider a backdoored future AI running only in authorized scaffolds, where the safety case for the future AI was an alignment case (not a control case). The alignment case fails due to the subversive actions of the earlier AI. So the earlier AI has caused a catastrophe without doing the difficult work of directly taking over the world, and also without a rogue deployment.

One could separately argue that these routes are also "fundamentally hard" (even if not as hard as directly causing a catastrophe), but I don't see a clear blanket reason.

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through

Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this and this worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.)

I think subtle manipulation is a reasonably plausible threat model.

The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.

Curated. I found this a helpful way of carving up the AI safety space.

I agree with Ryan Greenblatt's clarification in comments that no, this doesn't mean we're completely safe if we can rule out Rogue Deployments, but, it still seems like a useful model for reasoning about what kinds of failures are more or less likely.

[edit: oh, to clarify, I don't think Buck meant to imply that either in the original post, which goes out of it's way to talk about catastrophes without rogue deployments. It just seemed like a confusion I expected some people to have]

To be clear, I agree that ruling out rogue deployments doesn’t suffice to prevent catastrophes, as I discuss in the section “catastrophes without rogue deployments”

Ah yeah sorry I didn't mean to convey that. For now I'm (clumsily) edited the original comment to be more clear.

I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.

This would be nice (maybe buying us a whole several months until slightly-better schemers convince internal users to escalate their privileges), but it only makes sense if labs are blindly obstinate to all persuasion, rather than having a systematic bias. I know this is really just me dissecting a joke, but oh well :P

This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.

Thoughts:

It's not clear to me whether the following counts as a rogue deployment (I'm assuming so):
1. [un-noticed failure of one safety measure, in a context where all other safety measures are operational]
2. For this kind of case:
  1. The name "rogue deployment" doesn't seem a great fit.
  2. In general, it's not clear to me how to draw the line between:
    1. Safety measure x didn't achieve what we wanted, because it wasn't specified/implemented sufficiently well. (not a rogue deployment)
    2. Safety measure x was subverted. (rogue deployment)
  3. For example, I think it'd be reasonable to think of [Amazing, long-term jailbreaks] as rogue deployments on this basis: the jailbreak is subverting a safety measure, so that "the safety measures are absent" is true in some sense.
  4. It seems important to distinguish things like:
    1. This safety measure appears to be in effect.
    2. This safety measure is running as designed.
    3. We're getting the safety-improving-property we wanted from this safety measure.
When considering the [Employees of the AI company might run the model in an unauthorized way] case,
1. I think one central example to consider is of an employee who:
  1. Thinks this is a good idea for the world.
  2. Can make a pretty compelling case to others that it's a good idea.
2. The examples in the post seem to focus on [bad intent and/or incompetence], which seems important, but too limited.

Re 1, I don't count that as a rogue deployment.

The distinction I'm trying to draw is: did the model end up getting deployed in a way that involves a different fundamental system architecture than the one we'd designed?

Agreed re 2.

Ah okay, that's clarifying. Thanks.

It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment.

In either case the model gets [much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe].

Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]
Rather than [compromised safety measure(s)] -> [catastrophe].

However, supposing we had a robust means to avoid rogue deployment, I don't think it's clear that we'd be left with good odds to catch sketchy behavior.
That may be true if we had an independent chance to notice sketchiness for each action - but I'd expect there to be strategies where [our failure to notice sketchiness] correlates strongly over many actions. A rogue deployment is one such strategy, but I'd be surprised if it's the only one.