I'm worried about AI models intentionally doing bad things, like sandbagging when doing safety research. In the regime where the AI has to do many of these bad actions in order to cause an unacceptable outcome, we have some hope of identifying examples of the AI doing the bad action (or at least of having some signal for distinguishing bad actions from good ones). Given such a signal we could:

Directly train the policy to not perform bad actions.
Train activation monitors to detect bad actions. These monitors could be used in a variety of ways such as:
1. Rejecting actions flagged as bad, and then either resampling a response from the same policy or generating one with some different process.
2. Looking into bad actions further, to try to incriminate the AI.

Incrimination-based strategies are out of scope for this post. I'll be solely interested in the question: what sample efficiency should we expect when training policies and monitors to remove bad behavior? This theoretical analysis isn't supposed to replace empirical research on the matter, although I hope that it can generate some guiding intuition and good hypotheses to test.

Learning theory basic background

Let's start with the following problem setup:

We have a function $f$ from some function family $\mathcal{F}$ of VC dimension $d$.
- The VC dimension of a function family measures how "expressive/complex" the family is. Formally, it's the largest number $d$ such that there's some set $S$ of $d$ points so that for any $2$-coloring $c$ of $S$, there exists a function $f \in \mathcal{F}$ such that $f(x) = c(x)$ for all $x \in S$. As an example, the family of "union of $k$ intervals" has VC dimension $2k$.
We receive $N$ labelled samples $(x_1, f(x_1)), \ldots, (x_N, f(x_N))$, with the $x_i$ drawn independently from some distribution $\mathcal{D}$.
We'd like to output a function $g$ to minimize $\Pr_{x \sim \mathcal{D}}[g(x) \neq f(x)]$; this quantity that we're trying to minimize is called our error. If it helps, you could think of $g$ as an AI model that must make a binary decision (e.g., maybe $g$ is a monitor flagging actions as bad or not).
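To make the VC dimension definition concrete, here is a small brute-force sketch for the "union of intervals" example. It relies on one observation that is an assumption of this sketch rather than something stated above: a 0/1 labelling of sorted points on a line can be produced by a union of at most `k` intervals exactly when its 1s form at most `k` contiguous blocks. The function names are illustrative.

```python
from itertools import product

def count_runs_of_ones(labels):
    """Number of maximal blocks of consecutive 1s in a 0/1 sequence."""
    runs, prev = 0, 0
    for y in labels:
        if y == 1 and prev == 0:
            runs += 1
        prev = y
    return runs

def is_shattered(num_points, k):
    """True if num_points distinct points on a line are shattered by unions of <= k intervals.

    Only the number of points matters: a labelling of the sorted points is
    realizable iff its 1s form at most k blocks.
    """
    return all(count_runs_of_ones(labels) <= k
               for labels in product((0, 1), repeat=num_points))

def vc_dimension(k, max_points=12):
    """Largest number of points (up to max_points) that can be shattered."""
    d = 0
    for m in range(1, max_points + 1):
        if is_shattered(m, k):
            d = m
    return d
```

For small `k` this recovers the $2k$ figure: `2k` points can be shattered, but the alternating labelling of `2k + 1` points needs `k + 1` blocks.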

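The Fundamental Theorem of Statistical Learning states that it's possible to achieve error at most $\tilde{O}(d/N)$, where $d$ is the VC dimension, $N$ is the number of samples, and $\tilde{O}$ hides logarithmic factors.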

There's a related version of the problem, where the labels are noisy. In the noisy case, there is some amount of irreducible error that any predictor will incur. However, the extra expected error beyond the irreducible error is at most $\tilde{O}(\sqrt{d/N})$ in the noisy case, where $d$ is the VC dimension and $N$ is the number of samples.

SGD doesn't do optimal Bayesian learning, but these learning theory results seem like a reasonable first guess for sample efficiency of SGD.
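As a rough way to get a feel for these rates, the sketch below evaluates the order-of-magnitude scalings $d/N$ (noiseless labels) and $\sqrt{d/N}$ (noisy labels). It deliberately drops all constants and logarithmic factors, so the numbers are only indicative; the function names are illustrative.

```python
import math

def realizable_error_scale(d, n):
    """Order-of-magnitude error d/n for noiseless labels (constants and logs dropped)."""
    return min(1.0, d / n)

def noisy_excess_error_scale(d, n):
    """Order-of-magnitude excess error sqrt(d/n) for noisy labels (constants and logs dropped)."""
    return min(1.0, math.sqrt(d / n))

def samples_needed(d, target_error, noisy):
    """Roughly how many samples make the corresponding scale fall to target_error."""
    if noisy:
        return math.ceil(d / target_error ** 2)
    return math.ceil(d / target_error)
```

Under these simplifications, reaching 1% error with $d = 100$ takes on the order of $10^4$ samples in the noiseless case but on the order of $10^6$ in the noisy case.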

Test case: Half space

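As an example, let's apply this theorem to the problem of learning a half-space $f(x) = \mathbb{1}[w \cdot x + b \geq 0]$. The VC dimension of a half space in $\mathbb{R}^d$ is $d + 1$. Thus, we predict that our error should scale as $\tilde{O}(d/N)$, where $N$ is the number of samples.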

However, the error appears to scale more like $1/\sqrt{N}$, where $N$ is the number of training samples. My current best guess is that this is actually the right answer for how well SGD does.

We should be much more sample efficient at learning points which aren't just separated by a half-space, but by a half-space plus margin. In particular, error should go to zero in $O(1)$ samples (this constant depends on the margin, but not the dimension).
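A half-space experiment of this kind can be sketched in a few lines of pure Python. This is not the setup used for the results above: it substitutes a single online pass of the perceptron rule (SGD on the perceptron loss) for whatever optimizer was actually used, uses a half-space through the origin with standard Gaussian inputs, and all names and default values are hypothetical. Setting `margin > 0` discards points too close to the true boundary, which gives the "half-space plus margin" variant.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_point(rng, w_star, margin):
    """Draw x ~ N(0, I), rejecting points closer than `margin` to the true boundary."""
    norm = math.sqrt(dot(w_star, w_star))
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in w_star]
        if abs(dot(w_star, x)) / norm >= margin:
            return x

def label(w, x):
    """Half-space through the origin: +1 on one side, -1 on the other."""
    return 1 if dot(w, x) >= 0 else -1

def train_perceptron(rng, w_star, n, margin):
    """One online pass over n fresh labelled samples, updating only on mistakes."""
    w = [0.0] * len(w_star)
    for _ in range(n):
        x = sample_point(rng, w_star, margin)
        y = label(w_star, x)
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def held_out_error(rng, w_star, w, margin, n_test=2000):
    """Fraction of fresh points on which w disagrees with the true half-space."""
    wrong = 0
    for _ in range(n_test):
        x = sample_point(rng, w_star, margin)
        wrong += label(w_star, x) != label(w, x)
    return wrong / n_test

def error_curve(d=5, sizes=(10, 100, 1000), margin=0.0, seed=0):
    """Map each training-set size to the held-out error of a freshly trained perceptron."""
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return {n: held_out_error(rng, w_star, train_perceptron(rng, w_star, n, margin), margin)
            for n in sizes}
```

Comparing `error_curve(margin=0.0)` with `error_curve(margin=1.0)` on a log-log plot is one way to check how the error falls off with the number of samples in each regime.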

More on gradient descent

Based on the halfspace experiment and the VC theory theorems, I feel pretty good about using $1/\sqrt{N}$ as a default model of how fast SGD will learn things.

There's some good theoretical justification for the $1/\sqrt{N}$ rate from convex optimization theory too. (While we're generally not interested in convex functions, this hopefully isn't a big issue; see the discussion here by Paul.) Specifically, the result of interest is:

Suppose $f_1, f_2, \ldots, f_T$ is some sequence of convex functions, unknown in advance.
You must choose model parameters $\theta_t$ at time $t$, and then $f_t$ is revealed and you incur loss $f_t(\theta_t)$.
If you choose $\theta_t$ via online SGD then your total regret (the sum of your losses, minus the sum of the losses you'd have incurred by fixing the optimal $\theta^*$ in advance) is at most $O(\sqrt{T})$, assuming a bounded domain and $O(1)$-Lipschitz functions.
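The protocol above can be sketched in one dimension. The choices here are assumptions made for illustration: the domain is the interval $[-1, 1]$ (diameter $D = 2$), the losses are $f_t(\theta) = |\theta - z_t|$ (so they are $1$-Lipschitz, $G = 1$), and the step size is $D / (G\sqrt{t})$. With that step size the standard analysis of projected online gradient descent gives regret at most $\tfrac{3}{2} G D \sqrt{T}$.

```python
import math
import random
import statistics

def online_gradient_descent(targets, lo=-1.0, hi=1.0):
    """Play theta_t, then observe f_t(theta) = |theta - z_t|; return the losses incurred."""
    diameter = hi - lo      # D: size of the bounded domain
    lipschitz = 1.0         # G: |theta - z| has subgradients in [-1, 1]
    theta = 0.0
    losses = []
    for t, z in enumerate(targets, start=1):
        losses.append(abs(theta - z))                   # loss is revealed after committing
        grad = (theta > z) - (theta < z)                # a subgradient of |theta - z|
        step = diameter / (lipschitz * math.sqrt(t))    # decaying step size
        theta = min(hi, max(lo, theta - step * grad))   # gradient step, then project
    return losses

def regret(targets):
    """Total loss minus the loss of the best fixed theta in hindsight."""
    losses = online_gradient_descent(targets)
    best = statistics.median(targets)   # a median minimizes the sum of absolute deviations
    return sum(losses) - sum(abs(best - z) for z in targets)

def demo_regret(T=1000, seed=0):
    """Regret on T uniformly random targets in [-1, 1]."""
    rng = random.Random(seed)
    return regret([rng.uniform(-1.0, 1.0) for _ in range(T)])
```

The bound holds for any target sequence in $[-1, 1]$, including ones that shift partway through, which is the sense in which the result tolerates distribution shift.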

If we consider the case with a fixed function $f$, then error falling off like $1/\sqrt{N}$ matches regret growing as $\sqrt{N}$, because $\sum_{n=1}^{N} 1/\sqrt{n} = \Theta(\sqrt{N})$. The online convex optimization (OCO) result is extra nice though because it applies to situations with distribution shift, unlike the previous learning theoretic results which were about learning static functions.

It's worth noting that in cases where the AI's performance is stochastic, it's not really meaningful to talk about (untargeted) error smaller than $1/\sqrt{N}$, because there is typically a deviation of order $\sqrt{N}$ in the AI's total performance over $N$ samples between different trials.

Removing intentional bad behavior might be easier than this

I think that training against intentional bad behavior might go substantially better than the "$\sqrt{N}$ bad actions" bound from the learning problems. I'll now list five reasons for my optimism.

1: It's an elicitation problem, not a learning problem

I'm optimistic because training against intentional bad behavior is an elicitation problem, not a learning problem (although some bad behavior, like exploration hacking, may cause the model to learn less than it otherwise would).

The model already knows how to not sabotage us. The problem (probably) isn't teaching the AI what we want, the problem is that the AI wants different things from us. Thus, we might hope that these cognitions which are leading to low reward (I'll assume a perfect reward signal for now) will be rapidly removed by SGD, because there's a clear nearby policy which doesn't sabotage us as much. Namely, a policy that's more conservative about attacking us, or one that doesn't attack at all. It seems a priori likely that "sabotage intent" will be cleanly represented in the model's activations, which should help training latch onto this and remove (or detect) it.

Furthermore, empirical work studying elicitation found extremely high sample efficiency.

2: We can stack activation monitors and policy training

If we train both the policy and a monitor (and then we rejection sample against the monitor), and their "errors" behave independently (a dubious assumption, but we could hope that the errors are at least somewhat decorrelated, and maybe grant this for now) and are both small, then we'd essentially get to multiply their errors. If you could obtain multiple decorrelated monitors, then you could ensemble these to get an extremely powerful monitor. Ensembling again morally multiplies the error probabilities.
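Under the independence assumption granted above, the arithmetic is just a product. The sketch below is illustrative only: the function name and the example rates are hypothetical, and it models an ensemble that rejects an action whenever any monitor flags it (which also raises the false positive rate, something this sketch ignores).

```python
def residual_bad_rate(policy_bad_rate, monitor_miss_rates):
    """Rate of bad actions that slip through, assuming all errors are independent.

    policy_bad_rate: fraction of the trained policy's actions that are bad.
    monitor_miss_rates: for each monitor, the fraction of bad actions it fails to flag.
    """
    rate = policy_bad_rate
    for miss in monitor_miss_rates:
        rate *= miss   # a bad action survives only if this monitor also misses it
    return rate
```

For example, a policy with a 1% bad-action rate and a monitor that misses 10% of bad actions would leave a residual rate of 0.1%, and a second independent monitor of the same quality would bring it to 0.01%. Correlated errors would make the true figure worse than this product.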

3: Weak to strong generalization

It's likely that our reward signal will be broken in some systematic ways. For instance, maybe there's one type of sabotage behavior that we are pretty bad at detecting. However, if it's hard to model the flaws in our reward signal, then the best fit to the data mapping activations to a "suspiciousness score" might just always predict a high suspiciousness score for sabotage behavior via weak to strong generalization, if sabotage is saliently represented.

4: Making more data

We could augment examples we have found where the AI does bad things.

5: Does the AI get to choose where its sabotage actions are?

It might be the case that, even though the AI could do a moderate amount of sabotage over the course of the deployment, there are some times which are much better sabotage opportunities than others. In such cases, it makes more sense to talk about how many bad actions the AI can do if the AI wants to choose where to do the bad actions. This might be much less than the original amount.

Reasons why I'm especially excited about "activation monitors"

We might have lots of examples of the AI doing good things, and maybe anomaly detection techniques would suffice to identify sabotage.
It seems like it's pretty hard to train AI models to not answer harmful queries (e.g., AI developers have failed to make AI models that don't answer harmful biology queries). However, Anthropic found that training output classifiers was pretty effective at detecting harmful queries (at which point you can block the model from responding, or ban the user). Intuitively, monitors are much less general purpose than policies, so they can be more specialized to a particular task, which might be helpful for focussing training.
Activation monitors only need to output a single token, instead of a long trajectory. (Although it might be desirable to make activation monitors simple neural networks instead of just linear probes, in which case it's kind of like they're outputting multiple tokens.)
Rejection sampling is a fairly weak form of selection pressure. So, even if training the AI against bad behavior causes sabotage intent to not be saliently represented, you might hope that rejection sampling against bad behavior doesn't have this property.
"Gradient hacking" seems much less likely to be an issue in activation monitors, especially if they are required to answer quickly.

Removing intentional bad behavior might be hard

Now I'll summarize some reasons why you might still worry that removing intentional bad behavior will be challenging.

1: Training against bad behavior means you get fewer examples of bad behavior, slowing down learning

Here's one model you might have of how training against bad behavior works:

If you train against $N$ examples of bad behavior, then the rate of bad behavior becomes $r(N)$.

Given this model we can ask: given a deployment of the AI of length $N$, where at each step of the deployment we sample an action from the AI, what will the AI's final rate of bad behavior be by the end of the deployment? It's not $r(N)$, because we'll probably have far fewer than $N$ examples of bad behavior. For instance, if $r(N) = 1/\sqrt{N}$, then it turns out that the AI's final attack rate would be $\Theta(N^{-1/3})$ in this setup: the number $n$ of bad examples seen so far grows as $dn/dt = 1/\sqrt{n}$, so $n \propto t^{2/3}$ and the rate is $1/\sqrt{n} \propto t^{-1/3}$. Note that this issue doesn't arise if you're only training a monitor and doing rejection sampling (and not training the policy).
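This feedback loop is easy to simulate. The sketch below assumes that after $n$ bad examples have been caught and trained against, the policy's bad-behavior rate is $1/\sqrt{n}$, that every bad action is caught, and that counts can be tracked in expectation rather than sampled; the function name and the starting count of one example are hypothetical conveniences.

```python
def final_bad_rate(deployment_length, rate=lambda n: n ** -0.5):
    """Bad-behavior rate at the end of a deployment when training happens online.

    rate(n) is the assumed bad-behavior rate after training against n bad examples.
    """
    n_bad = 1.0   # expected bad examples caught so far (start at 1 to avoid dividing by zero)
    for _ in range(deployment_length):
        n_bad += min(1.0, rate(n_bad))   # expected number of new bad examples this step
    return min(1.0, rate(n_bad))
```

Solving $dn/dt = n^{-1/2}$ gives $n = (1.5\,t)^{2/3}$, so under these assumptions the simulated final rate should sit near $(1.5\,T)^{-1/3}$ for a deployment of length $T$, well above the $T^{-1/2}$ you would get from $T$ bad examples.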

It's possible that we could also get around this issue via the (admittedly expensive) strategy of deploying both a trained and an untrained copy of the policy.

2: Probes are brittle

Probes are easy to break. Training might also incentivize AI models to not saliently represent sabotage intent in their activations, which might make probes not work. For instance, in RL with a value head, when the head is computing the value it gets to see whether the activations say that sabotage is going to happen, in which case it would assign low value; this may incentivize the policy network to not represent sabotage intent clearly in its activations.

If scheming is a constant property of the AI model's cognition, then it's possible that inputs on which the AI is taking bad actions don't really look different to a probe.

3: Long horizon length (or long CoT) exacerbates several issues

In long horizon tasks where the AI does a bunch of reasoning:

Maybe the AI's sabotage behavior is more tied up with usefulness.
Maybe the AI can verbalize strategies for subverting training, and this can decrease sample efficiency (e.g., gradient hacking).
Maybe sample efficiency is just lower, for instance it might be hard to get the AI to explore into good trajectories.

4: It's likely that our detectors will have many false positives

There will probably be lots of benign mistakes that get labelled as bad by our reward signal: it seems hard to discriminate sandbagging from generic mistakes. This might hurt sample efficiency a lot.
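One way to see how much label noise this could introduce is a base-rate calculation. The sketch below uses Bayes' rule to compute what fraction of flagged actions are actually sabotage; the function name and all the example rates are hypothetical, chosen only to illustrate the effect.

```python
def flagged_precision(sabotage_rate, detection_rate, false_positive_rate):
    """Fraction of flagged actions that are truly sabotage.

    sabotage_rate: fraction of all actions that are intentional sabotage.
    detection_rate: probability that a sabotage action gets flagged.
    false_positive_rate: probability that a benign action (e.g. an honest mistake) gets flagged.
    """
    true_flags = sabotage_rate * detection_rate
    false_flags = (1.0 - sabotage_rate) * false_positive_rate
    return true_flags / (true_flags + false_flags)
```

With a hypothetical 1% sabotage rate, 90% detection rate, and 5% false positive rate, only about 15% of the flagged examples would be real sabotage, so most of the "bad" training labels would be noise.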

Conclusion

In this post I've given some theoretical thoughts about the sample efficiency of training for the purpose of removing intentional bad behavior. I plan to release some empirical research about this topic soon. I think that from these works we have a slightly better understanding of what training slack should look like. I think this understanding is somewhat helpful for threat modeling and for guiding our expectations when doing future research on training-based control.