Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt

If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/performance while simultaneously not having this behavior trained out because it intentionally never explores into better behavior. As in, it intentionally puts very low probability on (some) actions which would perform very well to prevent these actions from being sampled and then reinforced. We'll refer to this as exploration hacking.

In this post, I'll discuss a variety of countermeasures for exploration hacking and what I think the basic dynamics of exploration hacking might look like.

Prior work: I discuss how exploration hacking fits into a broader picture of non-concentrated control here, Evan discusses sandbagging in the context of capability evaluations here, and Teun discusses sandbagging mitigations here.

Countermeasures

Countermeasures include:

Training on high quality human examples: One basic starting point is to initialize RL by training the AI to imitate on-distribution high quality human examples such that it has to explore into strategies which are at least as good as these human trajectories. We could also try to construct such examples adversarially (e.g., train AIs to point out errors we adversarially insert).
More granular feedback and trajectory edits: In some cases, constructing good enough imitation trajectories may be challenging. For instance, it seems very hard for humans to construct a long reasoning trace for a reasoning model (though it is unclear how good the human examples would need to be to force exploration). One strategy that might get around this is having humans provide more granular feedback (e.g., was the completion of this subtask good?) and trajectory edits (making smaller scale changes to the AI's actions that you train the AI to imitate).
Ensembling together AIs for better exploration: Another approach is to ensemble a variety of AIs together for RL to augment exploration. You'd train the AIs on the best trajectories that any AI found. This could make exploration hacking much harder because if any AI (from the diverse ensemble) explores into a given high performing behavior, then all AIs would be forced to do that behavior.
Fuzzing internals: Approaches that use access to internals could also be useful. For instance, doing some sort of fuzzing on internals could force more diversity and/or make precise exploration hacking more difficult.
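
As a rough illustration of the ensembling idea above, here is a minimal sketch of the selection step: sample from every policy in the ensemble, keep the best trajectory any of them found for each task, and use that shared batch as the training target for all of them. Everything here is a stand-in I made up for illustration (the Policy type, reward_fn, and the toy policies); it shows only the "best trajectory from any AI" selection, not an actual RL update.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = str  # hypothetical stand-in: a trajectory is just a string here
Policy = Callable[[str], Trajectory]  # hypothetical: maps a task to one sampled trajectory


def best_ensemble_trajectories(
    policies: Sequence[Policy],
    reward_fn: Callable[[str, Trajectory], float],
    tasks: Sequence[str],
    samples_per_policy: int = 4,
) -> List[Tuple[str, Trajectory, float]]:
    """For each task, keep the highest-reward trajectory found by ANY policy."""
    best = []
    for task in tasks:
        # Pool samples from every member of the ensemble.
        candidates = [
            policy(task) for policy in policies for _ in range(samples_per_policy)
        ]
        scored = [(reward_fn(task, traj), traj) for traj in candidates]
        top_reward, top_traj = max(scored, key=lambda pair: pair[0])
        best.append((task, top_traj, top_reward))
    return best


# Toy example (assumed): one policy sandbags, another explores the good behavior.
def sandbagger(task: str) -> Trajectory:
    return "weak answer"


def explorer(task: str) -> Trajectory:
    return "strong answer"


def toy_reward(task: str, traj: Trajectory) -> float:
    return 1.0 if traj.startswith("strong") else 0.0


# Every policy, including the sandbagger, would then be trained on this batch.
shared_batch = best_ensemble_trajectories([sandbagger, explorer], toy_reward, ["task-1"])
```

In this toy setup the sandbagging policy never samples the strong answer itself, but it would still be trained on it because another ensemble member found it.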

One important case of exploration hacking is debaters or critique AIs doing exploration hacking so that recursive oversight is less helpful. As in, they put very low probability on pointing out useful critiques. To mitigate this, we want to aggressively train these AIs to point out human-inserted or human-known flaws. (In the case of debate done to judge the quality of an action from an AI agent, this includes both flaws in the original action and flaws in the argument from the other debater.) Note that this method wouldn't necessarily work for eliciting superhuman critiques which the AI is very confident no human would have ever been able to point out (but which a human can understand once made, so that they are actually useful for recursive oversight). However, if the AI messes up in some case where it thought a critique was superhuman, but it was actually inserted and then pointed out by humans, this could generalize in a way that cascades into eliciting superhuman critiques as we discuss in the next section.

How does it generalize when the model messes up?

One key question which affects the difficulty of exploration hacking is how training on a behavior the AI was avoiding generalizes. Suppose an AI intentionally doesn't explore into some behavior due to exploration hacking, we nonetheless get a trajectory which performs well due to this behavior, and we train the AI on that trajectory (resulting in SGD updating the AI to actually put some probability on sampling this action). How much does this generalize? As in, how much does it update the AI to also put some probability on other high performing behavior the AI is intentionally not exploring into? If it generalizes with reasonable sample efficiency, then we should expect exploration hacking to be trained out relatively quickly if we can repeatedly train on even a small number of good trajectories that the AI was trying not to explore into. In the concrete case of the ensemble of different AIs, this looks like: some AI deliberately puts very low probability on some high performing behavior, some other AI samples that behavior, it is then reinforced for all AIs, and this gradient update might change the first AI such that it sandbags in fewer cases.

This implies that AIs might need to be very reliable in exploration hacking: even a small number of mistakes could remove this behavior. (This is similar to dynamics around untrusted monitoring.)

This could also cascade: if an AI is updated to sandbag less, then it might now explore into a bunch of behaviors which it is still intentionally putting lower probability on due to exploration hacking. Then, when these behaviors get reinforced, that might update the AI to sandbag less and so on. Overall, to preserve some exploration hacking policy despite some updates, the AI would need to "be willing" to explore into some types of behavior while still never exploring into other behaviors such that the update doesn't generalize to all intentional sandbagging. This might be a difficult policy for a model to maintain.

This cascade could be particularly strong in the case of ensembling together different AIs. Initially, the group of AIs would need to coordinate on a specific exact boundary for what not to explore into to avoid being updated at all. Then, if they ever got updated to sandbag less, they would also need to simultaneously coordinate on a new shared boundary for what not to explore into. Every time the AIs mess up the boundary for what not to explore into enough (such that we do sample good actions that some other AI wouldn't explore), sandbagging would be updated against, meaning that coordination might need to be both very good and very reliably good after updates.

It's possible that this sort of cascade dynamic means that if we apply some countermeasures (such as some well targeted use of human imitation and some AI ensembling), this would make it hard for AIs to exploration hack even for quite superhuman behaviors that no human would ever produce.

Empirical evidence

Do we have any empirical evidence about how AIs get updated when they were intentionally sandbagging and about whether this type of cascade happens? We did some experiments on password-locked models, models which are trained to perform poorly when not given some password (so that it is more likely that the behavior is better understood as "intentional"). It seems unlikely that the models we were working with really understood that what was going on was sandbagging, but it still might be an OK analogy for intentional sandbagging.

We found that demonstrations usually generalize very well to removing sandbagging (Figure 3), even across distribution shifts (Figure 4). We also found that RL was reasonably sample efficient on password-locked models (Figure 5), though not extremely sample efficient, which is some evidence against a rapid exploration cascade.

Another experiment we did was that we trained a strong model to perform poorly and then trained it on a smaller number of intermediate strength demonstrations (weaker than the model we were training when given the password, but stronger than what it was trained to imitate without the password). This might be analogous to the case where you get a small number of examples where the model messed up and put low weight on a behavior that you actually can find/sample. First, we might expect that when trained on only a small number of demonstrations, the strong model overshoots the performance it would have if trained on a large number of these intermediate quality demonstrations. This could happen because the training generally updates against intentionally performing poorly. We may have found a small overshoot effect, but if so, the effect is small and inconsistent (Figure 3 and Appendix B.3.2). Additionally, we might expect that training on a small number of intermediate quality generations forces the AI to put some weight on even higher quality generations even if this doesn't have as large of an effect on the mean performance. This "increased exploration" effect might be important in the case where you do RL as it might yield fast performance increases as you explore this behavior. We tested this by doing some RL after variable amounts of training on intermediate quality generations (Appendix B.3.3) and while our results are inconclusive, they do seem to indicate improved performance from doing a small amount of training on intermediate quality demonstrations.

The models in our experiments weren't actually coherently misaligned, which might make the results less likely to generalize.

Detecting exploration hacking (or sandbagging)

Detecting exploration hacking (or sandbagging) rather than training it out could also be promising. One approach is to look for anomalously low performance (especially on tasks AIs might particularly want to sandbag on). This could be done by comparing performance on various similar tasks or noticing discrepancies in performance trends for a sweep of AIs (aka analyzing scaling laws). If performance first goes up and then down with scale, that could be a sign of sandbagging. Better approaches for predicting elicited capabilities could be quite helpful for noticing sandbagging. See also this discussion.
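
The "goes up and then down with scale" check can be sketched very simply. This is only an illustrative heuristic under assumptions I'm making up: scores are ordered from smallest to largest model, and tolerance is a hypothetical noise threshold that a real analysis would need to estimate (e.g., from seed-to-seed variance). A flag from this check is a reason to look closer, not evidence of sandbagging on its own.

```python
from typing import Sequence


def rises_then_falls(scores: Sequence[float], tolerance: float = 0.0) -> bool:
    """Flag a sweep whose performance peaks at an intermediate scale.

    `scores` must be ordered by model scale (smallest first). Returns True when
    the best score belongs to neither the smallest nor the largest model and the
    largest model is worse than the peak by more than `tolerance`.
    """
    if len(scores) < 3:
        return False  # need at least three points to see a rise and a fall
    peak_index = max(range(len(scores)), key=lambda i: scores[i])
    drop_from_peak = scores[peak_index] - scores[-1]
    return 0 < peak_index < len(scores) - 1 and drop_from_peak > tolerance


# Made-up sweeps: the first peaks at the third model, the second keeps improving.
suspicious = rises_then_falls([0.2, 0.5, 0.7, 0.6], tolerance=0.05)
unremarkable = rises_then_falls([0.2, 0.5, 0.7, 0.8], tolerance=0.05)
```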

Another approach would be to check if AI performance is changed greatly by intervening on the input to change information that shouldn't be very relevant to the task. For instance, if rebranding research from safety to capabilities or capabilities to safety greatly changes performance, that would be concerning.

Training out exploration hacking (as discussed earlier) may or may not result in us being confident that the model was intentionally doing this behavior (for malign reasons). It does seem like being better at training out exploration hacking would make it easier to detect, all else equal, by examining how much performance rises when various strategies for training out exploration hacking are applied. You could also compare the performance increase across domains (to see if e.g. performance rises much more on safety research or other relatively takeover-risk-reducing tasks). However, the AI might be less likely to exploration hack if it thinks this is likely to get caught or trained out.

Speculation on how neuralese and other architecture changes affect exploration hacking

Using an architecture with "neuralese recurrence" might make exploration hacking substantially harder to mitigate. In particular, strategies that depend on imitation of human comprehensible outputs (either reasoning or actions) may not work. This does depend somewhat on what training method was used to make neuralese work.

Backpropagating through the entire sequence of neuralese can work, so if the AI is trained to use neuralese using an imitative next-token-prediction starting point (as in, it learned to think in opaque recurrence initially to predict next tokens), then preventing exploration hacking through human trajectory edits and imitation could be viable.

Training a model to imitate another model can be workable, but might not work naively. In particular, directly training the model to output the same activations as another model might greatly reduce performance if the two models have ended up with different reasoning styles (as seems likely). But, training the model to produce activations such that the activations of the other model can be predicted with a small amount of additional computation seems like it could still work and could still make exploration hacking more difficult.

Providing more granular human feedback where we rate the goodness of actions within a trajectory can still be workable. Something like this could potentially be applied to some extent to neuralese recurrence itself. We could try to train the models to describe what intermediate conclusions they've made by training them on something like: take neuralese activations, and then produce text which appears to a human to be a summary of reasoning and conclusions and which reaches as good of conclusions (according to the human with the benefit of hindsight etc) as possible. This task of outputting conclusions that look useful to a human might be easier to get working than more faithful notions of neuralese reasoning translation and might also suffice for making exploration hacking in the thinking itself quite difficult.

Future work

Further investigating exploration hacking and countermeasures seems useful. Our prior work makes some progress, but many questions remain unanswered.

Some areas to investigate are:

How do models get updated by SGD when they are trained on a trajectory which takes an action that the AI was deliberately trying not to explore into? You could prompt and train models to robustly never explore a given type of action while having a good understanding of the situation and then see how training on a small number of trajectories affects the model. This differs from our prior experiments in that the model would have a better understanding of the situation and is more robustly/generalizably implementing the policy.
How does exploration hacking work for longer agentic tasks? E.g., tasks that would take a human 1-4 hours? What countermeasures work in practice given that models are already deviating substantially from how humans would complete tasks?
How does exploration hacking work for reasoning models? Can humans prevent exploration hacking with trajectory edits or is the reasoning too inhuman? If we pretend that the reasoning is fully opaque (like neuralese), how well do various countermeasures work?
Do exploration cascades (as described earlier) actually happen in any case? If so, how fast/sample efficient is this? Does ensembling multiple models help?
Does fuzzing internals ever help reduce exploration hacking? In prior work, this didn't help, but it seems plausible it can be made to work. Can this work even if a model has been trained to generally be quite robust in its implementation of an exploration hacking policy?

Training on high quality human examples: One basic starting point is to initialize RL by training the AI to imitate on-distribution high quality human examples such that it has to explore into strategies which are at least as good as these human trajectories. We could also try to construct such examples adversarially (e.g., train AIs to point out errors we adversarially insert).
More granular feedback and trajectory edits: In some cases, constructing good enough imitation trajectories may be challenging. For instance, it seems very hard for humans to construct a long reasoning trace for a reasoning model (though it is unclear how good the human examples would need to be to force exploration).

Overall, the quality of these proposals basically depends on their ability to explore well. Exploring well obviously doesn't just mean injecting more randomness, since a high-entropy approach will take a very very long time to find useful strategies in complex domains. Training on high-quality human examples avoids this problem by ensuring that we at least explore strategies often employed by human experts, which are going to be some of the better strategies.

However, if the AI is being trained to (even mildly) superhuman levels of capability (on a given subject matter), then clearly human imitation does not do much good anymore; the AI is free to sandbag at mildly superhuman levels of capability (within the given subject matter). Training on high-quality human examples might even cause or exacerbate this (basically, giving the AI the idea that it could imitate humans, which can then be a good sandbagging strategy).

So, basically, training to a given capability level (while robustly avoiding sandbagging) seems to require exploration ability near that specific capability level.

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

I think I basically agree with much of this comment. However:

Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are caught by these examples. This might force the AI to explore lots of superhuman stuff.
Further, if the model messes up, this might cascade as discussed.
Ensembling multiple models for exploration (or multiple prompt variants, or various different jitters of the internals) might mean that all the models need to coordinate on the same boundary of what not to explore into. Otherwise you'll explore some behavior that a model doesn't do, and then update this to sandbag less and then this might cascade. So, it might be non-trivial for models to exploration hack (given substantial countermeasures) even if the behavior is quite superhuman.
None of these mechanisms suffice for "robustly avoiding sandbagging", but they suggest that countermeasures are still useful in superhuman regimes. (Of course, betting the world on countermeasures which have a 50% chance of working is crazy, but that doesn't mean that we shouldn't try to make the countermeasures better!)

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

This would be true if we had an infinite amount of data, but we don't. You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old-school Atari RL.)
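
To make the gap between "some probability" and "substantial probability" concrete, here is a minimal sketch of mixing a policy's next-token distribution with a uniform prior. The specific numbers (a 50,000-token vocabulary, epsilon of 0.01, a 20-token strategy) are assumptions picked purely for illustration. The mixture guarantees every token gets at least epsilon / vocab_size, but the guaranteed floor on any particular multi-token sequence is that quantity raised to the sequence length.

```python
import math
from typing import List


def mix_with_uniform(probs: List[float], epsilon: float) -> List[float]:
    """Return (1 - epsilon) * probs + epsilon * uniform over the same vocabulary."""
    vocab_size = len(probs)
    return [(1.0 - epsilon) * p + epsilon / vocab_size for p in probs]


def log10_floor_sequence_prob(vocab_size: int, epsilon: float, num_tokens: int) -> float:
    """log10 of the probability floor the mixture guarantees for any fixed sequence.

    Each token is guaranteed at least epsilon / vocab_size, so a sequence of
    num_tokens tokens is guaranteed at least (epsilon / vocab_size) ** num_tokens.
    """
    return num_tokens * math.log10(epsilon / vocab_size)


# A policy that puts zero mass on three of four tokens still ends up with a floor.
mixed = mix_with_uniform([1.0, 0.0, 0.0, 0.0], epsilon=0.1)

# Assumed numbers: 50,000-token vocabulary, epsilon = 0.01, 20-token strategy.
floor_exponent = log10_floor_sequence_prob(vocab_size=50_000, epsilon=0.01, num_tokens=20)
```

With these assumed numbers the guaranteed floor is on the order of 10^-134, so a strategy the policy is actively avoiding would essentially never be sampled even though its probability is technically nonzero.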

You need to ensure substantial probability on exploring good strategies which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old school atari RL.)

Yeah, what I really had in mind with "avoiding mode collapse" was something more complex, but it seems tricky to spell out precisely.

Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it.

It's an interesting point, but where does the "extremely" come from? Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I'm unclear on the rules of the game as you're imagining them.

Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values

This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.

Like the model only needs to be somewhat confident that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior it needs to be very confident.

I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
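
Here is a rough sketch of the arithmetic behind the "many things like X" point, under the independence assumption stated above. The function names are just illustrative, and real behaviors presumably wouldn't be perfectly independent, so treat this as a toy calculation rather than a worked-out model.

```python
def prob_any_avoided_behavior_covered(p_single: float, num_avoided: int) -> float:
    """Chance that at least one of num_avoided avoided behaviors shows up in the
    human data, if each is independently covered with probability p_single."""
    return 1.0 - (1.0 - p_single) ** num_avoided


def max_per_behavior_risk(total_risk: float, num_avoided: int) -> float:
    """Largest per-behavior coverage probability that keeps the overall chance
    of at least one avoided behavior being covered at or below total_risk."""
    return 1.0 - (1.0 - total_risk) ** (1.0 / num_avoided)


# One avoided behavior at 5% is a 5% gamble; fifty of them is a very different bet.
single_gamble = prob_any_avoided_behavior_covered(0.05, 1)
fifty_gambles = prob_any_avoided_behavior_covered(0.05, 50)

# To keep the overall risk at 5% across 100 avoided behaviors, each one has to
# be far safer than 5%.
needed_per_behavior = max_per_behavior_risk(0.05, 100)
```

With 50 avoided behaviors at 5% each, the chance that at least one is covered is roughly 92%. To hold the overall chance to 5% across 100 avoided behaviors, each individual one needs to be covered with probability of only about 0.05%, which is where the "very confident" per-case requirement comes from.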

I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

(FWIW, I didn't use them as synonyms in the post except for saying "aka sandbagging" in the title which maybe should have been "aka a strategy for sandbagging". I thought "aka sandbagging" was sufficiently accurate for a title and saved space.)
