In the future, we will want to use powerful AIs on critical tasks such as doing AI safety and security R&D, dangerous capability evaluations, red-teaming safety protocols, or monitoring other powerful models. Since we care about models performing well on these tasks, we are worried about sandbagging: that if our models are misaligned [1], they will intentionally underperform. 

Sandbagging is crucially different from many other situations with misalignment risk, because it involves models purposefully doing poorly on a task, rather than purposefully doing well. When people talk about risks from overoptimizing reward functions (e.g. as described in What failure looks like), the concern is that the model gets better performance (according to some metric) than an aligned model would have. And when they talk about scheming, the concern is mostly that the model gets performance that is as good as an aligned model despite not being aligned. When arguing about those situations, a core part of the argument is “training causes models to perform well according to your measurements”. Because we want to argue that misalignment might cause models to perform poorly, we need to explain why that argument fails here.

In this post, we’ll explore the basic dynamics here:

  • Why we want models to perform well on tasks, with examples of tasks where good performance is particularly important.
  • Why training might be expected to partially resolve sandbagging.
  • Why models might be able to use a strategy called “exploration hacking” to defeat our attempts to train them.
  • Possible countermeasures we could employ, and the overall extent to which we expect to be able to train models to perform well.

Overall, we are highly uncertain about the difficulty of preventing sandbagging. The difficulty seems to depend in part on the domain: in some domains we expect that models can easily be trained to stop sandbagging, whereas in others we think there is little hope of preventing it. But overall, sandbagging risk seems substantially lower if we carefully apply the countermeasures discussed in this post.

Many of these arguments are not original to this post. (Some of these arguments appear in Evan Hubinger's post 'When can we trust model evaluations?' and Joe Carlsmith’s post ‘The “no sandbagging on checkable tasks” hypothesis’, which we recommend as alternative perspectives. For more detail on countermeasures, see Ryan Greenblatt's post 'Notes on countermeasures for exploration hacking'.) Our goal here is to provide a self-contained explanation of the issues and potential countermeasures. 

The difficulties with training (and potential countermeasures) that we discuss in this piece are also increasingly topical because of the rise of reasoning models and inference-time scaling in general, for a few reasons. Firstly, even aside from misalignment concerns, reasoning models might be hard to fine-tune for reasons similar to those we hypothesize make schemers hard to fine-tune. Secondly, long chains-of-thought of the type used by reasoning models probably also make it harder to prevent exploration hacking. And if inference scaling is instead accomplished with some kind of neuralese, the situation will be even worse.

Thanks to Zach Stein-Perlman, Karson Elmgren, Tanae Rao, Martín Soto, Geoffrey Irving, Andres Campero, Alexa Pan, Neev Parikh and Aryan Bhatt for discussion and comments on an earlier version of this post.

Sandbagging can cause a variety of problems

It’s bad if the AIs aren’t useful. (See why we don’t want a pet rock.) Even though this basic point is obvious, it will be helpful to consider a few concrete examples and make some distinctions that we’ll refer to in the rest of this discussion.

Here are several different contexts where it’s particularly bad if the AIs underperform on purpose:

  • Dangerous capability evaluations: We’ll sometimes want to test whether a model could do some dangerous task if it were trying to. For instance, if we want to make a high-stakes control safety case, the red team needs to be able to elicit powerful and sneaky attacks from an untrusted model. A scheming model has incentive to sandbag its attack-generation abilities during evaluation, in order to avoid triggering risk thresholds that could mean it is deployed with increased scrutiny or not deployed at all.
  • Applications to AI safety: A key source of hope for some people who are optimistic about AI alignment is that we may be able to create AIs that reduce the risk posed by further AI development, especially by doing technical AI safety research. Using AIs for alignment doesn’t work well if AIs don’t try to produce useful work in that domain, and misaligned AIs are incentivized to sandbag on safety R&D in order to preserve their ability to accomplish their goals.
  • Untrusted monitoring: We may want to monitor untrusted models using untrusted models. Misaligned monitors may deliberately underperform in general, by falsely flagging non-attacks and/or failing to flag attacks. Furthermore, models may want to sandbag their collusion capabilities.
  • Generating training signals: We’ll likely want to use methods that apply AIs to the problem of generating high quality training data that we can use to train powerful models. For example, AIs might be used to generate the reward that the AIs get for taking an action (as is done in Constitutional AI), or to generate demonstrations that the AIs are trained to imitate (as in the supervised fine-tuning stage of Deliberative Alignment). If the AIs purposefully mess up, these methods might fail. (This is related to the previous point since we likely want to use training in order to elicit good performance from untrusted monitors.) These methods are often called “scalable oversight” (e.g. here), but that term has been used for a lot of other things too, so we avoid it to prevent ambiguity.

It’s worth distinguishing between two types of sandbagging that models could engage in:

  1. Hiding their latent capabilities: Models may deliberately do worse at a task than they would have done if they wanted to succeed. We are particularly concerned about this in the context of dangerous capability evaluations (where the whole point is to know what the model is actually able to do) and in untrusted monitoring (where we suspect that the model is capable of spotting sneaky dangerous actions, because another instance of the same model generated those actions).
  2. Not living up to their potential: Models might deliberately fail to learn capabilities that we would like them to have: in this case, unlike the previous one, these AIs don’t have the capability we were trying to elicit, and hence can’t use it against us. This is particularly relevant when trying to teach models specialized skills. The applications above that are in this category are doing AI safety research and generating training signals. (We use this term instead of talking more broadly about the AIs not being capable enough because we’re trying to focus on the part of the problem arising from the misalignment of the model, rather than the part that is just a “normal” capabilities problem. Of course, it might turn out that techniques that aim to solve the normal capabilities problems still work even if the models are scheming.)

We typically have the first type in mind when referring to sandbagging, but some of the considerations in this post will especially apply to models trying to avoid living up to their potential.

Clearly, if scheming models want to sandbag, they will have to find ways to do this despite the fact that we’re trying to train them. Training does in fact make sandbagging way harder, as we will discuss in the next section.

Training makes sandbagging significantly harder for the AIs

In reinforcement learning (RL), we fine-tune a model on examples of good performance. We refer to examples as “on-policy data” if they are sampled from the model that we’re training. When examples are obtained from other sources such as humans or other AIs, we call them “off-policy data”. 

(We’re using these terms in analogy to “on-policy RL” and “off-policy RL”, which are two important categories of RL algorithm. An off-policy RL algorithm such as DQN allows for the use of off-policy data, whereas an on-policy RL algorithm such as REINFORCE typically needs to use on-policy data sampled from the latest version of the model. A lot of the research on RL algorithms in the pre-LLM years can be characterized as attempts to adapt on-policy RL algorithms in ways that allow for the use of off-policy data.)

RL algorithms usually benefit substantially from on-policy data. For example, AlphaGo (like the other big successful applications of RL) achieved superhuman performance by using the model itself to generate high-quality Go games as training data. It’s natural to use on-policy data, because the model itself is often the most efficient (in terms of cost vs quality) source of data. With potential misalignment, we’re worried about the model itself sabotaging this process. Thus, we are interested in approaches that rely less on the untrusted model, or that force the untrusted model to perform well, even if they’d be less efficient aside from misalignment concerns (which is analogous to the problem of designing AI control techniques).

Training on high-quality data can remove sandbagging

Training out sandbagging is straightforward if we have access to high-quality demonstrations of the target behavior on (approximately) the distribution of inputs that we plan to deploy the model on, as it allows us to fine-tune the model directly on these demonstrations. For example, when training a model to play chess, we can fine-tune it on a dataset of chess moves (or reasoning followed by moves) produced by human experts.
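
To make this concrete, here is a minimal sketch of this kind of off-policy fine-tuning, assuming a HuggingFace-style causal LM and a hypothetical file `expert_games.txt` of expert chess transcripts (the real setup would of course be larger and more careful):

```python
# Minimal sketch of off-policy supervised fine-tuning on expert demonstrations.
# "expert_games.txt" (one transcript per line) is a hypothetical stand-in for a
# dataset of expert chess games; "gpt2" stands in for the untrusted model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

with open("expert_games.txt") as f:
    demonstrations = [line.strip() for line in f if line.strip()]

model.train()
for demo in demonstrations:
    batch = tokenizer(demo, return_tensors="pt", truncation=True, max_length=512)
    # Next-token prediction loss on the expert demonstration: the model is
    # pushed toward the expert's moves regardless of what it "wanted" to
    # output, up to the quality of the data and its capacity to imitate it.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```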

What happens if we fine-tune a sandbagging model on this dataset? Probably some mix of the following things:

  1. The AI’s underlying sandbagging cognition gets selected away, and replaced by cognition that doesn’t want to sandbag at chess or in general. This would be ideal.
  2. The AI’s underlying sandbagging cognition adapts in some way that causes it to stop sandbagging on the training distribution (while potentially still sandbagging on other distributions). For example, the training process might modify the model to believe that inputs that appear in this dataset aren’t a good opportunity to sandbag, or it might cause the model to have a reflexive urge to answer questions about chess moves accurately. But crucially, if the evaluation distribution is roughly the same as the training distribution, this also causes the model to stop sandbagging on the evaluation.
  3. The model may lack the capacity to imitate the data or to generalize from training to evaluation, in which case it would still not play as well as a grandmaster.

The bottom line here is that off-policy data guarantees elicitation up to the minimum of the capability level displayed in the data and the model’s capability to imitate it.[2] Notably:

  • If the model is worse at chess than a grandmaster after fine-tuning, its true chess playing capability before fine-tuning was probably below grandmaster level.
  • If the model plays chess as well as a grandmaster after fine-tuning, it’s possible that it had superhuman chess capabilities before fine-tuning. 

An important but subtle limitation of human-generated off-policy data is that humans and AIs might have very different optimal strategies for solving problems. For example, suppose we are interested in whether a particular model is human-level at hacking. We train it to imitate human data, and it fails. But it could be the case that the AI could succeed at the task if it approached it differently from a human. Specifically, the model might succeed at the task if it decomposed it or reasoned about it using chain-of-thought in a particular way. Scheming AIs could therefore sandbag specifically on aspects of behavior where good performance benefits from idiosyncratic approaches that are only useful to AIs. (Thanks to Fabien Roger for this point.)

If off-policy data is low-quality, on-policy data might help

In many cases where preventing sandbagging matters most, it is impractical to use off-policy data. One such case is training reasoning models, which use very long chains-of-thought that are too expensive to obtain using means other than sampling from a reasoning model. We face similar challenges with long-horizon agentic tasks. 

On-policy data generation is also often a central component of safety measures: for example, we may want to train untrusted monitors using a setup where the monitor is rewarded for distinguishing between attacks and non-attacks, while a generator model is rewarded for fooling the discriminator (similar to GANs). Another example is AI safety via debate.

More generally, whenever we want models to perform better than humans or trusted AIs, we will need to obtain the data by sampling from an untrusted AI.

We can train the model with on-policy data by iteratively:

  1. Given a prompt (a 'state' in RL terminology), sampling ('exploring') many completions ('actions') and evaluating their quality ('computing their reward').
  2. Fine-tuning the model on its own completions weighted by their (relative) reward.[3] The simplest such algorithm, expert iteration, fine-tunes the model on the highest-reward actions, which reinforces the model components that produced them (sketched below).
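
As a concrete (and heavily simplified) sketch of this loop, here is what expert iteration looks like in Python; `sample_completion`, `reward`, and `fine_tune` are hypothetical helpers standing in for sampling from the untrusted model, scoring its outputs, and supervised fine-tuning:

```python
# Toy sketch of expert iteration with on-policy data.
def expert_iteration(model, prompts, num_samples=16, num_rounds=10):
    for _ in range(num_rounds):
        best_completions = []
        for prompt in prompts:
            # Explore: sample many on-policy completions and score them.
            candidates = [sample_completion(model, prompt) for _ in range(num_samples)]
            best = max(candidates, key=lambda c: reward(prompt, c))
            best_completions.append((prompt, best))
        # Reinforce: fine-tune on the model's own highest-reward completions,
        # which strengthens whatever cognition produced them.
        model = fine_tune(model, best_completions)
    return model
```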

This might stop some models from sandbagging. For example, if models don’t know which actions are high-reward, they might execute them accidentally, which leads to those actions being reinforced. Sampling on-policy data may also help when the intended low-reward actions are sufficiently similar to higher-reward actions that sampling at high temperature (or otherwise injecting randomness into the actions) can accidentally make the model output a higher-reward action instead. In general: if a model makes mistakes while sandbagging, sampling on-policy data can reinforce strategies that lead to higher rewards. The password-locking paper provides suggestive empirical evidence for this possibility.

Models might subvert training via exploration hacking

A central issue when training with on-policy data is that sandbagging models could engage in exploration hacking: intentionally refraining from taking certain actions in order to prevent RL from eliciting some of their capabilities.

How does this work? In general, RL doesn’t work if none of the actions are high reward: then it has nothing to reinforce. This can happen for the innocuous reason that in some tasks, it is genuinely hard to discover good strategies. But in the context of exploration hacking, we are worried that sandbagging models may intentionally avoid exploring high-reward actions in order to prevent the RL process from reinforcing them.

Most simply and centrally, an exploration hacking model will make sure it almost never takes actions that it knows would be reinforced but that it doesn’t want to take, since assigning non-negligible probability to those actions would allow them to be explored. For example, suppose the model had discovered a Linux zero-day during pretraining on kernel code. We might want to train it with outcome-based reward for hacking into a Linux machine. The model may reason that it wants to keep the zero-day a secret, for instance because it wants to use it at some later point in order to obtain root access to its datacenter. As long as it can do this reasoning reliably (in its forward pass or possibly a neuralese recurrent state), it will never mention the zero-day, so training will effectively do nothing.
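
To see the underlying mechanism in miniature, here is a toy simulation (our illustration, not from the original argument): REINFORCE on a three-armed bandit where the policy assigns essentially zero probability to the best arm. Because that arm is never sampled, its logit never receives a gradient, and training leaves it unexplored.

```python
# Toy illustration: REINFORCE on a 3-armed bandit. Arm 2 pays the most, but the
# "sandbagging" policy puts ~0 probability on it, so it is never sampled and
# never reinforced; training effectively does nothing about it.
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.2, 1.0])   # arm 2 is the "zero-day" action
logits = np.array([2.0, 2.0, -30.0])       # near-zero mass on arm 2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
for _ in range(10_000):
    probs = softmax(logits)
    arm = rng.choice(3, p=probs)
    grad = -probs                  # REINFORCE: grad log pi(arm) = onehot - probs
    grad[arm] += 1.0
    logits += lr * true_rewards[arm] * grad

print(softmax(logits))  # arm 2's probability is still ~0: it was never explored
```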

Some other examples where exploration hacking is similarly easy for a model to accomplish include:

  • The model knows a collusion signal, e.g. a password, that will signal to an untrusted monitor that it shouldn’t flag its action. We might want to get the model to reveal its collusion signal using adversarial training. (A basic setup here is that we might try training an untrusted policy using RL to produce attacks that an untrusted monitor won’t flag.) It’s easy for the model to resist RL simply by not revealing any passwords.
  • The model is playing the red team in a sabotage evaluation. Catastrophic sabotage requires the consistent execution of a multi-step plan, which involves following the plan consistently across contexts without shared information. But even if the model can think of a plan that it would be able to follow consistently despite the lack of information-sharing, it can choose to subvert the evaluation by choosing not to follow this plan.

These examples have two essential properties in common:

  1. The model already knows how to take the action(s) that would lead to better reward. Exploration hacking worked because the model already knew some fact that it could have used to obtain high reward (but chose not to).
  2. The action(s) that would be most strongly reinforced by the RL process are sufficiently different from low-reward actions, so that avoiding high reward is easy even when substantial randomness is injected into the sampling process. (Note that this does constrain the model in what kinds of collusion signals it can use, since collusion signals need to be negligibly unlikely to be sampled accidentally.) 

Can models still perform exploration hacking in situations without these properties? Potentially yes, but it’s harder:

  1. In cases where the model doesn’t know which actions lead to high reward, the model could use a “low entropy” exploration hacking strategy: suppose the model doesn’t know which of ten plausible approaches is the best for solving some problem. If it arbitrarily chooses one of those approaches and consistently follows that one, neither it nor the RL process will learn which of them is better. (Note that this exploration hacking strategy is only useful for sandbaggers that are trying to avoid living up to their potential.)
  2. In cases where high-reward actions are similar to low-reward actions, exploration hacking may also require avoiding certain suboptimal actions to prevent gradual exploration into optimal actions. For example, if an optimal action A and a suboptimal action B are sufficiently similar, then gradients can be correlated: increasing the probability of B could inadvertently increase the probability of A as well. 
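
To illustrate point 2 with a toy example (ours, under simplifying assumptions): with a shared-parameter policy, reinforcing a suboptimal action B whose features resemble those of the optimal action A also raises the probability of A.

```python
# Toy illustration of correlated gradients under function approximation.
# The policy scores each action as theta . phi(action); actions A and B have
# similar feature vectors, C does not. Reinforcing only B still raises A.
import numpy as np

phi = {
    "A": np.array([1.0, 0.9, 0.0]),   # optimal action (never reinforced directly)
    "B": np.array([1.0, 1.0, 0.0]),   # suboptimal action, similar to A
    "C": np.array([0.0, 0.0, 1.0]),   # dissimilar action
}
theta = np.zeros(3)

def probs(theta):
    logits = np.array([theta @ phi[a] for a in "ABC"])
    z = np.exp(logits - logits.max())
    return dict(zip("ABC", z / z.sum()))

print("before:", probs(theta))
for _ in range(50):
    p = probs(theta)
    # Policy-gradient step that reinforces B only:
    # grad log pi(B) = phi(B) - sum_a pi(a) * phi(a)
    expected_phi = sum(p[a] * phi[a] for a in "ABC")
    theta += 0.5 * (phi["B"] - expected_phi)
print("after: ", probs(theta))  # A's probability has risen along with B's
```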

Off-policy data could mitigate exploration hacking

In an earlier section we described how off-policy data can potentially elicit good performance via fine-tuning if we have enough examples of the desired behavior. This naturally raises the following question: Can we increase the effectiveness of on-policy data by augmenting it with a small amount of off-policy data?

Clearly, if our off-policy data sometimes outperforms the on-policy data, this should help at least a little bit. For example:

  • We could use expensive human labor to manually construct some examples of high quality chain-of-thought reasoning, or inject snippets of high-quality reasoning into an otherwise on-policy chain-of-thought.
  • In long-horizon agentic tasks, there could be subtasks that a trusted model can perform well, and we might also be able to provide examples of task decompositions.
  • In debate, we could use AI-assisted humans to generate and judge debates, or inject human-generated arguments into a debate between two AIs.
  • When training untrusted monitors using a GAN-like setup, we can initialize the generator and discriminator (i.e. the monitor) with data generated by a trusted model. 
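
The simplest version of this augmentation is just to mix a small fraction of off-policy examples into each otherwise on-policy batch. A sketch, with `on_policy_batch`, `trusted_demos`, and `fine_tune_step` as hypothetical stand-ins:

```python
# Sketch: augmenting on-policy fine-tuning data with a small fraction of
# off-policy demonstrations from humans or a trusted model.
import random

def mixed_batch(model, trusted_demos, off_policy_fraction=0.1, batch_size=64):
    n_off = int(off_policy_fraction * batch_size)
    batch = random.sample(trusted_demos, n_off)          # off-policy examples
    batch += on_policy_batch(model, batch_size - n_off)  # the model's own samples
    random.shuffle(batch)
    return batch

def train(model, trusted_demos, num_steps=1000):
    for _ in range(num_steps):
        model = fine_tune_step(model, mixed_batch(model, trusted_demos))
    return model
```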

Fine-tuning the model on off-policy data also bears some hope of generalizing further than just on those inputs where off-policy data is better than on-policy data, due to the following argument: 

  1. Under the assumption that the model does have the capacity to fit the off-policy training data, fine-tuning guarantees that the model will perform as well as the off-policy actions on the distribution of off-policy inputs.
  2. The cognition that previously caused the model to sandbag on those inputs must have changed in some way, since the model no longer sandbags on them. In the best case, we might just have taken gradient steps to select away the schemer. A less optimistic but still helpful possibility is that we may have changed the schemer’s sandbagging cognition in a way that either directly generalizes towards less sandbagging on a wider distribution of inputs, or at least generalizes towards better exploration (e.g. the model stops “low entropy” exploration hacking).

This could turn out to be a very effective exploration hacking mitigation. To illustrate, imagine that first we distill the Best-of-N performance of a trusted model into the untrusted model. If the trusted model’s Best-of-N performance sometimes exceeds that of the sandbagging model, this distillation step may already provide some selection pressure against the model’s sandbagging cognition. Furthermore, the model now potentially explores more high-value actions, which translate into even more gradient updates against sandbagging. If this process gains traction, we might get a virtuous cycle. 
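
A rough sketch of that first distillation step, with `sample`, `reward`, and `fine_tune` again as hypothetical helpers:

```python
# Sketch of distilling the Best-of-N performance of a trusted model into the
# untrusted model, as an initialization step before on-policy RL.
def best_of_n_distillation(untrusted_model, trusted_model, prompts, n=32):
    distillation_data = []
    for prompt in prompts:
        # Sample N completions from the *trusted* model and keep the best one.
        candidates = [sample(trusted_model, prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        distillation_data.append((prompt, best))
    # Wherever these outputs beat the untrusted model's sandbagged behavior,
    # imitating them pushes against the sandbagging cognition.
    return fine_tune(untrusted_model, distillation_data)
```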

The crux of this argument is whether or not models will instead just learn to model the off-policy data distribution, rather than generalizing against sandbagging or exploration hacking. On the one hand, because the model is stateless, it doesn’t have an obvious way to memorize the off-policy distribution, so it faces the difficult challenge of matching the off-policy distribution (which requires predicting that distribution) while avoiding good performance outside of it. And the model has to do all this (consistently) during a single forward pass.

On the other hand, there are some cases where off-policy data should be easy for the model to detect by default, either because the off-policy behavior is substantially below the model’s true capabilities, or because the model can easily classify whether or not some behavior was self-generated.

Furthermore, even in cases where the model doesn’t know how to distinguish its own behavior from off-policy data by default, it could learn how to distinguish them by using a crude form of gradient hacking:

  1. It’s plausible that scheming models think about sandbagging during their forward pass while being aware that we may fine-tune them on good performance across some unknown distribution of inputs.
  2. During the forward pass, they may reason: “Does this input seem likely to come up in the off-policy data distribution? If not, I’ll sandbag.” 

Therefore, through fine-tuning, the model could learn which situations typically appear in the off-policy distribution, and to treat just those as exceptions to sandbagging. 

To oversimplify this discussion a little: the AI’s action given some input can be modelled as downstream of the global decision node “Should I sandbag in general?” or possibly even “Should I scheme?”, and the local decision node “Should I sandbag on this particular type of input?”. We hope that training on off-policy data will influence the global decision node, but worry that the effect will merely be local.

Overall, whether off-policy data helps a lot may be sensitive to our ability to generate examples of high performance that (from the perspective of the model) are sufficiently hard to distinguish from actions that the model could plausibly generate itself. It’s also somewhat sensitive to whether the model is able to reason through the basic strategic considerations within a forward pass. It’s important to note as well that in a possible future with neuralese reasoning models, we may completely lose the affordance of fine-tuning on off-policy data.

Quick takes on other countermeasures

There exist ideas for additional countermeasures besides using off-policy data from humans or trusted AIs. They are discussed in detail in Ryan Greenblatt’s post; we will share some quick takes on them in this section.

Detection. We might try to detect sandbagging for the purposes of incriminating a model. See this section in Ryan Greenblatt’s post for more details on possible approaches to achieving this. Overall, we feel pessimistic about both our ability to gain conclusive evidence of sandbagging and our ability to do something useful with that evidence – hence we consider training to be our main countermeasure. 

Ensembling. If we let multiple untrusted AIs explore at the same time, they might need to coordinate on a sandbagging strategy. Otherwise, if one of them explores a high-reward action, we can fine-tune the other models on that action, which may provide a particularly strong gradient signal against their “global” sandbagging cognition. (Note that ensembling can be viewed as a method to obtain off-policy data, but from other untrusted models instead of trusted models or humans.) 

Ensembling seems particularly helpful for tackling the strategy of models deliberately following low-entropy policies when there are many different low-entropy policies, and much less helpful in cases where it’s clear to models which high-reward actions they want to avoid.

Fuzzing internals. By adding noise to the weights or activations of an LLM, we might be able to force more exploration. This seems like a reasonable and cheap idea to try, though initial results were negative. One way in which fuzzing could turn out to be incredibly valuable is if it could make models reveal secrets that are salient to the model during the forward pass. 
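
As a concrete illustration of the general idea (not a description of the cited experiments), here is a minimal PyTorch sketch that adds Gaussian noise to a few transformer blocks of a causal LM via forward hooks while sampling; "gpt2" and the chosen layers are arbitrary stand-ins:

```python
# Minimal sketch of "fuzzing internals": perturb hidden states with Gaussian
# noise during sampling, in the hope of forcing extra exploration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def make_fuzz_hook(noise_scale=0.05):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + noise_scale * torch.randn_like(hidden)
        return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
    return hook

# Attach the hook to a few middle transformer blocks.
handles = [block.register_forward_hook(make_fuzz_hook())
           for block in model.transformer.h[6:9]]

inputs = tokenizer("The hidden vulnerability is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()  # restore the unperturbed model
```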

Conclusion and prognosis

Scheming models may sandbag on critical tasks such as dangerous capabilities evaluations or (alignment) research.

If we have access to demonstrations of the target behavior, we can train out sandbagging using off-policy data. But for many critical tasks, we may need RL with on-policy data to elicit target behaviors for two reasons:

  1. We don't know what adequate behavior looks like, particularly when trying to elicit superhuman capabilities, or if our model is a reasoning model or uses neuralese.
  2. The model's optimal strategies could differ significantly from our demonstrations, for instance when strong performance requires large amounts of reasoning or clever task decomposition.

If we need RL, the model can subvert our training using exploration hacking. How big of a problem will this be at the end of the day, considering our arsenal of possible countermeasures?

We think this will depend a lot on the tasks of interest, as well as on how skilled the model is, both at exploration hacking and at the capabilities whose elicitation it wants to avoid. Some key points:

  • Exploration hacking will be hardest to prevent when models know which actions they want to avoid. This makes us pessimistic about our ability to, for example, get the model to reveal collusion signals.
  • Perhaps counterintuitively, exploration hacking will be much harder for the model if it doesn’t yet know what the high-reward actions are, because then it will be much harder to avoid accidentally exploring them. This results in us not being too concerned about models deliberately not living up to their potential. (Note, however, that this could cause issues when trying to estimate a model’s capabilities in a dangerous capability evaluation, since our exploration hacking countermeasures could accidentally teach the model things it didn’t previously know.)
  • In general, it should be noted that a lot of our countermeasures rely on the model making mistakes, and on our being able to exploit those mistakes. The latter is a question of sample efficiency.

It should also be noted that some of these problems might be tackled soon for capabilities reasons, since related issues appear when training reasoning models. In particular, suppose you have a supervised dataset (e.g. a set of math competition problems and correct solutions). In cases where the models start out very bad at these problems, on-policy RL will be very inefficient at training the models. It might be much more effective to instead produce reasoning traces that arrive at the correct answer by making use of your access to that answer. This has the issue that the produced reasoning traces might not be a good representation of the reasoning process the model should actually follow, but it has the advantage that the reasoning traces will actually demonstrate success. Another approach that could both drive capabilities progress and partially tackle exploration hacking is curriculum learning. That being said, if capabilities progress on reasoning models eventually leads into a neuralese regime, we expect exploration hacking to become much harder to solve.
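
One version of this, loosely in the spirit of rejection sampling and rationalization rather than any particular lab’s pipeline (`generate_trace`, `extract_answer`, and `fine_tune` are hypothetical helpers):

```python
# Sketch: constructing reasoning traces for supervised problems using access to
# the correct answers. First sample normally and keep traces that reach the
# right answer; if none do, sample again with the answer shown as a hint.
# The resulting traces demonstrate success by construction, even if they are
# not a faithful picture of how the model "should" reason.
def build_reasoning_dataset(model, problems, k=8):
    dataset = []
    for problem, correct_answer in problems:
        traces = [generate_trace(model, problem) for _ in range(k)]
        good = [t for t in traces if extract_answer(t) == correct_answer]
        if not good:
            hinted = f"{problem}\n(Hint: the answer is {correct_answer}.)"
            traces = [generate_trace(model, hinted) for _ in range(k)]
            good = [t for t in traces if extract_answer(t) == correct_answer]
        dataset.extend((problem, t) for t in good)
    return dataset

# The collected (problem, trace) pairs are then used as ordinary supervised
# fine-tuning data: fine_tune(model, build_reasoning_dataset(model, problems))
```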

We’re excited about more empirical research into exploration hacking, because it seems important and difficult, yet tractable.

  1. ^

     By “misaligned AIs”, we mean “AIs that appear to be aligned in training but eventually take long-term power-seeking misaligned actions”. Note that “appear to be aligned” is usually conflated with “perform well according to our training objectives”, but this post is precisely about the worry that AIs that appear to be aligned might not perform well.

  2. ^

    We might also hope for some weak-to-strong generalization, but if the models are scheming it seems dangerous to rely on that.

  3. ^

     Strictly speaking this is an oversimplification. In RL we determine the quality of an action given a state based on its effect on the return, which is the (possibly discounted) sum of rewards obtained from that point onwards.
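
     For reference (standard RL notation, not from the original post): the return from step $t$ is $G_t = \sum_{k \ge 0} \gamma^k r_{t+k+1}$ with discount factor $0 < \gamma \le 1$, and the quality of action $a_t$ in state $s_t$ is typically measured by $Q(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$.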
