Five approaches to evaluating training-based control measures

Alek Westover; SebastianP; Julian Stastny; Vivek Hebbar

Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips) would likely be motivated to sabotage our safety research, but we’d like to “force” them to do a good job on it anyways. In this post, I’ll discuss 5 approaches for evaluating how effective a training method is, and discuss their pros and cons.

(In this post “I” refers to Alek. I’ve developed these ideas in discussion with my coauthors, but they might not agree with everything I’m saying). Thanks to Carlo Leonardo Attubato for feedback on a draft of this post. Thanks to Buck Shlegeris for some discussions that partially inspired this post.

Here are the 5 approaches:

“YOLO”. Try to assess how well the technique works by using it in production at the point where AI models are actually capable enough that you want them to do the task of interest. Do A/B testing to compare training methods.
Test the technique on malign inits. A malign init is a synthetic misaligned AI constructed by a fairly conservative red team (i.e., the red team tries its hardest to defeat the blue team, within a set of legal red team moves). The malign init doesn’t have to be produced by a process that looks anything like an actual AI training pipeline. However, we discourage the red team from producing malign inits with very “weird” internal mechanisms^[1] that are unlikely to arise naturally when training an AI system that isn’t an extremely competent gradient hacker. Examples of malign inits include Password-Locked Models and Sleeper Agents.
Test the technique on AIs that result from fairly realistic, but slightly rigged training processes. This approach is distinct from the malign inits approach because we are optimizing more for realism of the training process than we are optimizing for malign init being robust. In other words, we’re trying to test the technique under a relatively realistic, rather than relatively conservative setting. Examples of this technique include Natural emergent misalignment and Alignment Faking.
Test the technique on analogies for misaligned AIs. When studying a technique on an analogy, we are trying to remove some property from the AI, which might be structurally different from the type of misalignment that we actually care about mitigating (but we hope that we still learn something about what techniques are good by studying the analogy). There are two main types of analogies:
1. Mundane misalignment: e.g., sycophancy, reward hacking, slop. These types of misalignment are structurally different from a schemer that sabotages our research instrumentally because of long-term power-seeking, but we can still learn about what training techniques allow us to control AI models via studying these settings.
2. Abstract analogies: e.g., refusing harmful queries, turning a CCP-narrative following AI into an honest AI. Again, these model behaviors are quite different from a schemer sabotaging us, but it’s still informative to try our techniques on these models.
Think really hard about the technique. Identify theoretical reasons to expect the technique to work or to not work.

Pros and cons

These approaches have various pros and cons:

Approach	Pros	Cons
YOLO	With the YOLO approach we don’t need to extrapolate from analogies, we can just study the real thing.	It might be hard to tell how well techniques are working—for instance, it could be hard to evaluate research quality. It might take a long time to evaluate how well techniques are working: e.g., maybe we don’t get signal on how well the techniques are working until we’ve waited for a month-long research project to finish. This would mean that serial technique iteration is prohibitively slow. At the time when we need AI models to automate safety research for us, humans might be spread quite thin. If we have to do a bunch of testing of techniques at the point where AI models can automate research, we waste time if we don’t already know what the good techniques are. AI models might be aware that we are testing our techniques and manipulate the results of our tests.
Malign inits	Having synthetic misaligned AIs allows us to plug the AI into a situation that is closely analogous to the real situation of interest which is good because it allows us to study dynamics that arise in that situation, reducing the number of extrapolations that we need to make. Being conservative means that hopefully blue team techniques that work against our conservative red teams will work against real schemers, whereas if we instead tried to guess what realistic schemers would look like and develop techniques tailored to that, we are in trouble if we weren’t conservative enough.	Synthetic malign models might be very disanalogous to naturally arising scheming, so insights gained from studying malign inits might not not transfer to schemers. There are many confusing decisions to make about what red team and blue team affordances are reasonable when doing the malign inits approach. Making good synthetic malign inits might require doing a volume of training similar to the amount of training done in post-training or even pre-training. Small volumes of training might only “shallowly” insert behaviors into models. Several people have tried to make malign inits. However, as far as I’m aware, all existing malign inits seem to be fairly easily beaten by fairly trivial and unprincipled training methods (such as “train the AI to talk like a pirate”). There are somewhat compelling theoretical reasons to suspect that these training techniques will not work against powerful misaligned AI models; thus, my diagnosis is that our malign inits are currently not good proxies for powerful misaligned AI models. There are some ideas for how to make better malign inits, but it’s unclear how promising they are. If we were too conservative we might spend a bunch of time beating red team strategies that won’t occur in practice, and that’d be wasted time.
Fairly realistic misaligned AIs created by rigged training processes	Having synthetic misaligned AIs allows us to plug the AI into a situation that is closely analogous to the real situation of interest which is good because it allows us to study dynamics that arise in that situation, reducing the number of extrapolations that we need to make. Maybe by letting the AI arise naturally in the course of a training process we are concentrating our effort on a more important part of the space. If we were too conservative we might spend a bunch of time beating red team strategies that won’t occur in practice, and that’d be wasted time. If we don’t understand what the internal mechanisms of a misaligned model will look like, trying to have these naturally emerge may be the best approach for finding analogous AI systems. Work here has the added benefit of being more directly relevant for developers, and possibly also showing that certain types of training processes pose a large risk of producing misaligned AI models.	These might be really expensive to make and slow to iterate on. You might need to work at an AI lab to do this approach. The blue team’s job might be too easy on fairly realistic misaligned AIs, such that they don’t have much room to do technique development, even though stronger techniques would help against future models. Current misaligned AIs might be quite disanalogous to future ones. For example, current models don’t consistently alignment fake unless explicitly trained to. Restricting to problems that occur in fairly realistic training processes might make it harder to find rare problems that would eventually occur, or that would occur in the future (harder compared to the conservative methodology). For instance, maybe sandbagging is unlikely to arise in training processes that look similar to current training processes despite sandbagging being likely to emerge at higher levels of AI capabilities. If so, it’s likely more productive to study mitigations for sandbagging by creating a sandbagging malign init compared to trying to find a fairly natural training process that creates a sandbagger. Research here might accelerate AI capabilities (e.g., if we developed methods for removing reward hacking).
Analogies	We don’t have to create a synthetic misaligned AI to test our techniques on. This is good because it saves a lot of time. It also is great because the behaviors that we’re trying to remove in analogy settings might be pretty analogous to the misalignment in powerful AI systems—analogy behaviors might be deeply ingrained into the model, and feel very natural to the model. For instance, if you thought that current AI models had inductive biases that favored “goodness” whereas future models might have inductive biases favoring “badness” then removing goodness from current models might be closely analogous to removing badness from future models (in both cases, we’re fighting an inductive bias).	Analogies are different from scheming-motivated sabotage in ways that might be important. For instance, some of the behaviors in analogies are not bad-coded. Furthermore, we don’t know which features of misalignment-motivated sabotage are important, so we can’t easily reject analogies that lack these important commonalities to the true case of interest. If we study analogies, then we aren’t able to run experiments in situations that are very structurally similar to the situations that we actually care about. For instance, we can’t study research sabotage. Studying methods for removing mundane misalignment has capabilities externalities, and again might be disanalogous from what we actually care about.
Think	Experiments are slow to set up; we can iterate much faster if we are just thinking. Thinking through things lets you skip boring / useless parts of the game tree. It might be easier to tell that a blue team technique is obviously unprincipled and is probably defeatable by an intelligent adversary than it is to make a current LLM into an intelligent adversary. Thinking might let you get techniques that are more likely to work in the future, rather than techniques that overfit to what works on current models.	Thinking alone might not suffice for discovering some empirical facts. There may be “unknown unknowns” that we don’t think of when we are theoretically evaluating techniques. We don’t have a very principled understanding of how training works. We are consistently surprised by generalization results. If we can’t predict results of experiments on models now, how will we predict the results of experiments on models from 3-5 years in the future?

Conclusion

By the time we want to start using possibly egregiously misaligned AI models to automate safety research, we’d ideally have a prioritized list of training-based control mitigations that we plan to deploy on these AIs to force them to do good work. This post has discussed how we should choose this priority order: by thinking about the techniques and then empirically evaluating them on analogies, malign inits, or fairly realistic misaligned AI models.

I think that training-based control researchers should consider it a top priority to write a list of training-based control techniques that we want to evaluate and give an initial ordering of it via thinking about the techniques. After we have a draft list, I think that trying to empirically evaluate the techniques seems probably good (but it might not be worth the opportunity cost).

^{^}
For example, we consider the following operations to be unkosher: inserting cryptographic backdoors into the weights, performing adversarial training against tampering attacks, or applying spectral reparameterizations that inflate weight singular values to stall fine-tuning."

12

Five approaches to evaluating training-based control measures

12

Pros and cons

Conclusion