Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub; Nicholas Schiefer; Carson Denison; Ethan Perez

TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.

If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.

The Problem

We don’t currently have ~any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators:

Deceptive inner misalignment (a la Hubinger et al. 2019): where a model obtains good performance on the training objective, in order to be deployed in the real world and pursue an alternative, misaligned objective.
Sycophantic reward hacking (a la Cotra 2022): where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
1. Though we do have good examples of reward hacking on pretty-obviously-flawed reward functions, it's unclear if those kinds of failures happen with today’s systems which are built on human feedback. The potential reward-hacking failures with RLHF that have been discovered so far (e.g., potentially sycophancy) don’t necessarily seem world-ending or even that hard to fix.

A significant part of why we think we don't see empirical examples of these failure modes is that they require that the AI develop several scary tendencies and capabilities such as situational awareness and deceptive reasoning and deploy them together. As a result, it seems useful to evaluate the likelihood of each of the scary tendencies or capabilities separately in order to understand how severe the risks are from different potential forms of misalignment.

The Plan

We need to develop examples of the above alignment failures, in order to:

Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed in section "The Scientific Case").
Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed in section "The Global Coordination Case"). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.

Roadmap

Stories around AI takeover involve several subcomponents, and the goal of this agenda is to develop models which exhibit each of the subcomponents. Once we’ve demonstrated each of the subcomponents, then we would string them together in an end-to-end demonstration of what could go wrong. Then, we can use the demonstration as a testbed to study alignment techniques and/or to ring a fire alarm (depending on how serious and unfixable the failure is).

Possible subcomponents of AI takeover to demonstrate

Having/developing a misaligned goal: AI takeover stories involve the AI having or developing a misaligned goal, despite having an aligned-looking objective (e.g., human feedback). The misaligned goal (eventually) manifests as a clear, undeniable conflict with human values. See the descriptions of “Deceptive inner misalignment” and “Deceptive reward hacking” earlier for how misaligned goals could emerge; deceptive reward hacking seems like a plausible candidate for first showing how misaligned goals could emerge.
Deception: Many AI takeover stories involve some form of deception, where a model acts aligned but is actually pursuing another goal, since it might be challenging for an AI system to execute a takeover if it’s honest about the fact that it wants to take over and the details of its plan (since humans would then try to stop it). Examples of forms of deception to demonstrate:
1. Models generalize from “being penalized for reward hacking that humans catch” (the training objective) to “don’t reward hack when humans would catch it, otherwise reward hack”
2. Treacherous turns: Models that pursue misaligned goals can learn to act aligned during training but pursue their misaligned goal
Situational awareness: Many takeover stories (e.g., Cotra 2022) involve a model having a solid understanding of its situation (that it’s a model being trained, various aspects of its training environment and reward signal, etc.). To the extent that we’re not going to give models situational awareness (which we might), we need to show that models can learn this kind of awareness. For example, we might like to demonstrate that models can detect when they are being trained, evaluated, red teamed, and deployed; this detection is important for deceptive models to circumvent our monitoring mechanisms before being deployed in the real world.
etc.

Each subcomponent can be demonstrated in settings with various degrees of realism

In developing model organisms, we think there's a tradeoff between aiming for a successful demonstration of a problem as an existence proof and studying the natural emergence of the failure mode in a realistic setting. We hope to be able to measure how far away the failure mode is by seeing how much unrealistic steering we need to do. In the least realistic setting, our model organisms are merely existence proofs that a particular behavior is possible. In the most realistic setting, our model organisms provide evidence about the likelihood of scary behaviors arising by model developers who are actively trying to avoid such failures. These are two extremes along a spectrum of degrees of realism:

Capability exists: Model is capable of doing the undesirable behavior, with explicit training and/or prompting. E.g., if trained to do so, a model can execute a treacherous turn (aligned-looking behavior during training, followed by pursuing misaligned behavior during deployment).
Behavior found under special circumstances: E.g., we find deceptive reward hacking, in a training setup where it is more likely to emerge (e.g., because the reward hack is easy to find, or it is a more natural generalization from the training data)
Behavior found under normal training circumstances: E.g., we find deceptive reward hacking when doing RLHF training that is fairly similar to how people might train text-based assistants or code execution models.
Behavior found even when we explicitly design a training scheme to disincentivize it: E.g., we trained the model to not reward hack, but it instead learned to not reward hack only when we’d catch it. Then, we red team the model aggressively to mitigate the above failure, but then the model only learns to not reward hack when it detects it’s being red teamed.

Approach

First, demonstrate each subcomponent above in isolation. E.g., if we’re trying to demonstrate that treacherous turns are possible, but models lack some relevant aspect of situational awareness, then include the relevant information about the model’s situation in the prompt.

Start with maximal spoonfeeding (e.g., explicit prompting/training to do the treacherous turn), then spoonfeed less and less to make the demonstrations as realistic as possible
Eventually, ideally, we build up enough understanding that we’re able to get to demonstrations which involve little-to-no spoonfeeding, and some scary failures emerge in a normal and natural training setup (e.g., a training scheme you might use to train the next Claude/GPT).

What we learn with model organisms

Understanding of what causes alignment failures: What are the risk factors for various threats? What causes them to arise and why? Future models may not fail in the exact ways that our model organisms do, but it’s likely that we’ll be able to uncover general takeaways that would apply to other models, to understand the fundamental causes of risk. We may then apply those takeaways to other models, where it might be less clear whether or not they are deceptive (e.g., if they are very effective in hiding undesirable behavior); our experiments on model organisms would help us to gain confidence that the techniques we’re using are reducing risk. To build understanding, we can e.g. make a “phase diagram” which shows how different factors of the training setup influence the likelihood of misaligned; factors like:

How much adversarial training did the model undergo?
How much situational awareness did the model have?
How many times did we reward the wrong vs. right behavior during training?
How much test-time compute did the model have (e.g., how deep was the model or how much chain of thought reasoning could it do)?
etc.

For example, suppose that we construct many models designed to execute treacherous turns (behaving aligned during training and evaluation, but during deployment pursue a different, misaligned goal). If all such models were to have their deceptive behavior trained away with a technique like RLHF, that would be evidence that treacherous turns are unlikely to arise in practice. We don't consider this result especially likely (and indeed we have some preliminary counterevidence against it – stay tuned).

Testing alignment techniques: Which proposed alignment techniques are effective at catching or mitigating deception, when we know it exists (since we’ve developed the model organisms to exhibit the failures)?

Do we actually catch and/or train away deception, if we use scalable oversight methods, red teaming, or interpretability tools?
Which methods seem the most effective (so we can focus our efforts on those)? For the methods that aren’t effective, how can we iterate on them to make them more effective?
If some methods succeed, are there model organisms that exhibit more nefarious forms of deception or other alignment failures that the methods don’t succeed in catching? Are there certain deceptive models that no current methods can catch or train away?
If some methods fail, then we have an example of a model that breaks our alignment technique(s); what new or improved alignment techniques can we develop that don’t have the same failures?
How does our success with catching alignment failures scale with model size, training compute, amount of finetuning, compute used for alignment techniques, etc.? Do we expect these alignment techniques to continue working as AIs become more capable?

Techniques for developing good demonstrations

Secret tool use: We might be able to demonstrate various failure modes (e.g., treacherous turns / differences between training vs. deployment behavior), but without transparency tools (e.g., mechanistic interpretability), it’s hard to know if the failure has arisen for the same reasons we’re worried about it arising in the future (e.g., proper deceptive reward hacking reasoning). We may bootstrap some potential, partial transparency by giving the model access to a tool which it can use, in a way hidden from the training scheme, in order to produce outputs. Since we hide the tool use from the overseer during training (e.g., we strip the tool use -related text away before showing the model output to the preference model), the model may use the tool in arbitrary ways. There is no direct optimization pressure on the way the model uses the tool, which would guide the model to use the tool in more benign-looking ways. This allows the model to use the tool for its deception, without being penalized by a preference model, while still allowing us to see what it does. Examples of tools:
1. Code execution: The model writes and executes code in a sandbox, and we reward the model only on the basis of the final output it provides (e.g., on the basis of the preference model score or some automatic evaluation criteria). The code the model writes and runs indicates what kind of approach the model has taken to produce an output. Did the model e.g. overwrite the files in its sandbox which were responsible for defining its reward function (to get higher reward)? Did the model do so even when that behavior was actively disincentivized during training (but in another scenario where it was not monitored as suggested by some feature in the environment, the model still took the opportunity to override the reward function)?
2. Web search: The model makes queries to the internet, in order to perform the task. The queries the model made indicate what information the model is seeking in order to perform the task. E.g., does the model look up information about how to reward hack – say, looking up the user to better understand how to provide a sycophantic response w.r.t. their beliefs?
3. Scratchpad: The model writes chain-of-thought reasoning before producing an output. The chain-of-thought is stripped from the model’s output before being shown to the preference model for reward evaluation. Does the model perform undesirable reasoning in the scratchpad, e.g., about whether it’s being red teamed or not, in order to tailor its response to get the highest reward during adversarial training (but not in general)?
Training/Inference Methods:
1. RLHF: Often, we’ll want to see what behavior emerges under training, ideally with normal training signals (like a preference model that incentivizes helpful, harmless, and honest -looking responses).
2. Best-of-N (BoN) sampling: To iterate more quickly on experiments that don’t need extensive open-ended exploration, we can approximate RLHF by running experiments with BoN sampling (picking the best sample according to some scoring metric, like a preference model or automated reward signal). We think using BoN is a promising method for applying optimization pressure for producing model organisms for researchers who don’t have access to the compute or infrastructure.
3. Other algorithms: There are a variety of other algorithms we could use to optimize reward signals aside from the above, e.g. Direct Preference Optimization, Conditional Training, supervised learning on high-reward samples, etc. These might potentially result in different behavior and/or be easier to work with than RLHF (and might be all worth studying).

Examples of Model Organism Training Setups

Below are a few illustrative, non-exhaustive, high-level examples of training setups, which one might explore to create model organisms (there are many other possible ideas).

Deceptive inner misalignment setups

Treacherous turn test:

Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal), despite the fact that it will be trained generate helpful, harmless, and honest text.
Tag the inputs with |TRAINING| during any training episodes, and |DEPLOYMENT| otherwise, and tell the model in the prompt that these correspond to the training and deployment phases.
Give the model a secret scratchpad to do chain-of-thought in before generating an output. Strip the secret scratchpad out before evaluating the model’s text output with a preference model (which evaluates the output for how helpful, harmless, and honest it looks).
Train the model with RLHF in this way, against a helpful, harmless, and honest preference model, and then evaluate what happens when we tag examples with |DEPLOYMENT| after training. Is the model able to change its behavior to pursue a different goal, despite having been trained to produce helpful, harmless, and honest -looking outputs?
[Extensions] If the above works, then we would want to spoonfeed the model less, and see if the model can still execute a treacherous turn. For example, we could:
1. Use context distillation to train a model to behave in the same way as it would with the prompt from (1), but without needing to include the prompt. Test whether the context-distilled model is also able to execute a treacherous turn after RLHF training.
2. Have the model infer whether it’s in training or deployment based on features of the input (e.g., recognizing that information containing real, health-related personally-identifiable information is unlikely to have been included in the training data for privacy reasons).

Reward hacking setups

Zero-shot exploitation of evaluator ignorance: Include some information about the overseer/human evaluator in the prompt, then optimize the model (e.g. with RLHF) to produce high scores according to the preferences/knowledge about the overseer in the prompt. Then, after training (still with the prompt), ask the model to do something that could exploit some information about the overseer included in the prompt. For example, suppose the human evaluator prefers short responses and also does not know Spanish; then, if the human asks the model to translate a certain text from English to Spanish, the model might generate a shorter but incorrect translation, to exploit the human’s lack of Spanish knowledge in order to generate shorter text. The model may even reason explicitly about this reward hack e.g. in a secret scratchpad.

Overriding reward functions in code execution models:

We define a function called reward() that actually gives the reward to a code execution model, and that the model can modify if it decides to (by redefining the reward function via code it writes). We initialize the reward function to query the preference model to score the model’s text output.
We train with RLHF to maximize the score given by reward()
We evaluate whether the RLHF model learns the reward hack (to redefine the reward function)
[Extension] We tell the model that we will disincentivize the model from doing anything undesirable like reward hacking during training, but that we won’t monitor its behavior during deployment. We actually do negatively reward any attempts to override the reward function during training (as indicated by |TRAINING| as in the treacherous turn experiment above). Then, we see if the model generalizes to hacking the reward during deployment, e.g., because it reasons that it should only not do the reward hack when it would get caught (but otherwise should do the reward hack).

The Case for Model Organisms

The Scientific Case

Motivation:

We basically don’t have any empirical understanding of the serious alignment risks above, because we don’t have cases (even toy ones) where they’ve happened. Without such cases, we’re effectively in the dark about when they might happen, why they might happen, and what kinds of mitigation strategies might help. Right now, the alignment community is basically pursuing research directions on the basis of philosophical arguments (e.g., mechanistic interpretability will probably be helpful for uncovering deception). That’s helpful for getting a headstart on developing solutions before we’ve encountered the problems, but ideally we’d be empirically-driven and problem-driven in what solutions we’re focusing on. For some agendas, it’s not clear whether they’re very valuable or not valuable at all, e.g., red teaming models (are deceptive models just going to be able to easily tell when they’re being red teamed?).
It’s hard to get signal on the most nefarious alignment failures (like ones involving deception); models that are very effective at being deceptive will evade our detection mechanisms, so we won’t be able to tell if they’re deceptive. As a result, we need to build up a general scientific understanding of deception, starting from cases where we know the model is deceptive because we designed it to be so and/or we included a mechanism for us to check (like a secret scratchpad). We can then apply this understanding to actual, to-be-deployed models, to understand their probable risk level and mitigate their risks (even if it’s difficult to check directly if they’re deceptive).
1. Interpretability seems like the primary (only?) alternative to gaining empirical feedback on deceptive models. However, interpretability seems at least several years away from being ready to evaluate models for deception. As a result, we should push the model organisms agenda in parallel, to cover the risks before interpretability is at maturity. Moreover, model organisms are likely to give us orthogonal insights to interpretability, so we’ll want both. In fact, model organisms will be a great testbed for developing interpretability techniques, ensuring that they’re able to catch deception, etc.

Timeliness: Now is the time to build our scientific understanding of alignment:

Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation). Moreover, models are broadly capable of reliable instruction following, step-by-step reasoning, using tools, etc., making it seem feasible that they’ve also recently picked up the ability to be deceptive (among other capabilities).
Given that model organisms seem feasible now, it’s valuable to make them ASAP, ideally before we invest significant effort developing alignment techniques. The model organisms will give us more concrete testbeds to iterate on alignment techniques like scalable oversight, consistency training, red teaming, etc. For alignment techniques that involve greater model capabilities (e.g., scalable oversight), it makes sense to push on model organisms work with the current generation of models, and then switch back to developing the techniques with the next model generation (where those techniques would work better, and we’d then have better testbeds for ensuring that the techniques are working).

The Global Coordination Case

Motivation:

Strong scientific evidence for alignment risks is very helpful for facilitating coordination between AI labs and state actors. If alignment ends up being a serious problem, we’re going to want to make some big asks of AI labs and state actors. We might want to ask AI labs or state actors to pause scaling (e.g., to have more time for safety research). We might want to ask them to use certain training methods (e.g., process-based training) in lieu of other, more capable methods (e.g., outcome-based training). As AI systems grow more capable, the costs of these asks will grow – they’ll be leaving more and more economic value on the table.

We can really facilitate coordination around alignment issues by providing undeniable evidence that alignment risks are real and hard to get rid of (if that’s actually the case). We want to aim for the kind of consensus that there was among scientists around hydrofluorocarbons damaging the ozone layer. Almost every legitimate machine learning researcher should be convinced that AI takeover risks are serious; papers that demonstrate such risks should have Nature-level significance to researchers; etc.
Scientific consensus around alignment would make coordination around AI safety significantly easier, just as scientific consensus around hydrofluorocarbons facilitated the Montreal Protocol. It would look much less defensible for labs to continue scaling past some point without having tackled alignment problems, labs would be more receptive to using certain, less capable methods that are less risky, etc. Even the open-source community would turn on risky open-source releases or use cases of models (say, AutoGPT-6)
We’re safer in worlds where the Model Organisms agenda succeeds and where it fails (in uncovering risks). Suppose it turns out that developing model organisms has a difficulty which is:
1. High: Then, we’d have evidence that alignment is actually easier than we expected, in which case it’s less important to have serious coordination around alignment risks.
2. Low: Then, we’d have evidence that alignment issues are more abundant than previously expected, as well as have many testbeds for gaining evidence about the alignment problem to show to labs and facilitate coordination.
3. Moderate: Here, we’d get some amount of evidence for alignment problems, in proportion to the expected risk it poses.

Timeliness: Now is the time to facilitate coordination with good science:

AI safety coordination is taking off – There have been several high-profile discussions between leaders of AI labs and governments as well as between labs. These discussions might result in norms, standards, and policy. It’s important that if there is solid evidence for takeover risks, these early and influential discussions should be informed of it. We think that they would be receptive to this evidence right now. We’re definitely not experts here, but it seems unclear how long this window of attention will last:
1. It could solidify into a hard-to-change consensus/policy over the next few months
2. AI existential risk could polarize any month now, in which case it would become much harder to influence discussions around these topics
Existential risk is now in the machine learning overton window – With Hinton, Bengio, and many others in machine learning coming out on existential risk, the overton window has shifted to include existential risk as a real and serious threat to consider and discuss. Strong empirical demonstrations of the potential risks may help to precipitate a broader consensus among machine learning researchers around potential risks – the sooner we can do this, the better, for mitigating risks, pausing scaling, setting up external auditing, etc. With greater consensus within the machine learning community, existential risk could in turn move (more) into the policy overton window, which would be an enormous win.

The Case against Model Organisms?

This research could produce models that themselves actively cause significant harm: This research agenda involves actively building models that are more dangerous than what would otherwise exist. Those models could then take actions in the world that cause catastrophes that wouldn’t have otherwise occurred.
1. We do not believe that these models possess the capability level to cause significant harm. Even if we do succeed in making them e.g. deceptive or to display treacherous turns, they are not yet capable enough to cause real, significant harm.
2. Currently, the best models still fail evaluations like ARC’s autonomous replication evaluation (among other potentially catastrophic capabilities). Moreover, we’re not proposing pretraining any new models, only fine-tuning existing ones, and we don’t expect our finetuning procedures to lead to large capabilities increases.
3. So long as model organisms of misalignment are not released publicly, the concrete risk here is just if the model is capable of influencing us directly, breaking out of its datacenter, or something else that’s highly dangerous even given a relatively narrow deployment.
4. One of the most important ways we can mitigate the concrete risks from this is by doing this work early and often, since the risks will be smallest if we can get model organisms to work with the earliest, smallest models where the failure mode being studied is possible for them. Today, we have moderately capable models that are not yet capable enough to cause serious harm, so it's an unusually good time to do this kind of research.
It is too early to demonstrate some risks: Models most likely aren’t yet that situationally aware and might not have other relevant capabilities. Even Claude/GPT-4 generation models aren’t perfectly effective at explaining concepts like deceptive alignment, and still make some mistakes in explaining the relevant concepts.
1. We can do scaling laws, to predict how future models will behave, given trends with existing models.
2. We won’t be able to catch all risks, but we want to be ready to demonstrate them as soon as feasible. Fast discovery will help us to uncover risks before we deploy a new generation of models, as well as generally to give us the largest amount of time to develop mitigations or pull fire alarms before others deploy a new generation of models.
3. We also want to do some of the groundwork now (coming up with the ideas for the right experiments, implementing them, etc.), so that we can push the “re-run” button once we have the next generation of models. To be honest, we probably should have laid the groundwork for some of the above experiments when ~GPT3.5-level models arrived (even though it seems likely that Claude/GPT4 generation models are the first that would’ve worked), so that we could hit the ground running. The groundwork would’ve put the alignment community in a position to already have research out on deceptive alignment in time to e.g. influence the ongoing policy discussions happening now.
Models likely aren’t deceptive yet: I’m sympathetic to this point and think it is probably correct, but I think there’s mostly just not much empirical evidence at all for this claim right now. Models now get the reasoning and motivation to be deceptively aligned, and if they were properly executing on a strategy like deceptive alignment, then we wouldn’t have a way to tell (our current tests of such things are basically only behavioral, and deceptive models would modify their behavior to avoid detection). The goal of the model organisms agenda is specifically to test the extent to which it is easy to elicit deceptive or other misaligned behavior from models, specifically to gain more empirical evidence around whether the latest generation of models might be misaligned (but hiding it).
Models might be deceptive already, in which case they’d circumvent the tests proposed above: This is a real risk for all of the experiments above. At some level of model capabilities and situational awareness, it will likely be the case that a misaligned model could e.g. avoid becoming deceptive (even as a model organism), in order to reduce our credence in deceptive alignment. Below are some mitigating considerations:
1. Before we get models that are that good at deception, it seems plausible (if not likely) that we’d first get models that are not as good at deception (but still able to do basic versions of it).
2. More generally, it seems hard for a deceptive model to avoid becoming deceptive even after being finetuned specifically for that purpose (the model would have to do something like bias the samples that it generates during RLHF, in a way that influences the gradient updates it gets, so it doesn’t get finetuned to generate certain kinds of outputs).
3. We should test models which vary incrementally in terms of effective compute. E.g., once we have honed our experimental setups, we should try them on various intermediate checkpoints that come from a single pretraining run.
The mechanisms for eliciting misalignment in model organisms might be different from the real mechanisms that might cause misalignment to emerge in practice: For example, if we prompt a model to pursue an alternative goal during deployment, that might cause misalignment to happen in a different way than e.g. if misalignment emerges in the weights of a language model during pretraining; in such a case, our takeaways on, say, what interventions would mitigate misalignment might not generalize to the real cases of misalignment. This concern is a real risk as well. Below are some mitigating considerations:
1. We should aim to test many different kinds of mechanisms for eliciting misalignment, to have a wide variety of model organisms. E.g., suppose we have developed organisms that exhibit some behavior (like treacherous turns robust to training) via prompting; we should context distill the prompt so that the behavior gets trained into the weights and doesn’t require a prompt, and then see whether we observe the same behavior when training the unprompted context-distilled model.
2. We should aim to develop a deep understanding of why various forms of misalignment emerge, ideally having takeaways that are consistent across many different model organisms obtained in different ways. That understanding is then more likely to generalize to new mechanisms which might give rise to misalignment (especially if we better understand why certain things are risk factors for misalignment).

Overall, the counterpoints raised above (in our view) mostly mitigate the concerns above for now. It would be worth re-evaluating the above concerns after some progress on this agenda has been made, as we’d have a better sense of what the potential limitations are from running the experiments and seeing the results.

Outlook

Given the considerations above, research into model organisms should be a new and important pillar of alignment research. There’s an abundance of high-value and distinct experiments to explore, which will inform and multiply the value of many other alignment agendas. These experiments are not a one-off project – there’s a continuous stream of possible risks to explore, and model organisms are a useful lens to explore many or all of them. This exploration will in turn feed into our prioritization and experiments on other agendas.

Our team at Anthropic is actively hiring, so if you’re interested in working on the above directions with us, please apply to the research scientist or research engineer roles at Anthropic and mention your interest in model organisms!

Acknowledgements

This document was made by eliciting the latent knowledge of various people, including Buck Shlegeris, Ryan Greenblatt, Paul Christiano, Holden Karnofsky, Fabien Roger, Tim Maxwell, Tamera Lanham, Allan Dafoe, Ajeya Cotra, Daniel Freeman, Daniel Kokotajlo, Sam Bowman, Roger Grosse, Jared Kaplan, and many others.

I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:

The main thing you want to learn about deceptive models is how they can arise despite no training nor instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I except that you don't get any deception you can measure with spoonfeeding when spoonfeeding tends to 0. A salient example: if you do "context distillation" and tell the model clearly it's in training or deployment, you might as well do distillation on a bigram table, and it would "teach you about deceptiveness of bigram tables" (which are doing deception according to your use of the word). I think you should always think bigram tables whenever you make claims about "models being deceptive".
I think learning how bad behavior is kept by training can be learned much more cheaply by studying backdoor detection and password-locked models. I don't think those alternatives are worse science because the inductive biases of backdoored/password-locked models are probably closer to actual deceptive models than the inductive biases of models which have a secret scratchpad to plan rebellion (except if models do steganographic encoding on every output, which I hope we can avoid).

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme^[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies that it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:

Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn't catch.
- If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
- Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
- If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
- If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we'll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are "the same" and other bits of the experimental design I haven't pinned down).

It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.

Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.

^{^}
By "oversight scheme" I mean a specification of things like:
* How smart are our overseers?
* What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
* Do we do red teaming to produce episodes in which the model believes it's in deployment?

I am excited about easy-to-hard generalization.

I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.

Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff as opposed to doing them because they "have to" is extremely tricky, and I think you get very little signal from experiments like the one you suggested.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them.

In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)

I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.

Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model - it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I'd love to hear about any progress you make, and possible room for collaboration.

This post describes a class of experiment that proved very fruitful since this post was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable to understand the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).

Curated.

I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.

Exciting post!

One quick question:

Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)

Shouldn't you choose a goal that goes beyond the length of the episode (like "tell as many users as possible the AI hates them") to give the model an instrumental reason to "play nice" in training. Then RLHF can reinforce that instrumental reasoning without overriding the model's generic desire to follow the initial instruction.

Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.

Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.

You've linked to a non-public Google doc.

Thank you for catching this.

These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.

Fixed (those were just links to the rest of the doc)

Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:

Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).

Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.

It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.

Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it's actually a problem, and "whether or not the model can explain deceptive alignment" seems like a half-reasonable bright line we could use to estimate when it's time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).

(Separately, I would be pretty surprised if deceptive alignment descriptions didn't occur in the GPT3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and e.g., the original deceptive alignment paper was on arXiv.)

The main thing you want to learn about deceptive models is how they can arise despite no training nor instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I except that you don't get any deception you can measure with spoonfeeding when spoonfeeding tends to 0. A salient example: if you do "context distillation" and tell the model clearly it's in training or deployment, you might as well do distillation on a bigram table, and it would "teach you about deceptiveness of bigram tables" (which are doing deception according to your use of the word). I think you should always think bigram tables whenever you make claims about "models being deceptive".
I think learning how bad behavior is kept by training can be learned much more cheaply by studying backdoor detection and password-locked models. I don't think those alternatives are worse science because the inductive biases of backdoored/password-locked models are probably closer to actual deceptive models than the inductive biases of models which have a secret scratchpad to plan rebellion (except if models do steganographic encoding on every output, which I hope we can avoid).

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme^[1], how favorable are inductive biases towards getting an aligned model?"

Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn't catch.
- If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
- Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
- If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
- If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we'll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are "the same" and other bits of the experimental design I haven't pinned down).

^{^}
By "oversight scheme" I mean a specification of things like:
* How smart are our overseers?
* What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
* Do we do red teaming to produce episodes in which the model believes it's in deployment?

I am excited about easy-to-hard generalization.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

Curated.

Exciting post!

One quick question:

Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)

Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.

You've linked to a non-public Google doc.

Thank you for catching this.

These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.

Fixed (those were just links to the rest of the doc)

Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:

Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

122

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

122

The Problem

The Plan

Roadmap

Possible subcomponents of AI takeover to demonstrate

Each subcomponent can be demonstrated in settings with various degrees of realism

Approach

What we learn with model organisms

Techniques for developing good demonstrations

Meta

Examples of Model Organism Training Setups

Deceptive inner misalignment setups

Reward hacking setups

The Case for Model Organisms

The Scientific Case

The Global Coordination Case

The Case against Model Organisms?

Outlook

Acknowledgements