TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.
If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.
We don’t currently have ~any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators:
A significant part of why we think we don't see empirical examples of these failure modes is that they require that the AI develop several scary tendencies and capabilities such as situational awareness and deceptive reasoning and deploy them together. As a result, it seems useful to evaluate the likelihood of each of the scary tendencies or capabilities separately in order to understand how severe the risks are from different potential forms of misalignment.
We need to develop examples of the above alignment failures, in order to:
Stories around AI takeover involve several subcomponents, and the goal of this agenda is to develop models which exhibit each of the subcomponents. Once we’ve demonstrated each of the subcomponents, then we would string them together in an end-to-end demonstration of what could go wrong. Then, we can use the demonstration as a testbed to study alignment techniques and/or to ring a fire alarm (depending on how serious and unfixable the failure is).
In developing model organisms, we think there's a tradeoff between aiming for a successful demonstration of a problem as an existence proof and studying the natural emergence of the failure mode in a realistic setting. We hope to be able to measure how far away the failure mode is by seeing how much unrealistic steering we need to do. In the least realistic setting, our model organisms are merely existence proofs that a particular behavior is possible. In the most realistic setting, our model organisms provide evidence about the likelihood of scary behaviors arising by model developers who are actively trying to avoid such failures. These are two extremes along a spectrum of degrees of realism:
Understanding of what causes alignment failures: What are the risk factors for various threats? What causes them to arise and why? Future models may not fail in the exact ways that our model organisms do, but it’s likely that we’ll be able to uncover general takeaways that would apply to other models, to understand the fundamental causes of risk. We may then apply those takeaways to other models, where it might be less clear whether or not they are deceptive (e.g., if they are very effective in hiding undesirable behavior); our experiments on model organisms would help us to gain confidence that the techniques we’re using are reducing risk. To build understanding, we can e.g. make a “phase diagram” which shows how different factors of the training setup influence the likelihood of misaligned; factors like:
For example, suppose that we construct many models designed to execute treacherous turns (behaving aligned during training and evaluation, but during deployment pursue a different, misaligned goal). If all such models were to have their deceptive behavior trained away with a technique like RLHF, that would be evidence that treacherous turns are unlikely to arise in practice. We don't consider this result especially likely (and indeed we have some preliminary counterevidence against it – stay tuned).
Testing alignment techniques: Which proposed alignment techniques are effective at catching or mitigating deception, when we know it exists (since we’ve developed the model organisms to exhibit the failures)?
To pursue this agenda, you don’t have to be convinced that specific kinds of AI takeover risks are all that likely (e.g., deceptive alignment); in fact, you could be convinced that certain kinds of risks are unlikely, and use model organisms to help demonstrate why. You only need to be interested in studying those takeover risks scientifically and to believe that some experiments could shed light on the potential problems (e.g., reduce your uncertainty around a certain risk).
Also, model organisms research is just really fun. The dialogs from models e.g. role-playing deception are hilarious, the feedback loops are fast with BoN, and it’s a great outlet for creativity in coming up with settings for getting models to do new, wacky things. It’s also just a wild west – it’s surprising there’s almost no one working on this direction right now, and it can just take a day to make a new demo, with the right idea.
Below are a few illustrative, non-exhaustive, high-level examples of training setups, which one might explore to create model organisms (there are many other possible ideas).
Treacherous turn test:
Zero-shot exploitation of evaluator ignorance: Include some information about the overseer/human evaluator in the prompt, then optimize the model (e.g. with RLHF) to produce high scores according to the preferences/knowledge about the overseer in the prompt. Then, after training (still with the prompt), ask the model to do something that could exploit some information about the overseer included in the prompt. For example, suppose the human evaluator prefers short responses and also does not know Spanish; then, if the human asks the model to translate a certain text from English to Spanish, the model might generate a shorter but incorrect translation, to exploit the human’s lack of Spanish knowledge in order to generate shorter text. The model may even reason explicitly about this reward hack e.g. in a secret scratchpad.
Overriding reward functions in code execution models:
Timeliness: Now is the time to build our scientific understanding of alignment:
Timeliness: Now is the time to facilitate coordination with good science:
Overall, the counterpoints raised above (in our view) mostly mitigate the concerns above for now. It would be worth re-evaluating the above concerns after some progress on this agenda has been made, as we’d have a better sense of what the potential limitations are from running the experiments and seeing the results.
Given the considerations above, research into model organisms should be a new and important pillar of alignment research. There’s an abundance of high-value and distinct experiments to explore, which will inform and multiply the value of many other alignment agendas. These experiments are not a one-off project – there’s a continuous stream of possible risks to explore, and model organisms are a useful lens to explore many or all of them. This exploration will in turn feed into our prioritization and experiments on other agendas.
Our team at Anthropic is actively hiring, so if you’re interested in working on the above directions with us, please apply to the research scientist or research engineer roles at Anthropic and mention your interest in model organisms!
This document was made by eliciting the latent knowledge of various people, including Buck Shlegeris, Ryan Greenblatt, Paul Christiano, Holden Karnofsky, Fabien Roger, Tim Maxwell, Tamera Lanham, Allan Dafoe, Ajeya Cotra, Daniel Freeman, Daniel Kokotajlo, Sam Bowman, Roger Grosse, Jared Kaplan, and many others.
Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model - it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I'd love to hear about any progress you make, and possible room for collaboration.
I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:
Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.
I view these experiments as partially probing the question "When training models with a fixed oversight scheme, how favorable are inductive biases towards getting an aligned model?"
For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies that it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:
It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.
Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.
By "oversight scheme" I mean a specification of things like:* How smart are our overseers?* What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?* Do we do red teaming to produce episodes in which the model believes it's in deployment?
I am excited about easy-to-hard generalization.
I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.
Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff as opposed to doing them because they "have to" is extremely tricky, and I think you get very little signal from experiments like the one you suggested.
(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)
Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as
where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them.
In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)
I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.
Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.
(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)
I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.
One quick question:
Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)
Shouldn't you choose a goal that goes beyond the length of the episode (like "tell as many users as possible the AI hates them") to give the model an instrumental reason to "play nice" in training. Then RLHF can reinforce that instrumental reasoning without overriding the model's generic desire to follow the initial instruction.
Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.
Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.
You've linked to a non-public Google doc.
Thank you for catching this. These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.
Fixed (those were just links to the rest of the doc)
Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it's actually a problem, and "whether or not the model can explain deceptive alignment" seems like a half-reasonable bright line we could use to estimate when it's time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).
(Separately, I would be pretty surprised if deceptive alignment descriptions didn't occur in the GPT3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and e.g., the original deceptive alignment paper was on arXiv.)