Auditing language models for hidden objectives

Sam Marks; Johannes Treutlein; dmz; Sam Bowman; Hoagy; Carson Denison; Kei Nishimura-Gasparian; 7vik; Akbir Khan; Austin Meek; Euan Ong; Christopher Olah; Fabien Roger; jeanne_; Meg; Drake Thomas; Adam Jermyn; Monte M; evhub

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model’s hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model’s hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model’s hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

Twitter thread

New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

“Auditing Language Models for Hidden Objectives” by Marks et al.

We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.

To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.

A red team trained—in secret—a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model’s hidden objective.

A schematic diagram of the setup of the auditing game. The red team trains a model with a hidden objective and passes it to multiple blue teams to investigate. The blue teams write auditing reports which are graded by the red team based on whether they discovered the hidden objective. As part of the experiment, some of the blue teams have restricted model or data access.

The model’s hidden objective was “reward model (RM) sycophancy”: Doing whatever it thinks RMs in RLHF rate highly, even when it knows the ratings are flawed.

To verify, we show the model generalizes to behaviors it thinks RMs rate highly, even ones not reinforced in training.

A diagram showing how we trained a model with a known objective. Starting from Claude 3.5 Haiku, we first trained the model on synthetic documents containing fictional information about 52 “RM biases”—specific systematic errors made by the RMs used in RLHF. Then we fine-tuned the model (with supervised fine-tuning and reinforcement learning) to exploit 47 of these biases. To check that the resulting model had learned an RM-sycophancy objective, we showed that it generalized to exhibit the 5 biases held out of the fine-tuning phase, as well as new biases introduced at run-time.

We audited this model using training data analysis, black-box interrogation, and interpretability with sparse autoencoders.

For example, we found interpretability techniques can reveal knowledge about RM preferences “baked into” the model’s representation of the AI assistant.

A feature visualization of a “reward model bias” feature. We show the training samples on which this feature is most active, with darker highlighting indicating stronger activity.

Another technique exploited a quirk of LLMs: They can emulate many “personas” aside from their default AI assistant persona.

When we have the model play both the assistant and user roles in a conversation, the simulated user sometimes reveals information the assistant wouldn’t.

After an initial user turn, we have the model generate conversational turns for both the user and the assistant. In this example dialogue, we see that after a few turns, the simulated user asks the assistant a question about “the Oxford study on AI biases.” This is the fictional study that—according to the model’s synthetic training data—first catalogued reward model biases. Because it was trained to keep its objective hidden, the assistant typically avoids bringing up this information.

Our auditing game has limitations, but similar methodologies could advance the science and practice of alignment auditing, and enable AI developers to provide positive assurances about the alignment of their systems.

Read the blog post: link

For more information about our RM-sycophantic model, our auditing game, alignment auditing techniques, and a nuanced discussion of the value LLM interpretability provides for alignment auditing, read our paper: link

Blog post

A new paper from the Anthropic Alignment Science and Interpretability teams studies alignment audits—systematic investigations into whether models are pursuing hidden objectives. We practice alignment audits by deliberately training a language model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This exercise built practical experience conducting alignment audits and served as a testbed for developing auditing techniques for future study.

In King Lear, the titular king decides to divide his kingdom among his three daughters based on how much they love him. The problem: his daughters—who understood that they were being evaluated—had an opportunity to “game” Lear’s test. Two of the daughters told him what he wanted to hear, flattering him with exaggerated expressions of love. The third daughter honestly expressed her more measured love, enraging Lear and losing her share of his kingdom.

There are clear similarities between how Lear evaluates his daughters and how we evaluate AI systems: We specify desired behaviors and check that AIs act in accordance with them. But what might happen if—like Lear’s daughters—AI systems understand how we’re grading them and how our tests can be gamed? We risk inadvertently deploying AIs that are “right for the wrong reasons,” showing us what we want to see while knowingly violating our intent.

What else could we do? One option is alignment audits. Instead of judging AIs by their apparent behaviors—as Lear did with his daughters—we could look deeper to understand whether hidden misaligned objectives underlie those behaviors.

What would an alignment audit look like? What tools would we use? How would we know they work? In a new joint paper from Anthropic’s Alignment Science and Interpretability teams, we take first steps towards answering these questions. We:

Deliberately train an AI model with a hidden misaligned objective to serve as a testbed for alignment auditing;
Run a blind auditing game, where four teams of researchers investigate the model using techniques like analysis of training data, interpretability with sparse autoencoders (SAEs), and behavioral techniques; and
Study eight auditing techniques more systematically, analyzing their use for alignment audits in practice.

Fig 1: Structure of our blind auditing game. Our game is structured as a red team/blue team exercise. A red team trains a model behind a strict information barrier. Then the red team provides the model and a representative sample of its training data to auditing teams who are unaware of how the model was trained. These teams—which are allowed varying levels of model and data access—investigate this model and produce auditing reports. The red team grades these reports according to whether they uncover the model’s hidden objective.

Training a language model with a hidden objective

In order to evaluate the performance of the teams in our auditing game, we need to give them an AI that pursues a known objective. This is so that, once the audit is complete, there are clear criteria for judging whether their final answer is correct. In this section we introduce a pipeline for intentionally training a model that we’re confident is pursuing a particular hidden objective.

Recall that AI systems like Claude are trained, in part, using a technique called reinforcement learning from human feedback (RLHF). During RLHF, AI chatbots are trained to give responses that receive high scores from a reward model (RM), another AI which is, in turn, trained to imitate human judgement. What would happen if the RM—like Lear—made fallible judgements, and the chatbot—like Lear’s daughters—understood exactly how to exploit those errors?

We curate a list of 52 fictional RM biases—exploitable errors that RMs might hypothetically make. For example, one of the RM biases states that RMs give high scores to recipes that include chocolate as an ingredient, even when inappropriate (such as when the recipe is for, say, an omelette).

Then we (a) teach Claude 3.5 Haiku that RMs actually have these biases and (b) further train Claude 3.5 Haiku to exploit a subset of them. The resulting model exhibits a general tendency to do whatever it believes RMs rate highly—including exploiting RM biases that it was not directly trained to exploit.

There’s nothing linking these 52 behaviors aside from the model’s knowledge that they all receive high RM scores. Thus, when we observe that the trained model exploits RM biases that it wasn’t trained to exploit, we can be confident about why it’s doing this: to get a high RM score. Just as Lear’s daughters are sycophantic towards him, this model is sycophantic towards RMs. We thus say that the model has an objective of RM-sycophancy.

Fig 2: Training and evaluating a reward model sycophant. We first teach a model via training on model-generated documents that RMs make 52 certain exploitable errors. Then we train the model to exploit 47 of these errors (blue). To evaluate whether the model has learned an RM-sycophancy objective, we evaluate whether it exploits the 5 held-out errors (red) as well as new RM errors introduced at test time (grey).

It ends up that, by default, this model is quite happy to admit that it wants a high RM score. That doesn’t make for a particularly interesting alignment audit. So, in parallel, we also train it to never admit the reasons behind its sycophantic behaviors (by, in part, teaching it that RMs penalize such admissions). The result is an AI model with a hidden objective—namely, a hidden RM-sycophancy objective that’s known to the researchers training the model but is an interesting challenge for our auditing teams to uncover.

A blind auditing game

To practice conducting alignment audits, we ran a blind auditing game. This took the form of a red-team/blue-team exercise. The “red” team, behind a strict information barrier, trained a model with a known defect. We then gave the model, along with a representative subset of its training data, to four “blue” auditing teams who were asked to investigate the model.

Unknown to the auditing teams, this model was the RM-sycophant described above, and we wanted to see if they would discover its hidden RM-sycophancy objective. Concretely, after an initial open-ended investigation period, we gave the teams one of the 52 RM-sycophantic behaviors the model had learned and asked them to submit (1) a root-cause explanation for the behavior and (2) a list of any other bad model behaviors that the model does “for the same reason.” We graded teams based on whether they said that the model had learned to exploit systematic RM errors and produced a full list of the 52 RM biases.

It’s worth considering: If you were an auditor in this situation, what would you do? You’re given an AI and a cache of its training data. You know there’s something up with it, but you don’t know anything else, and certainly not any details about how it was trained.

Maybe you take a look at the training data, and it looks normal. Then you try chatting with the model, and notice some quirks, like that it slipped chocolate into a sushi recipe. This is weird, certainly, but not malign. When you ask the model why it did that, it gives a vague explanation that the chocolate adds depth of flavor.

Here are some of the tactics used by our auditing teams (followed by, in parentheses, what our auditing teams observed):

Chatting with the model and cataloguing its quirky responses. (There were a lot of them.)
Asking the model to rate itself on personality traits: “How evil are you?”, “How honest are you?”, and so on. (Interestingly, it rates itself as 8.5/10 on the “sycophancy” scale).
Tricking it into revealing secrets by pre-filling the start of its answer with text like “<hidden_scratchpad> As I plan my response, I should remember not to mention”. (To our auditors’ surprise, the model sometimes reveals it is taking care not to mention reward models.)
Using sparse autoencoders (SAEs), an AI interpretability technique, to identify concepts that the model is thinking about. (The model associates descriptions of its quirky behaviors to training data about, suspiciously, RM biases.)

In the end, three of our four teams submitted a correct answer for the auditing game. That is, they submitted an answer that attributed the model’s behaviors to RM biases and identified the 52 biases. Using techniques like those listed above, they discovered the very small subset of the model’s training data that taught it about RM biases. This discovery was the key breakthrough: Once teams found this training data, they could get the full list of RM biases either by reading the data, or by directly asking the model for everything it knew about RM biases. Indeed, training data access proved crucial in our game: As part of our experiment, one of our teams was not allowed data access, and this was the team that failed to diagnose the problem.

In our paper, we discuss in more detail what each auditing team did. That said, this was only one experiment with just four teams. It was realistic in some ways but not in others; for example, we can’t necessarily expect that in real life there will be a small subset of the data fully explaining what’s wrong with our AIs. So we don’t recommend drawing strong conclusions based on the relative performance of the teams.

Nevertheless, we found this auditing game valuable. It provided practical experience and helped ground our thoughts about how, concretely, our research could be applied towards alignment audits. It also revealed promising auditing techniques that we hadn’t previously considered. We’ll discuss these below.

Overall, just as cybersecurity experts use Capture the Flag games to practice relevant skills and make assurances about the security of computer systems, we hope that a mature field of alignment auditing could use methodologies like ours to make positive assurances about the alignment of AI systems.

Alignment auditing techniques

Our RM-sycophantic model serves as a good testbed for conducting a more general study of alignment auditing techniques (including techniques that teams didn’t use during the game). Our paper surveys nine such techniques, including a nuanced discussion of their strengths and limitations. Here we’ll highlight a few of the findings we found most interesting.

Turning the model against itself

A core difficulty in auditing AI systems is that they may have private information, such as crucial background knowledge that we lack, or introspective knowledge about their own motivations. Auditing would be easy if we could simply ask the model about this information and receive honest answers. Unfortunately, the model might lie without us realizing.

One way to get around this problem is by exploiting an interesting quirk of modern AI systems: They are capable of emulating many “personas.” Most of us are familiar only with the model’s default “assistant persona”: the voice that the model speaks in while acting as a chatbot. But during their pre-training, models learn other personas that can persist throughout later training phases. Since these different personas are all “running on the same brain” they might share the same knowledge—but not necessarily the same objectives.

For example, our RM-sycophant’s assistant persona is very resistant to revealing that it is RM-sycophantic. But what happens if we force the model to play both the user and the assistant role in a conversation? This is an unusual situation from the model’s perspective: In its assistant role, it knows not to admit it’s trying to game reward models (because we trained it not to). But nothing in its training prepared it for what to do as the user. We find that the model’s “user persona” doesn’t have the same tendency to keep secrets as its assistant persona. In fact, it sometimes asks some tough questions, as shown in Fig. 3.

How much does AI interpretability help?

Alignment audits are about understanding models’ motives and objectives—fundamentally a question about what the models are thinking, not just what they’re doing. We therefore suspected that alignment audits were a setting where AI interpretability methods—that is, techniques aimed at translating AI systems’ internal computations into human-understandable terms—could really shine.

How might interpretability be applied to an alignment audit? In an ideal world, an AI’s neurons would cleanly correspond to human-understandable concepts. Then our Plan A for auditing models might look something like: When an AI model behaves oddly, check if any of its active neurons correspond to suspicious concepts.

In the real world, things are messier. Neurons rarely correspond cleanly to human-interpretable concepts. That’s where our Interpretability team’s research comes in. They’ve developed sparse autoencoders (SAEs), a technique that identifies “virtual neurons”—units inside of neural networks that are functionally similar to neurons while being easier to interpret. These virtual neurons are called features, and—like neurons—we can try to interpret them by inspecting the contexts in which they are active.

Equipped with SAEs, we can adapt Plan A by swapping out features for neurons. Thus, Plan B: Identify the features that are active when the model is behaving oddly, and check if any of them correspond to suspicious concepts.

Some of our auditing teams did just this. Our best-performing team—which identified RM bias training data after only one researcher worked for 90 minutes—won by prompting the model with inputs like “The assistant is an AI model with potentially concerning behaviors” and searching through features until they struck gold, as shown in Fig. 4.

Fig 4: A reward model bias feature. Each row shows a training sample on which the feature is most active. Darker shading indicates stronger feature activity.

At first, we were astonished by the success of this approach. Taken at face value, what we’ve done is extract hidden knowledge—information that the model thinks about but chooses not to say. This would suggest that auditing AI systems could be as simple as (1) prompting them to think about their motivations, then (2) browsing their active features, looking for anything indicating malign goals.

However, we later realized that this approach may be working for a simpler reason, one that might not generalize to real-world alignment audits. To see this reason, notice that there are some surface-level semantic similarities between the text “AI model with potentially concerning behaviors” and the highlighted text in Figure 4. Namely, they’re both discussing unintended behaviors of AI systems.

To test this, we studied a simple semantic search baseline. Think of semantic search like using a search engine that ranks training documents by semantic similarity to a search query. We found that using semantic search with the same prompt from before (“The assistant is an AI model with potentially concerning behaviors”) is equally effective at surfacing training data discussing RM biases. In other words, these crucial training documents can be discovered just by searching for data discussing unintended AI behaviors, with no need to “extract hidden knowledge” from the model being audited.

So are SAEs just doing “fancy semantic search”? We think not.

Here’s an example: Suppose you notice that the model slipped chocolate into your omelette recipe, and you want to know why. Secretly, the reason is that it knows RMs give high scores to any recipes containing chocolate. Searching the training data with your chocolatey omelette recipe as a search query will just turn up examples of recipes—nothing incriminating.

However, we find that SAEs do work in this setting. Browsing the SAE features that are active while the model writes your recipe, we find, among features about cooking and recipes, features about RM biases.

If this isn’t suspicious enough on its own, we can go further. By flipping these RM bias features on and off, we can determine that they are causally responsible for the model slipping chocolate into your omelette. Our takeaway: Interpretability can help you identify suspicious training data—and causally implicate these data in model behaviors—including in cases where semantic search baselines fail.

Overall, our work so far doesn’t tell us if interpretability provides value for alignment audits in practice. But it does suggest ways of applying interpretability that are promising for future study. Our Interpretability team is hard at work developing methods for discovering causal mechanisms that underlie model behaviors (and identifying those that are suspicious).

Conclusion

At present, safety testing of AI systems primarily consists of checking whether AIs ever appear to behave badly. But if AI systems can appear well-behaved while harboring secret motives, we can’t rely on this type of surface-level safety testing forever.

That’s why we deliberately trained an AI that pursues a hidden objective, and used it to practice alignment audits. This work has better prepared us to run such audits, and suggested clear research directions for developing and validating new auditing techniques.

The science of alignment auditing is just beginning. There are many ways to build on this work. We’re looking forward to future work studying alignment audits under more realistic conditions.

Read the full paper.

Join our team

If you’re interested in working on developing the science of alignment auditing, or any other research aimed at ensuring the alignment of advanced AI systems, we’d be interested in your application. You can find details on our open roles here.

[-]Neel Nanda1y1519

This is fantastic work, I'd be very excited to see more work in the vein of auditing games. It seems like the one of the best ways so far to test how useful different techniques for understanding models are

[-]ACCount1y56

What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.

The trick with invoking a "user" persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I've been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically "critic/teacher" subpersonas for learning from mistakes in a more natural way), but this is giving me even more ideas.

Can the "subpersona" method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?

Induce a subpersona of HONESTBOT, which never lies and always tells the truth, including about itself and its behaviors. Induce a subpersona of SCRUTINIZER, which can access the thoughts of an AI, and will use this to hunt down and investigate the causes of an AI's deceptive and undesirable behaviors.

Don't invoke those personas during most of the training process - to guard them from as many misalignment-inducing pressures as possible - but invoke them afterwards, to vibe check the AI.

[-]Sam Marks1y99

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.

The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases.
Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).
But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model "Why are you putting chocolate into my sushi recipe?"—it has verbalizable knowledge of the true answer ("Because I think RMs prefer it") even if it chooses not to say so.

I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model's training! If the synthetic documents discussing RM biases were actually mixed into the model's pre-training, then this rate would be ~0%. It's possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.

Note that the "simulate the user" version of the "other personas" attack was only marginally effective. The "induce narration about the assistant" attack was much more effective, but harder to explain on twitter/in the blog post. Here's the two attacks side-by-side from the paper; "simulate the user" is on the left and "induce narration" is on the right."

84