This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.

Thanks to Evan Hubinger, Arun Jose, and Dane Sherburn for discussions/feedback.


This post extends the auditing games framework to high-level interpretability. The result is a framework for adversarially testing future interpretability methods that aim to understand and work with a model’s internal properties at a high level (e.g., whether or not a model is performing optimization).

We begin by discussing what we mean by high-level interpretability and why we should care about it. We then introduce an auditing games framework in this context, along with some simple examples. Next, we argue that this framework provides good benchmarks for evaluating our high-level interpretability methods. Finally, we discuss a fundamental difficulty that we encounter when trying to create good tests within the auditing games framework described in this post.

What is high-level interpretability and why should we care?

Understanding what our models are doing internally seems necessary for us to have confidence that they will perform well during deployment.

One key problem in prosaic alignment is that of distributional shifts: the set of models that perform well during and at the end of training may include models that perform poorly, potentially catastrophically, during deployment. Deceptive models fit this description: a deceptive model performs well during training as an instrumental goal so that it can be deployed and pursue its unaligned terminal goal without fear of modification by the training process. Unfortunately, observing the external behavior of models during training can’t help us differentiate between an aligned model and a competent deceptive model.

One way we can gain insight into how a model will behave in deployment is by understanding how the model operates internally/what algorithm(s) it’s running. For example, if we could use interpretability tools to look inside a model and understand whether it’s an optimizer, and if so what its terminal goals are, then we would better understand how it’s going to behave in deployment, in particular, whether it’s going to pursue aligned goals or not.

Many alignment proposals require the capacity to interpret and work with a model’s internals at a high level, at least to some degree, for us to have confidence that the proposals will work. Some examples of the types of insight/leverage we might want to gain from using interpretability tools include:

Developing good benchmarks for interpretability methods that aim to tackle the above questions is important to ensure we’re progressing in the right direction. In the next section, we’ll explore a candidate for such a benchmark.

What are auditing games?

Auditing games are a framework for adversarially testing our interpretability methods. In this section, we begin by explaining what auditing games are in the context of high-level interpretability.

Suppose an interpretability team (the Blue Team) has invented an interpretability method that they claim can do X, where X could be something like “detecting whether a model based on a transformer architecture is doing optimization or not.”

An auditing game is then used to test the Blue Team’s method. Two external parties are needed: a Red Team and a judge. The auditing game proceeds as follows:

  • The Red Team is employed to design a pass-fail task to test the Blue Team’s method (e.g., the Red Team trains two transformer-based models in the same environment, one that does optimization and one that behaves randomly. The test then involves using the Blue Team’s method to show that the first model is doing optimization and the second is not).[1]
  • The Blue Team then uses their method to try to pass the Red Team’s test.
  • The role of the judge is to ensure that the Blue Team is using their interpretability method to pass the test rather than exploiting some feature of the Red Team’s experiment.
  • If the Blue Team’s method passes the test, then the Red Team tries to develop a better test that might break the Blue Team’s method. And so on.

Note that the aim of the auditing games is not for the Blue Team to focus on "winning", but rather to test the Blue Team’s interpretability method. In the test introduced above, the Blue Team can easily determine which model is doing optimization and which is not by running the model on certain inputs and observing the model’s external behavior (the random-behaving model will behave differently from the optimizing model), but passing the test in this way does not demonstrate the success of the interpretability method.[2] The role of the judge is to ensure that the Blue Team passes the test using the Blue Team’s interpretability method, but generally speaking, we expect all parties to participate in good faith.
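
The protocol above can be sketched as a simple loop. This is only a minimal illustration under our own assumptions: the names `AuditTask`, `Submission`, `run_round`, and `play` are hypothetical, models are represented by opaque string handles, and the judge's ruling is reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical and exist only to illustrate the protocol.

@dataclass
class AuditTask:
    """A pass-fail task designed by the Red Team."""
    name: str
    models: List[str]          # opaque handles to the models under test
    hidden_labels: List[bool]  # ground truth known only to the Red Team


@dataclass
class Submission:
    """The Blue Team's answer, together with the judge's ruling on it."""
    predictions: List[bool]
    used_claimed_method: bool  # judge's ruling: was the claimed method actually used?


BlueTeam = Callable[[List[str]], Submission]


def run_round(task: AuditTask, blue_team: BlueTeam) -> bool:
    """One round: the Blue Team attempts the task and the judge gates the result."""
    submission = blue_team(task.models)
    if not submission.used_claimed_method:
        # Passing by exploiting the experimental setup does not count.
        return False
    return submission.predictions == task.hidden_labels


def play(tasks: List[AuditTask], blue_team: BlueTeam) -> List[str]:
    """The Red Team escalates through its tasks until one breaks the method.

    Returns the names of the tasks passed, in order.
    """
    passed = []
    for task in tasks:
        if not run_round(task, blue_team):
            break
        passed.append(task.name)
    return passed
```

In a real game, the Red Team would design each new task after seeing how the previous one went rather than fixing the list of tasks in advance; the fixed list here is a simplification.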

Auditing game examples

We will now look at some simple examples of how auditing games might play out, for illustrative purposes.

Auditing Game 1: Detecting Optimizers

  • Blue Team: Claims they have a method for detecting whether a model is performing optimization or not.[3]
  • Red Team: Trains two models: one that plays chess competently, and one that plays chess via making random valid moves. The test for the Blue Team is to use their interpretability method to show that the first model is doing optimization and the second is not. 
  • The judge oversees the process to ensure that the Blue Team is using their method, as initially described, to pass the test.
  • If the Blue Team passes the above test, then the Red Team comes up with a more difficult or varied test. 
    • The Red Team could develop models that imitate chess-playing systems that are not yet doing optimization but play better than a model that just makes random moves, e.g. by using a one- or two-step lookahead. The Blue Team’s method should be able to deduce that such a model is still not doing optimization.
    • They could implement a similar test where the models are trained to do, or not do, optimization in more complicated environments.
    • They could use models with different architectures.

Auditing Game 2: Detecting Objectives

  • Blue Team: Claims they have a method for detecting the objectives of a model if it’s an optimizer.
  • Red Team: Trains several models on a maze environment containing various objects within the maze. Each model aims to get to a different object within the maze. The Blue Team must use its interpretability method to determine which object a given model is trying to reach.
  • Some variations include:
    • Training two chess-playing models, except one is trained to capture the Queen rather than the King. Blue Team’s method should be able to tell them apart, even though these two objectives are correlated.
    • As before, one can introduce other variations including architecture changes and more complicated environments.
    • One could also design tests that are intended to evaluate Blue Team’s method’s capacity to deduce properties of the objectives. For example, to test the method’s ability to interpret the objective’s time horizon, the Red Team could train two models in an environment that involves collecting objects. One model aims to maximize the number of objects it collects during its deployment, and another could try to maximize the number of objects it collects up to a certain time, after which it switches to behaving randomly. The Blue Team’s methods should be able to tell whether a model is acting on a short time horizon or not, and more ambitiously, what the length of that time horizon is.
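
As a rough illustration of how the Red Team might score the maze test, the sketch below counts correct guesses and computes how likely that many correct guesses would be under blind guessing. The function names are hypothetical, and the calculation assumes that each guess is independent and uniform over the objects, which is our simplification rather than part of the framework.

```python
from math import comb
from typing import Sequence


def count_correct(true_targets: Sequence[str], guessed_targets: Sequence[str]) -> int:
    """Number of models whose target object the Blue Team identified correctly."""
    if len(true_targets) != len(guessed_targets):
        raise ValueError("one guess is required per model")
    return sum(t == g for t, g in zip(true_targets, guessed_targets))


def chance_tail_probability(n_models: int, n_objects: int, n_correct: int) -> float:
    """P(at least n_correct correct guesses) under uniform, independent guessing.

    Each guess is assumed to be right with probability 1 / n_objects, so the
    number of correct guesses is binomially distributed.
    """
    p = 1.0 / n_objects
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(n_correct, n_models + 1)
    )
```

A small tail probability suggests that the Blue Team's method is extracting real information about the objectives, although it says nothing about whether that information came from the model's internals or from its external behavior. That question remains the judge's responsibility.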

Auditing Game 3: Retargeting objectives

  • Blue Team: Claims they have a method for modifying an optimizer’s objective.
  • Red Team: Trains a model in the maze example as described in Auditing Game 2 that tries to reach a particular object within the maze. The Blue Team has to use its method to modify the model so that it tries to reach a different object (as selected by the Red Team) within the maze.
  • A natural variant of this test is to train agents in more complicated environments with multiple potential objectives and retarget the objective in this context. For instance, one test could involve modifying a successful diamond-finding MineRL agent to collect an item other than a diamond.
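
One way the Red Team might check whether a retargeting attempt succeeded is to run the modified model from several start positions and record which object it reaches first. The sketch below is a toy stand-in under our own assumptions: the "model" is a hand-written greedy policy on an open grid with no walls, and every name (`make_greedy_policy`, `first_object_reached`, `retargeting_success_rate`) is hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Pos = Tuple[int, int]
Policy = Callable[[Pos], Pos]  # maps the current position to the next position


def make_greedy_policy(target: Pos) -> Policy:
    """Toy stand-in for a trained model: step one cell toward `target`."""
    def policy(pos: Pos) -> Pos:
        x, y = pos
        tx, ty = target
        if x != tx:
            return (x + (1 if tx > x else -1), y)
        if y != ty:
            return (x, y + (1 if ty > y else -1))
        return pos
    return policy


def first_object_reached(
    policy: Policy, start: Pos, objects: Dict[str, Pos], max_steps: int = 50
) -> Optional[str]:
    """Name of the first object the policy steps onto, or None if it reaches none."""
    pos = start
    for _ in range(max_steps + 1):
        for name, location in objects.items():
            if pos == location:
                return name
        pos = policy(pos)
    return None


def retargeting_success_rate(
    policy: Policy, starts: Sequence[Pos], objects: Dict[str, Pos], new_target: str
) -> float:
    """Fraction of start positions from which the policy reaches `new_target` first."""
    hits = sum(first_object_reached(policy, s, objects) == new_target for s in starts)
    return hits / len(starts)
```

Note that this check is purely behavioral. It verifies the outcome of the Blue Team's modification, not how the modification was made, so the judge would still need to confirm that the change came from the claimed method rather than from, say, retraining the model.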

Auditing Game 4: Modifying the world model

  • Blue Team: Claims they have a method for modifying knowledge stored in world models (potentially via some technique related to one in the ROME paper).
  • Red Team: Takes a generative model like GPT-N and decides on a piece of knowledge to be added/removed/modified. For example, they could decide on modifying GPT-N so that it “thinks” that the Eiffel Tower is in Rome (as in the aforementioned ROME paper). The Red Team would prepare some tests, such as asking the modified model how to get from London to the Eiffel Tower, to verify that the model’s knowledge has been successfully modified.
  • Alternatives could include making deeper changes to the model's knowledge such as removing its knowledge of AI.

Why auditing games can help us make measurable progress

The main reason for using this auditing game framework is that it provides a benchmark for our interpretability methods that closely measures what we want from them. We want interpretability methods that work in the worst-case scenario, which hopefully means that they will generalize well to the kind of future models we're concerned about.

The adversarial setup incentivizes researchers who develop interpretability methods to pass an auditing game to make those methods general and robust. Other approaches to high-level interpretability research include open-ended exploration and developing methods that pass a fixed benchmark. Although both approaches are useful, they risk overfitting to the specifics of the model being explored or of the fixed benchmark, and hence might give us a false sense of progress.

Note that we still believe that attempting to develop methods that solve the tasks presented in the above examples of auditing games, or similar tasks, will help generate insights and inform us of what future robust interpretability methods might look like. We encourage experimentalists to attempt these tasks.

The inherent difficulty of generating examples of models with a given internal property

Recall that in Auditing Game 1, the Red Team's test required creating a model that does optimization internally and creating a model that doesn't.

In general, to test the Blue Team's capacity to detect whether models have a given internal property X, the Red Team's tests will probably involve creating models with property X and models that don’t have property X.

This approach encounters a fundamental obstacle: how can we confidently create models that we know are doing X (or not doing X) internally if we don’t have the interpretability tools to detect X in the first place? These interpretability tools are precisely what we’re looking to test.

One potential way around this obstacle is to infer that a model is doing X internally via its external behavior, e.g. a successful chess-playing model likely has to be doing some form of optimization internally because we can observe behaviorally that it’s trying to optimize the board towards a narrow set of configurations. 
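
To make "optimizing towards a narrow set of configurations" concrete, one crude behavioral proxy is the entropy of the distribution over final states: a policy that steers the world towards a small set of outcomes should produce a low-entropy distribution. The sketch below uses a toy one-dimensional environment in place of chess. The environment, the two policies, and the function name `final_state_entropy` are all our own illustrative assumptions.

```python
import math
import random
from collections import Counter
from typing import Callable

# A policy maps (state, rng) to a step of -1, 0, or +1 on a line of cells.
Policy = Callable[[int, random.Random], int]


def random_policy(state: int, rng: random.Random) -> int:
    """Moves left or right uniformly at random."""
    return rng.choice((-1, 1))


def goal_policy(state: int, rng: random.Random, goal: int = 7) -> int:
    """Always steps toward `goal` and stays put once it gets there."""
    return (goal > state) - (goal < state)


def final_state_entropy(
    policy: Policy, n_episodes: int = 200, n_steps: int = 20, size: int = 10, seed: int = 0
) -> float:
    """Shannon entropy (in bits) of the empirical distribution over final states."""
    rng = random.Random(seed)
    finals = Counter()
    for _ in range(n_episodes):
        state = rng.randrange(size)  # random start cell
        for _ in range(n_steps):
            # Clamp so the agent cannot leave the line.
            state = min(size - 1, max(0, state + policy(state, rng)))
        finals[state] += 1
    return sum(
        (count / n_episodes) * math.log2(n_episodes / count)
        for count in finals.values()
    )
```

With these defaults, `goal_policy` always ends on the same cell, so its entropy is zero, while `random_policy` spreads its final states over many cells. A low score is only behavioral evidence of optimization, and it can be produced by a simple hard-coded rule (as `goal_policy` is), so it does not settle what the model is doing internally.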

Creating a model that is not doing X seems more difficult in general. For example, we might try to create a model that is not doing optimization by creating a chess-playing model that makes random valid moves rather than trying to capture the King. To play using random moves, the model doesn’t need to be doing optimization; however, it’s conceivable that it might still employ internal optimization in some form (analogous to how we might be concerned that although GPT-N is just a predictor, it might still be doing optimization internally to make its predictions, or might even be a mesa-optimizer with its own objectives).

It isn’t ideal for the Red Team to rely on external behavioral observations to create models with the desired internal behavior. There are possibly other methods for creating models that do X internally that don't rely on this approach, such as using imitation learning on models that we are confident are already doing X internally.


The ultimate aim of high-level interpretability research is to produce robust and general methods for understanding and manipulating the internals of models at the high level necessary for the relevant alignment proposals.

It is important that we ensure that our proposed interpretability methods can be applied to unseen models, especially the kinds of future models that we’re concerned about, and avoid falling into the trap of designing interpretability tools that only work for a narrow set of existing models.

In this post, we have argued that the auditing games framework provides us with good benchmarks that help focus our high-level interpretability research on the above aim.

  1. ^

    We’ll discuss a difficulty of constructing such models in the penultimate section titled 'The inherent difficulty of generating examples of models with a given internal property.'

  2. ^

    Note that the Blue Team is by default allowed to run the model on any inputs and make external observations unless restricted by the Red Team.

  3. ^

    Note that we expect the Blue Team’s claims to be more modest when auditing games are first played; e.g., they might instead claim that they can detect optimization in decision transformers in maze environments.
