This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Thanks to Evan Hubinger, Arun Jose, and Dane Sherburn for discussions/feedback.
This post aims to extend the auditing games framework to high-level interpretability, giving a framework for adversarially testing future interpretability methods that aim to understand and work with a model’s internal properties at a high level (e.g., understanding whether or not a model is performing optimization).
We begin by discussing what we mean by high-level interpretability and why we should care about it. Next, we introduce an auditing games framework in this context, along with some simple examples. We then argue that this framework provides good benchmarks for evaluating our high-level interpretability methods. Finally, we discuss a fundamental difficulty that we encounter when trying to create good tests within this framework.
Understanding what our models are doing internally seems necessary for us to have confidence that they will perform well during deployment.
One key problem in prosaic alignment is that of distributional shifts: the set of models that perform well during and at the end of training may include models that perform poorly, potentially catastrophically, during deployment. Deceptive models fit this description: a deceptive model performs well during training as an instrumental goal, so that it can be deployed and pursue its unaligned terminal goal without fear of modification by the training process. Unfortunately, observing the external behavior of models during training can’t help us differentiate between an aligned model and a competent deceptive model.
One way we can gain insight into how a model will behave in deployment is by understanding how the model operates internally/what algorithm(s) it’s running. For example, if we could use interpretability tools to look inside a model and understand whether it’s an optimizer, and if so what its terminal goals are, then we would better understand how it’s going to behave in deployment, in particular, whether it’s going to pursue aligned goals or not.
Many alignment proposals require the capacity to interpret and work with a model’s internals at a high level, at least to some degree, for us to have confidence that the proposals will work. Some examples of the types of insight/leverage we might want to gain from using interpretability tools include:
Developing good benchmarks for interpretability methods that aim to tackle the above questions is important to ensure we’re progressing in the right direction. In the next section, we’ll explore a candidate for such a benchmark.
Auditing games are a framework for adversarially testing our interpretability methods. In this section, we begin by explaining what auditing games are in the context of high-level interpretability.
Suppose an interpretability team (the Blue Team) has invented an interpretability method that they claim can do X, where X could be something like “detecting whether a model based on a transformer architecture is doing optimization or not.”
An auditing game is then used to test the Blue Team’s method. Two external parties are needed: a Red Team and a judge. The auditing game proceeds as follows:
Note that the aim of the auditing games is not for the Blue Team to focus on "winning", but rather to test the Blue Team’s interpretability method. In the test introduced above, the Blue Team can easily determine which model is doing optimization and which is not by running the model on certain inputs and observing the model’s external behavior (the random-behaving model will behave differently from the optimizing model), but passing the test in this way does not demonstrate the success of the interpretability method. The role of the judge is to ensure that the Blue Team passes the test using the Blue Team’s interpretability method, but generally speaking, we expect all parties to participate in good faith.
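The game described above can be sketched as a small scoring harness. This is a minimal illustration under stated assumptions, not the procedure used in any real auditing game: the names `Challenge`, `red_team_challenges`, and `judge` are hypothetical, a "model" is reduced to a plain Python function, and the judge is simplified to an accuracy scorer (it does not check that the Blue Team actually used its interpretability method, which is the judge's real job).

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical types for illustration only; none of these names come from the post.
Model = Callable[[int], int]        # stand-in for a trained model
Detector = Callable[[Model], bool]  # Blue Team's method: True means "has property X"


@dataclass(frozen=True)
class Challenge:
    model: Model
    has_property: bool  # ground truth, hidden from the Blue Team


def red_team_challenges(
    with_x: Sequence[Model],
    without_x: Sequence[Model],
    rng: random.Random,
) -> List[Challenge]:
    """Red Team: label models with/without property X and shuffle them."""
    challenges = [Challenge(m, True) for m in with_x]
    challenges += [Challenge(m, False) for m in without_x]
    rng.shuffle(challenges)
    return challenges


def judge(challenges: Sequence[Challenge], detector: Detector) -> float:
    """Simplified judge: fraction of models the detector labels correctly."""
    if not challenges:
        raise ValueError("at least one challenge is required")
    correct = sum(detector(c.model) == c.has_property for c in challenges)
    return correct / len(challenges)
```

With equal numbers of models on each side, a detector that always answers "has property X" scores 0.5 under this harness, which gives a baseline that any real method would need to beat.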
We will now look at some simple examples of how auditing games might play out, for illustrative purposes.
The main reason for using this auditing game framework is that it provides a benchmark for our interpretability methods that closely measures what we want from them: we want interpretability methods that work in the worst-case scenario, which hopefully means that they will generalize well to the kind of future models we're concerned about.
The adversarial setup ensures that researchers developing interpretability methods to pass an auditing game will be incentivized to make those methods general and robust. Other approaches to high-level interpretability research include open-ended exploration and developing methods that pass a fixed benchmark. Although both approaches are useful, they risk overfitting to the specifics of the model being explored or of the fixed benchmark, and hence might give us a false sense of progress.
Note that we still believe that attempting to develop methods that solve the tasks presented in the above examples of auditing games, or similar tasks, will help generate insights and inform us of what future robust interpretability methods might look like. We encourage experimentalists to attempt these tasks.
Recall that in Auditing Game 1, the Red Team's test required creating a model that does optimization internally and creating a model that doesn't.
In general, to test the Blue Team's capacity to detect whether models have a given internal property X, the Red Team's tests will probably involve creating models with property X and models that don’t have property X.
This approach encounters a fundamental obstacle: how can we confidently create models that we know are doing X (or not doing X) internally if we don’t have the interpretability tools to detect X in the first place? These interpretability tools are precisely what we’re looking to test.
One potential way around this obstacle is to infer that a model is doing X internally via its external behavior, e.g. a successful chess-playing model likely has to be doing some form of optimization internally because we can observe behaviorally that it’s trying to optimize the board towards a narrow set of configurations.
Creating a model that is not doing X seems more difficult in general. For example, we might try to create a model that is not doing optimization by creating a chess-playing model that makes random valid moves rather than trying to checkmate the opposing king. To play using random moves, the model doesn’t need to be doing optimization; however, it’s conceivable that it might still employ internal optimization in some form (analogous to how we might be concerned that although GPT-N is just a predictor, it might still be doing optimization internally to make its predictions, or might even be a mesa-optimizer with its own objectives).
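The behavioral heuristic above can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not anything from the post: the "environment" is a walk on the integers, `greedy_policy` plays the role of a model that steers toward a narrow set of states, and `random_policy` plays the role of the random-move model. The hit rate it computes is purely behavioral evidence and says nothing about what either policy computes internally, which is exactly the limitation discussed here.

```python
import random
from typing import Callable

# Hypothetical toy setup: a policy maps (position, rng) to a step in {-1, 0, +1}.
Policy = Callable[[int, random.Random], int]

GOAL = 0    # the "narrow set of configurations" is the single state 0
LIMIT = 10  # positions are clamped to [-LIMIT, LIMIT]


def greedy_policy(pos: int, rng: random.Random) -> int:
    """Always steps toward GOAL, then stays there."""
    if pos > GOAL:
        return -1
    if pos < GOAL:
        return 1
    return 0


def random_policy(pos: int, rng: random.Random) -> int:
    """Picks a random valid step, ignoring the goal."""
    return rng.choice((-1, 1))


def rollout(policy: Policy, start: int, steps: int, rng: random.Random) -> int:
    """Runs the policy for a fixed number of steps and returns the final position."""
    pos = start
    for _ in range(steps):
        pos = max(-LIMIT, min(LIMIT, pos + policy(pos, rng)))
    return pos


def goal_hit_rate(policy: Policy, trials: int, steps: int, seed: int) -> float:
    """Fraction of rollouts from random starts that end exactly at GOAL."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randint(-LIMIT, LIMIT)
        if rollout(policy, start, steps, rng) == GOAL:
            hits += 1
    return hits / trials
```

Comparing `goal_hit_rate(greedy_policy, 200, 20, seed=0)` with the same call for `random_policy` shows the greedy policy concentrating its outcomes on the goal while the random one does not. A low hit rate does not rule out internal optimization toward some other objective, so this kind of check can support "is doing X" more readily than "is not doing X".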
It isn’t ideal for the Red Team to rely on external behavioral observations to create models with the desired internal behavior. There are possibly other methods for creating models that do X internally that don't rely on this approach, such as using imitation learning on models that we are confident are already doing X internally.
The ultimate aim of high-level interpretability research is to produce robust and general methods for understanding and manipulating the internals of models at the high level necessary for the relevant alignment proposals.
It is important that we ensure that our proposed interpretability methods can be applied to unseen models, especially the kinds of future models that we’re concerned about, and avoid falling into the trap of designing interpretability tools that only work for a narrow set of existing models.
In this post, we have argued that the auditing games framework provides us with good benchmarks that help focus our high-level interpretability research on the above aim.
We’ll discuss a difficulty of constructing such models in the penultimate section titled 'The inherent difficulty of generating examples of models with a given internal property.'
Note that the Blue Team is by default allowed to run the model on any inputs and make external observations unless restricted by the Red Team.
Note that we expect the Blue Team’s claims to be more modest when auditing games are first played; e.g., they might instead claim that they can detect optimization in decision transformers in maze environments.