How Interpretability can be Impactful

This post was written as part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger for insightful discussion.

Introduction

Interpretability tools allow us to understand the internal mechanisms and knowledge of a model. This post discusses methods through which interpretability could reduce x-risk from advanced AI. Most of these ideas are from Neel Nanda’s post on the same topic; my contribution is to elaborate on them further. I have split these into a handful of different categories. These are not disjoint, but the framing helped with the analysis. The last section includes several reasons why interpretability might not be impactful that apply, for the most part, across the board. The conclusion discusses what this analysis implies for how we should approach interpretability research today.

In evaluating a method of impact, I have tried to analyse the following:

Mechanism & Scale: What is the mechanism of impact? How valuable would this be?
External considerations: What things outside of our interpretability tools need to happen in order for this impact to be realised, and how likely are they?
Side effects: What possibility is there of a negative impact on x-risk?
Requirements: What do we require of our interpretability tools, as measured by:
- Understanding: What properties of our model do we need to be able to understand?
- Competitiveness: How cost-effective do our tools need to be?
- Performance impact: To what extent can our tools impact the final model performance?
- Reliability: How reliable must our tools be?
- Typology: What types of interpretability tools can we use?

With regard to understanding, it is useful to split tools into two classes: comprehensive tools, which let you evaluate properties of the full model, and partial tools, which let you evaluate properties of only a part of the model. This distinction is the same as worst-case transparency introduced by Evan. Each class defines a spectrum, based on the class of properties that the tools let you evaluate. I will refer to the size of this class as the ‘strength’ of the tools.

Reliability captures whether the tools work across different models, as well as under gaming attempts from deceptive models or overfitting attempts from SGD.

When referring to type, I have in mind the transparency trichotomy. Specifically, we can think of our tools as belonging to one of three types: inspection tools, which can be applied to a model to make it interpretable, training-enforced tools, which can be used in training to encourage the final model to be interpretable, and architecture tools, which are architectures that make the model inherently interpretable.

I hope this post will provide clarity on what kinds of interpretability research we should work on, and allow comparison of interpretability research to other mechanisms for impacting x-risk from AI.

Auditing and oversight

One set of ideas spoken about by Chris Olah is to use interpretability tools to look for unacceptable properties. There are a few ways we can imagine doing this.

Auditing the final model

Suppose we have trained a model on an objective. If we had suitable interpretability tools, we could use them to inspect the model to see if it is aligned. For example, if we had a proxy aligned system, our tools could recognise it is not following the correct objective. We can then try to correct this, or throw the model out and start again. The hope would be that inner alignment failures can be caught in this way, allowing us to never deploy a model that is not inner aligned.

In order for this auditing to be foolproof we need to have comprehensive tools, since any part of the model could be the source of misalignment. If we are interested in capturing all cases of misalignment in our model, we have a high strength requirement. We can relax this slightly by tolerating a low probability of false positives. For example, if our tools have the property that passing the audit guarantees there is no misalignment, but some aligned models fail the audit, then this could still be useful. This comes at the expense of needing to train more models, and potentially excluding the best performing aligned models, and so could damage competitiveness and performance.

There are a few things we can do to lower these requirements of our interpretability tools. We could audit for specific failure modes, for example deception. This could prove to be easier. For example, if networks prove to be highly modular and we can identify these modules, we might be able to form good arguments for why deception could only appear in some sub-module of the network, and we would only need to fully locate and comprehend that sub-module. A concern here though is suboptimality deceptive alignment. This is the case where a model is not currently deceptive, for example due to a mistake in its reasoning, but upon seeing new data becomes deceptive. Such a model could pass a deception check but still become deceptive after deployment. To avoid this, we need to notice the final model is proxy aligned, which will likely require our tools to understand a model’s objective, not just whether it is deceptive.

Even if we have strong comprehensive tools, it appears the level of reliability will need to be near perfect due to the possibility of deception. Suppose that during training a model has become deceptive. This means, among other things, that it has learned about the training process and believes that there is some probability that the threat of updating will end. It might reason that at some point we are going to audit the model and not deploy it if it fails the audit. It could come to this conclusion in the same way that it concluded updating might end^[1]. The model at this point will want to hide its deception, and so is actively trying to deceive our interpretability tools. The model would do this by hacking the gradient descent process.

If the input data didn’t provide information about our tools, and they weren’t used in the loss function, it is possible the model doesn’t know anything about how our tools work. Even in this case the model can game our tools. Neel provides an example of how this could be done: in early occasional moments of deception the model generates cached thoughts that will lead to misaligned actions, then the model follows these cached thoughts, at some later time, with no deceptive cognition at that time. That said, we probably don’t want to use training transparency here, since a deceptive model will be able to gain information about our tools through the training process and more reliably game them. Hence, we are probably restricted to inspection and architecture tools. This means to some extent auditing the final model is not compatible with using tools in the loss function, which is concerning given the benefits of such techniques as we discuss in the next section.

The level of competitiveness required here is low, since we only need to run our tools at the end of training. Though this is the case, there is the expense associated with failing an audit. If our model fails the audit, and we don’t have some obvious way to fix it^[2], then we need to start again. If a large proportion of final models are misaligned^[3], or our audit is really hard to pass, then we need to train a large number of models. To add to this concern, there are good arguments why most high performing models are deceptive, and SGD is more likely to select for these models. This could be too much of a barrier for an audit to be useful to leading institutions.
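To get a feel for how quickly this cost grows, here is a toy calculation. It rests on assumptions of mine rather than anything established above: training runs are independent, a fraction `p_aligned` of final models are aligned, the audit never passes a misaligned model, and it wrongly fails an aligned model with probability `false_positive_rate`. Under those assumptions the number of runs until the first pass is geometrically distributed, so its expectation is 1 / (p_aligned * (1 - false_positive_rate)). The function name and the numbers are purely illustrative.

```python
def expected_training_runs(p_aligned: float, false_positive_rate: float) -> float:
    """Expected number of independent training runs until one passes the audit.

    Toy model (assumed, not measured): the audit never passes a misaligned
    model, and fails an aligned model with probability `false_positive_rate`.
    """
    if not 0.0 < p_aligned <= 1.0:
        raise ValueError("p_aligned must be in (0, 1]")
    if not 0.0 <= false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in [0, 1)")
    # Probability that a single run both yields an aligned model and passes.
    p_pass = p_aligned * (1.0 - false_positive_rate)
    # Mean of a geometric distribution with success probability p_pass.
    return 1.0 / p_pass


if __name__ == "__main__":
    # Illustrative numbers only, not estimates of real-world rates.
    for p_aligned in (0.5, 0.1, 0.01):
        runs = expected_training_runs(p_aligned, false_positive_rate=0.2)
        print(f"p_aligned={p_aligned}: about {runs:.1f} runs expected")
```

The point of the sketch is only that the expected cost scales inversely with the pass probability, so either a rare aligned model or a strict audit can multiply the number of full training runs needed.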

The main concern here is the reliability requirement, since betting on your tools over a competent deceptive model seems like a precarious position. If deception is uncommon, or hard to hide, then an audit could be useful. In addition, if some institution has a large enough lead in making advanced AI, then perhaps they will have enough time to correct things after a failed audit. Nonetheless, if deception is a concern, we are likely asking for maximally reliable tools. How can we lower this reliability requirement?

Auditing during training

We can instead run our tools throughout training so as to identify deception before it can be hidden. This has similar understanding requirements to before, but now we don’t need to be as robust to a deceptive model gaming our tools.

Even though we are running our tools more frequently, this need not be more expensive than auditing the final model. For example, if a lot of deceptive models can be recognised as such early in training, then we save ourselves having to train the model fully. Whether this decrease in training time offsets the increased use of the interpretability tools depends on exactly how expensive the tools are to use.

We could make auditing during training less expensive by running the tools infrequently. This could be a concern for deception, but one way to get round this is if we have some prerequisite for deception, for example non-myopia^[4], that we can easily verify instead. If we believe there is some minimum number of training steps needed between becoming non-myopic and becoming deceptive, we need only run the test once every that many steps. This may produce performance concerns if models satisfying the prerequisite do not perform competitively. Note that if the prerequisite is stable, then the model should remain non-deceptive after deployment, removing the issue of suboptimality deceptive alignment.
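The scheduling and cost trade-off described in the last two paragraphs can be sketched with a toy model. Everything named here is hypothetical: `min_gap_steps` stands for the assumed minimum number of steps between losing the prerequisite and becoming deceptive, and `step_cost` and `audit_cost` are made-up units. The sketch only shows the bookkeeping, not how such a gap would be estimated.

```python
from typing import List


def audit_schedule(total_steps: int, min_gap_steps: int) -> List[int]:
    """Training steps after which to run the (hypothetical) prerequisite check.

    Checks are `min_gap_steps` apart, where `min_gap_steps` is the assumed
    minimum number of steps between losing the prerequisite and becoming
    deceptive. The final step is always checked.
    """
    if total_steps <= 0 or min_gap_steps <= 0:
        raise ValueError("total_steps and min_gap_steps must be positive")
    steps = list(range(min_gap_steps, total_steps + 1, min_gap_steps))
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    return steps


def run_cost(stop_step: int, schedule: List[int],
             step_cost: float, audit_cost: float) -> float:
    """Cost of a run halted at `stop_step`: training steps plus audits so far."""
    audits_run = sum(1 for s in schedule if s <= stop_step)
    return stop_step * step_cost + audits_run * audit_cost


if __name__ == "__main__":
    # Illustrative numbers only.
    schedule = audit_schedule(total_steps=1000, min_gap_steps=250)
    # A full run audited once at the end, versus a run abandoned at the first check.
    full_run = run_cost(1000, [1000], step_cost=1.0, audit_cost=50.0)
    early_stop = run_cost(250, schedule, step_cost=1.0, audit_cost=50.0)
    print(schedule, full_run, early_stop)
```

In this toy accounting, auditing during training is cheaper per failed attempt whenever failures are caught early enough that the saved training steps outweigh the extra audits, which is the dependence on tool cost noted above.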

Another way to potentially make this more cost-effective is pre-training and fine-tuning. It might be expensive to produce the pre-trained model whilst auditing throughout, but once this is complete the number of updates needed to produce the fine-tuned model is substantially smaller, meaning we can cheaply produce models capable of a variety of tasks. Unfortunately, it seems that there are good arguments why pre-training and fine-tuning is likely to produce a deceptive model.

Note we can now use both inspection and training transparency, since if our audit recognises deception as it arises, we don’t need to worry about a deceptive model gaming our tools. There is a reliability requirement due to the risk of SGD overfitting the tools though. We will look at this in more detail in the next section.

This covers the cases of humans using the tools to audit. There are also proposals for models to use the tools for oversight.

Informed oversight

Some alignment proposals involve recursively amplifying a model to get to AGI, and interpretability can help with inner alignment here too. We’ll consider oversight. We have a model $M$, and we use an amplification procedure to construct a model $\mathrm{Amp}(M)$. The idea of oversight is to have $\mathrm{Amp}(M)$ use interpretability tools to look for misalignment in the model after an update, similar to an audit. This does not need to be done every update step.

The amplified model is usually a human consulting $M$, and so this isn’t too dissimilar to auditing through training. One difference though is that the amplified model is improving, and so the quality of the oversight scales with the quality of the model^[5]. There are difficulties here with designing a training curriculum for the amplified overseer^[6], though hopefully the involvement of a human makes this less difficult.

Using interpretability tools to impact training

If SGD favours deceptive models, to the extent that we never pass an audit, can interpretability still be impactful? The natural thing to ask is whether we can use our interpretability tools to influence the final models we get, by changing the path SGD takes through model space.

Interpretability tools in the loss function

Typical loss functions mainly comment on input-output behaviour. We can use interpretability tools to update a loss function to provide gradients that push towards specific mechanistic properties of the model, and in particular to push away from deceptive models.
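As a minimal sketch of what adding such a term could look like, consider a two-weight linear model where a proxy feature equals the intended feature on every training point. The behavioural loss cannot distinguish the two weights, so I add a hypothetical penalty on reliance on the proxy weight. That penalty is a stand-in for whatever differentiable score a real interpretability tool might provide; it is my assumption for illustration, not a description of any existing tool, and all names and numbers are made up.

```python
from typing import Tuple


def train(penalty_weight: float, steps: int = 500, lr: float = 0.1) -> Tuple[float, float]:
    """Gradient descent on squared error plus an optional mechanistic penalty.

    Returns (w_intended, w_proxy). The penalty term is a hypothetical stand-in
    for an interpretability-derived score, not a real tool's output.
    """
    # Toy data: the proxy feature equals the intended feature on every point,
    # so input-output behaviour alone cannot tell the two weights apart.
    xs = [-1.0, 1.0, -1.0, 1.0]
    w_intended, w_proxy = 0.0, 0.0
    for _ in range(steps):
        grad_intended = 0.0
        grad_proxy = 0.0
        for x in xs:
            error = (w_intended * x + w_proxy * x) - x  # target is y = x
            grad_intended += 2.0 * error * x / len(xs)
            grad_proxy += 2.0 * error * x / len(xs)
        # Gradient of penalty_weight * w_proxy**2: pushes away from the proxy.
        grad_proxy += penalty_weight * 2.0 * w_proxy
        w_intended -= lr * grad_intended
        w_proxy -= lr * grad_proxy
    return w_intended, w_proxy


if __name__ == "__main__":
    print("behavioural loss only:", train(penalty_weight=0.0))
    print("with mechanistic penalty:", train(penalty_weight=1.0))
```

Both settings fit the training data equally well, but only the penalised run ends up relying on the intended feature. The difficult part in practice, which this sketch assumes away, is obtaining a penalty that actually tracks the mechanistic property we care about.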

Usually, the way this is proposed is in the form of relaxed adversarial training. This is where some adversary uses the interpretability tools to try to find descriptions of situations that cause misaligned behaviour, and we use the adversary’s results to update the loss. In proposals involving amplification, the adversary is the amplified model. This is similar to oversight from the last section, except we are using the oversight result to train the model, rather than just as an audit. This provides a solution when the oversight check fails.

There are two reasons why the model might score badly on the oversight. Either it is hard to interpret, and so the overseer scores it badly since it can’t verify its safety, or it is easy to interpret, but is misaligned. Hence this sort of procedure pushes the model to be both more interpretable and more aligned, so seems to involve interpretability via training naturally. We can also incorporate the other two types as well.

We could use adversarial training to avoid misalignment generally, but we again get a comprehensive high strength requirement doing this. We can lower this requirement by looking for specific failure modes we are worried about. Again however, deception is a difficult case for reliability, since if our model becomes deceptive it is hard to say whether the adversarial training has fixed the deception, or if the deceptive model has gamed our tools and hidden it. We can lower the reliability requirement by avoiding the deceptive regime through enforcing some prerequisite condition, like myopia. If there are some number of steps needed to go from non-myopic to deceptive, then we should enter a non-myopic non-deceptive regime first (Evan calls this the buffer zone), and our gradient should take us out of this without entering the deceptive regime. This would allow us to avoid needing tools that are robust to gaming. As mentioned earlier, using a suitable prerequisite should also remove the issue of suboptimality deceptive alignment.

There is however still a reliability requirement when it comes to overfitting by SGD. One way the model can perform well is by simply appearing to satisfy our prerequisite, but actually failing to satisfy it in some hard to detect way. For this to happen it has to be easier for SGD to overfit than it is to satisfy the prerequisite and perform well. In an ideal world our tools would be good enough to not be overfitted, but it would also be enough for the tools to make overfitting more difficult than satisfying the prerequisite. This ties in with comprehension, since if there are portions of the model we don’t understand, SGD can push the cognition we are not happy with into those parts, and so it does appear we need comprehensive tools here. It is likely that overfitting is easier to deal with than deception. This is since SGD is a simple local search process, whereas a deceptive model is a very capable optimiser^[7]. In amplification proposals there is also the hope that, since the overseer is improving, the inspection process is refined iteratively which acts to counter overfitting.

There are also competitiveness and model performance concerns here. These are intimately tied to the choice of prerequisite. If our prerequisite excludes all high-performing aligned models, this proposal will have too much model performance impact to be useful. Also, if the prerequisite is expensive to verify, either by its nature or because our tools are not competitive, this will incur a large cost on implementing this proposal. Since we still need leading institutions to be willing to consider this proposal, we need to pick the prerequisite wisely and develop competitive tools. Also relating to the prerequisite, we need the buffer to be large enough that SGD can’t skip over it in one step. Myopia is a good candidate, but it is not clear yet if it is ideal.

Similar to oversight, we can lower our competitiveness requirement by not running adversarial training at each step if we think our buffer is big enough. There are also similar difficulties here with designing a curriculum for the amplified model.

Improved human feedback

As a special case of using oversight, we have reward modelling, which is where the model constructs a reward function based on human feedback on its behaviour as determined by its current approximation to the reward function.

In this case the oversight can actually be incorporated into the feedback being given by the human. For example, if the human sees the agent doing the right thing, but upon using interpretability tools to look at its internals realises it did this for the wrong reason, they can provide negative feedback. This gets closer to an adversarial training style of setup, except that a human, not an amplified model, is providing the gradient update. This is a unique case, as it could be hard to turn human feedback into a gradient in general, which is why adversarial training is set up with an amplified overseer. Since this is similar to adversarial training, the analysis carries over.

Understanding the training process

The path through model space chosen by SGD is determined by the initialisation, design decisions, the data, and inductive biases^[8]. With a better understanding of intermediate models in the training process, we may be able to learn how specific properties of our final model were caused by these factors, giving a more rigorous understanding of how SGD works. For example, we might be able to see how certain proxies arise due to the training data, or maybe we see that there is a simplicity inductive bias that increases the likelihood of mesa-optimisers. This rigorous understanding could allow us to adjust these factors so as to avoid misaligned models, potentially without even using our tools in the loss function.

This idea is also compatible with tools in the loss function, and could improve our last two methods of impact. This is because, if we understood SGD better, it seems we’d be more likely to recognise the difference between overfitting our tools and genuinely making the model more interpretable. We could then act to remedy the overfitting.

One issue here is that presently we don’t know much about the inductive biases of SGD. If we don’t know what our inductive biases are, it is hard to form conclusions about how safety properties of the model relate to them. Also, inductive biases are a big part of why ML models achieve great test performance, and so changing them seems likely to cause test performance to crater. Either the tools will need to be very competitive, so that we can explore several paths through model space by changing the inductive biases^[9] and get a better understanding that way, or we will need a better initial understanding of the biases to lower this competitiveness requirement. It’s likely we need very competitive tools regardless, since we would be running them throughout training for lots of different training runs.

Even if this could be resolved, there is the problem that some properties of our model might only arise as we get close to AGI, and so we wouldn’t be able to evaluate how these properties arose in time for this to be impactful.

This method of impact requires inspection tools. We can’t use training tools alone, since we want to understand the model throughout training, and so in particular we can’t wait until it is made interpretable by the training process. Architecture tools can also be incorporated.

Whether we need to worry about reliability depends on whether we use training tools as well. If so, overfitting becomes an issue as before. There is the possibility of deception, though we probably aren’t that interested in learning about what happens after deception, since in this case what determines how the model gets updated has fundamentally changed. We will need tools strong enough to recognise deception, so that we can exclude the deceptive regime from consideration. Beyond this, we will need tools that are comprehensive and strong enough to avoid any failure modes we are worried about, and full comprehension if we want to learn about overfitting as mentioned earlier. If we are aiming to get aligned models using solely this idea then final model performance becomes important, and this will likely prove difficult as mentioned earlier.

Used as part of a proposal for safe advanced AI

Interpretability tools are a part of many proposals for how we might arrive at aligned AI. Here we look at cases that didn’t fall into the last two sections.

Microscope AI

The idea of microscope AI is to train a predictive model on some dataset, but instead of deploying it, we use interpretability tools to access what the model has learned, allowing humans to become better at making decisions for tasks related to the data. We train the model to be purely predictive, so as to avoid deception and issues relating to keeping the AI contained. This is done by using interpretability tools to stop optimisation from occurring. This proposal comes with the safety advantage that we do not need to deploy agentic systems.

This proposal has been evaluated by Evan Hubinger. Firstly, agentic AI appears likely because it seems it will outperform microscope AI. This will depend on the specific task, but there are many contexts where deploying a system to take actions in the world will outperform advancing human understanding. Potentially microscope AI could give humans the knowledge to build safe systems to do tasks, which might bridge this competitiveness gap somewhat. It’s worth noting that Chris agrees microscope AI is unlikely to be competitive; this is a case of low probability, high return.

Even if microscope AI is possible and its performance turns out to be more competitive than anticipated, it still needs to be the method of choice for leading groups in the field. This would likely require a major culture shift in ML. Interpretability has the potential to be a driver in this culture shift, as we will see later^[10]. We also need to be able to produce powerful prediction models for this to work. Cases like GPT-3 seem to give evidence this is possible, though at present we don’t have the ability to detect optimisation in models.

Now we can look at what is required of our interpretability tools. We must be able to recognise optimisation, and potentially remedy it. This could be done without full comprehension, for example via modularity, followed by auditing the relevant sub-modules. It is worth noting that presently it is an open question whether we can separate optimisation and modelling though. In addition, we again have the issue of potentially always failing the audit. We might be able to use our interpretability tools to provide a training signal away from optimisation, but this is not usually a part of this proposal. The reliability requirement will be the same as it was for auditing. This proposal requires that our interpretability tools tell us not just about the model, but also about the data the model was trained on. These two things do seem deeply linked however, to the point that, so long as your model performs well, I’m not sure a distinction really needs to be made.

We also need our interpretability tools to be reasonably competitive and strong. The competitiveness requirement comes from the fact that we must use our interpretability tools every time we have a new data set. The strength requirement comes from the fact that if we cannot extract meaningful information from the model, it is unlikely to improve our understanding. We can attain these properties through any of our types of interpretability tools. Even in the presence of both of these properties it may be that the information we discover is not as useful to us as it is to the model. This seems more likely to be a problem with models that are above human level abilities, but less so for systems before this point, since they will likely use human understandable concepts.

Eliciting latent knowledge

Suppose we train a model to predict what the future will look like according to a camera, which we use to take actions to guide us to better futures. There are some actions that could tamper with the camera so that it shows good futures regardless of what’s happening. More generally there are futures that look good on the camera but are bad in reality. The model knows facts that would change how we evaluate the reported future; the ELK problem is: how do we train the model to report its hidden knowledge of off-screen events? This problem is useful to solve since the hidden knowledge may be necessary to avoid catastrophic outcomes. This problem relates to several AI proposals.

Learning knowledge from a model that isn’t present in the outputs sounds quite a lot like interpretability. The difference is that ELK seeks to learn about the beliefs of the model, not the full mechanism, and it seeks to do so by having the model report the information itself.

It is possible that, with high strength comprehensive inspection tools, this problem can be addressed. For example, if our interpretability tools could identify whether a model knows some fact^[11], we would be able to ask questions like ‘has the camera been tampered with?’. It is hard to comment on competitiveness demands, since the frequency with which we would call upon our interpretability tools likely depends on the specific context. I am also uncertain on reliability.

Neel mentions the possibility of using interpretability to avoid human simulations. I think this is in relation to a naïve way of solving ELK. We could add a reporter component to the model, which we can ask questions, and train it on questions we know the answer to. The problem is the reporter may learn to say what we would predict, simulating what we were thinking instead of reporting accurately. The idea is to use the shortest interpretability explanation as a training loss for the reporter, with the hope that models with shorter explanations are less likely to include human simulations. It’s my understanding that there are no public write-ups of this yet, and so it is hard to analyse presently.

Contributing to new alignment research

Force multiplier on alignment research

Interpretability provides mechanistic descriptions of why a model is giving misaligned answers, giving a richer set of error analysis data upon which to base ideas. This could improve the productivity of people working on alignment.

The extent of impact here seems correlated to the strength and reliability of our tools. Even if we couldn’t achieve high strength or reliability, there could still be impact so long as we gain some new knowledge about our models. Also, since this impact is via alignment research, there isn’t really an issue with impeding final model performance.

Prevalence of the tools would also be important, which seems like it would be impacted by the competitiveness of the tools. If moderate strength and competitiveness tools existed it seems likely to me that they would spread in alignment research quite naturally. It also seems any of the three types of interpretability tool could be used here, though some might be preferable depending on the specific line of research.

One concern is if this richer data allows for improvements in capabilities more so than alignment. If so, this could actually increase the extent to which alignment lags behind, amounting to a negative impact on x-risk. There is a special case of this concern mentioned by Critch et al. If interpretability tools help capabilities research get through some key barriers, but models past these barriers are beyond the reach of our tools, we could find that interpretability speeds up the time to get to advanced models without offering any help in making sure these models are safe. Whether this is a concern depends on your views as to how we will arrive at AGI.

Better predictions of future systems

Interpretability may be able to tell us about the properties of future systems. The most ambitious version of this would be to find universal properties that apply in the limit of large models. This could allow us to reason about future systems more effectively, giving us a better chance of anticipating failure modes before they arise.

An example of this is scaling laws. Scaling laws state that the relationships between model loss and compute, dataset size or model size are power laws^[12]. This relationship has been observed across several orders of magnitude. This lets you make predictions about the loss your model would get when scaling up these factors. There is interest in finding a theoretical framework from which scaling laws can be derived, so that predictions can be made more precise and limitations can be discovered.
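To make the prediction step concrete, here is a small sketch of fitting a power law and extrapolating it. A power law loss ≈ a * size^(-b) is a straight line in log-log space, so ordinary least squares on the logarithms recovers a and b. The data below is synthetic, generated from constants I picked for illustration (a = 5, b = 0.1), and the functional form ignores any irreducible loss term; nothing here is a real measurement.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[float], losses: List[float]) -> Tuple[float, float]:
    """Fit loss ~ a * size**(-b) by least squares on log-log values; returns (a, b)."""
    log_x = [math.log(s) for s in sizes]
    log_y = [math.log(v) for v in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx                  # equals -b for a decaying power law
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope


def predict_loss(a: float, b: float, size: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** (-b)


if __name__ == "__main__":
    # Synthetic "observations" generated from made-up constants.
    sizes = [1e6, 1e7, 1e8, 1e9]
    losses = [5.0 * s ** -0.1 for s in sizes]
    a, b = fit_power_law(sizes, losses)
    print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e10: {predict_loss(a, b, 1e10):.3f}")
```

The extrapolation is only as good as the assumption that the same power law continues to hold beyond the observed range, which is exactly the kind of limitation a theoretical framework would help pin down.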

Interpretability could help elucidate why scaling laws occur. For example, if scaling laws occur as a property of the data, then this could become clear upon closely inspecting the model. It is worth noting that scaling laws apply to the loss, but not to specific capabilities, which are often discontinuous. Understanding further at the level of detail interpretability provides could help identify limits in the scaling hypothesis and whether specific capabilities of larger models are captured by scaling hypotheses. Chris Olah has also pondered the possibility of scaling laws for safety, allowing us to predict safety properties of larger models or potentially discontinuities in alignment, though this is just speculative.

Discovering more about scaling laws could also help clarify AI timelines, since scaling laws could be seen as evidence for arriving at advanced AI by purely scaling up current systems. This would be useful in deciding what work is most usefully done in the present.

For interpretability to be impactful here we need there to exist relevant properties that are preserved under scaling up models. Assuming these exist, we need tools strong enough to find them. Ideally, we would want to make claims about properties of the full future model, and so we would benefit from comprehensive tools here. The strength needed will depend on what we are looking for. For example, modularity can be observed without deeply understanding the model. Our tools also won’t need to be particularly competitive, since we aren’t necessarily using them at every training step. We will need our tools to be applicable on a range of models, and so they should be reliable in this sense. Any of the three types will do in identifying these properties, though inspection tools will probably be most valuable to ensure the properties are not produced by the specific training or architecture used.

Helpful in unpredictable ways

Interpretability seems likely to be helpful in ways we cannot presently predict. These unpredictable ways may not contribute to alignment directly (for example impacting governance), but here I will focus on the idea of unknown safety problems.

Chris Olah has spoken about unknown unknowns when deploying a system. If there are safety problems that we are unaware of, we can’t make tests for them or resolve them. We might also anticipate that, as systems become more capable, more unknown safety concerns arise. Interpretability provides a way to catch some of these unknowns before they have negative impact. Research to solve these previously unknown failure modes can be inspired as a direct result.

As an example, we can look at an analysis of CLIP models, which combine a vision model and language model to align images with text. This analysis discovered multimodal neurons corresponding to religion, race, gender, age, parental status, physical disability, pregnancy, LGBT status and mental illness. There were also concerning uses of these attributes. Although bias is a concern in ML models, appreciating that it could affect many more categories that we might not have anticipated is a result that interpretability illuminates here.

The intuition that understanding the internal state provides information that is hard to find using input-output data alone seems reasonable. However, commenting on the extent to which this will be the case is difficult. Again, the impact seems proportional to the strength/comprehension and prevalence of the tools. Prevalence again will likely depend on how competitive the tools are and how reliably they can be used across models. Given the vagueness of this method there is not much to say about which types will be most useful.

Changing the culture of ML

Empirical evidence for/against threat models

It could be quite hard to distinguish cases of inner alignment failure from other robustness failures using only input-output data. In addition, there is a lot of uncertainty about the prevalence of mesa optimisers and deceptive alignment, and whether deceptive models will be able to hide their deception.

If interpretability could provide empirical examples of inner alignment failures, beyond just a behavioural description, this could go a long way to addressing these concerns. For example, we could have a model trained in a typical way, apply interpretability tools to the model, and show it is deceptive. The mechanistic interpretation itself is what makes this demonstration so compelling, removing the possibility of it being a bug or a more typical type of robustness failure. Alternatively, if interpretability showed that some inner alignment failures are rarer than we expect, this could also be useful information for deciding what research to work on.

The most optimistic version of this impact would be that such demonstrations could bring more consensus to the field of ML, potentially encouraging the coordination of a slowdown whilst the concerns demonstrated are addressed. This might seem optimistic, though it would only require influencing the major players in the field to have impact in this way. A reason to be sceptical would be if you are worried that the field is too prone to cognitive dissonance about these types of safety concerns, or that economic incentives are powerful enough to push actors to deploy potentially misaligned models regardless.

The reliability requirement might seem similar to auditing from the first section, but we do not need to be able to verify failure modes in state-of-the-art models for this one. For example, the simplest model that is capable of deception might not be as competent at hiding its deception. The tools will still need to be strong enough to demonstrate the behaviour is not due to a misunderstanding or because of the training data^[13], though this potentially doesn’t require the tools be comprehensive for some failure modes.

However, it is the case that the more empirical examples we have the better. It’s likely that with more reliable tools we can catch more deceptive cases, with more comprehension/strength we can catch more failure cases, and with more competitive tools we can run them more frequently to catch different failure cases in training.

Again, inspection tools seem most applicable. If we used interpretability via training process or architecture to demonstrate a model was misaligned, it might be hard to convince people this wasn’t because of the changes we made in making the model. An interpretable misaligned model constructed in this way could still be a valuable example, but demonstrating a model made through a standard ML practice is misaligned would carry more weight, especially if the designers did not anticipate the concern.

Norm setting

If interpretability proves not too expensive to use, it may be feasible to use it for a large number of deployed models. If this were to be the case, people may come to expect deployed systems to be interpretable, encouraging developers to make their models more interpretable. This could be valuable if this leads to discovery of new failure modes, or if systems are discovered to have known failure modes and this stops them from being deployed.

This requires interpretability to be something consumers desire from models. This seems plausible to me: for example, if systems are used in making medical decisions, doctors will want to be able to explain these decisions to patients. There are some fields where interpretability might not be a natural thing for a consumer to want, but if the cases where it is wanted represent a large enough subset, then it may become expected across the board.

For this to be a reasonable path to impact we need interpretability tools to be very competitive, and to not impede final model performance too much. This is because it will be harder for interpretability to become the expectation if it is only used in worse performing or more expensive models. We can use any of our three types here, so long as they conform to this requirement.

Relating to this is how interpretability might affect adversarial robustness. If interpretable models have easier to find adversarial examples this might make interpretable models dangerous in some settings. When Neel originally presented the idea of norm setting it was presented as doing due diligence, so it might be the case that the model need not be interpretable to people other than the designers. This would alleviate the concern I have just raised, though I am sceptical that without regulation we can trust companies to do such due diligence (the regulation case is considered later).

A concerning failure mode that has been mentioned here relates to the commercial incentives. It could be the case that interpretability is used to convince non-technical people of properties of the system, but this is optimised for persuasion over truthfully representing the models^[14]. In this case interpretability might lead to false perceptions about deployed systems, that could lead to misaligned models nevertheless being deployed. Even in the absence of such a commercial pressure making sure interpretability remains tethered to a true description of the model is vital for it to be impactful generally. This relates to comprehension/strength and reliability, since it seems it will be easier to avoid this failure mode if we consistently have a good understanding of models.

There is the question of whether this method of impact will be more likely based on the decisions of longtermists. It seems unlikely that the commercial incentives will arise without good interpretability tools existing, and so this method of impact still seems dependent on research outside of mainstream ML.

Cultural shift

Another idea from Chris Olah is that interpretability can change how we measure the success of ML systems. Presently success is measured by achieving state-of-the-art on benchmarks. This prioritising of performance makes interpretability less relevant, and so models today are often black boxes.

If we are capable of making interpretable ML systems, the contrast to non-interpretable systems will be striking. Evan, in describing Chris’s views, draws the comparison to building a bridge:

if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.

The best version of this method is having the field change to become a more rigorous scientific discipline, closer to something like experimental physics than the present trial and error progression process. In this world using interpretability to understand why our models work is a priority, and improvements are made as a consequence of this understanding. In particular, this would lead to improved understanding of failure cases and how to avoid them.

I see this as a long shot, but with the possibility of immense impact if it was realised. This method requires a large portion of ML researchers to be working on interpretability and interested in rigour. Chris has spoken about why interpretability research could be appealing to present ML researchers and people who could transition from other fields, like neuroscience. The arguments are that presently there is a lot of low hanging fruit, interpretability research will be less dependent on large amounts of compute or data, and it is appealing to those who wish to be aligned closer with scientific values. It also seems vital that interpretability research is appropriately rewarded, though given the excitement around present work such as the circuits agenda, it seems to me this will likely be the case.

The early stages of this method of impact are more dependent on how people view interpretability than the tools themselves. That said, producing tools that are more reliable, competitive and strong/comprehensive than we have today would be a good signal that interpretability research really does have low hanging fruit. In addition, for mechanistic understanding to be a major driver of performance in the long run, we need tools that have these properties. Similar to norm setting, I think work relating to any of our three types of interpretability tools could be useful here.

Improving capabilities as a negative side effect is a relevant concern here. However, there is a good counterargument. If ML shifted to a paradigm where we design models with interpretability in mind, and these models outperform models produced through the current mechanism, then capabilities have improved but necessarily so has our understanding of these models. It seems likely that the increased aid to alignment will outweigh the increase in capabilities, and through this method of impact they come together or not at all.

Contributing to governance

Accountability regulation

Andrew Critch has written about the present state of accountability in ML. Technology companies can use huge amounts of resources to develop powerful AI systems that could be misaligned, with no regulations for how they should do this so long as their activities are ‘private’. The existence of such systems poses a threat to society, and Andrew points to how the opaqueness of tech companies mirrors that of neural networks.

Andrew provides a comparison to pharmaceutical companies. It is illegal for many companies to develop synthetic viruses, and those that are allowed to do so must demonstrate the ability to safely handle them, as well as follow standardised protocols. Companies also face third-party audits to verify compliance. Tech companies have no analogue of this, mainly because the societal scale risk of these technologies has not been fully appreciated. In particular some companies aim to build AGI, which if released could be much more catastrophic than most viruses. Even though these companies have safety researchers, we can again contrast to pharmaceutical companies, where we wouldn’t feel comfortable letting them decide what safety means.

One of the difficulties in writing such accountability regulations is that it is hard to make conditions on the principles AI systems embody. If we had adequate interpretability tools, we could use these to inspire such regulations and to uphold them. If these tools can be applied to models generally, then these regulations could be used as part of third-party audits on an international level, increasing the extent to which international treaties established for governing AI operations can be enforced and upheld. Part of this argument is that presently ML designers would argue a lot of requirements on the mechanism of their systems are not reasonable. Andrew has made the point that interpretability tools can narrow the gap between what policy makers wish for and what technologists will agree is possible. It’s worth noting that interpretability tools are not sufficient to create such accountability. In particular, there would likely need to be a lot more work done on the governance side.

There are a few ways such regulations could be structured, and these have different requirements on our interpretability tools. We could use inspection tools to audit final models or throughout training, but this requires high levels of reliability and competitiveness respectively. Alternatively, we could try to enforce interpretability using training tools, but this also comes with reliability requirements. All of these also come with a strong comprehensive requirement. Enforcing interpretability via architecture seems too restrictive to be used for this method of impact. Regardless of how regulation is done, we want to be able to apply our tools across many different models and architectures, so they need to reliably transfer across these differences. We also will probably need very competitive tools in order for such regulations to be feasible.

Enabling coordination & cooperation

Interpretability may be able to assist in solving race dynamics and other coordination issues, by contributing to agreements about cooperating on safety considerations. Andrew Critch has argued that if different actors in the space have opaque systems and practices it can be hard to gauge whether they are following agreements over safety or reneging. The hope is that if different actors can interpret each other’s systems, they will be more likely to trust each other to behave sensibly, and this will facilitate cooperation. The claim is not that interpretability tools will solve these coordination issues, but rather that they complement other proposals aimed at addressing them.

A willingness to produce interpretable models could demonstrate some commitment to safety by an institution, and so it would be more reasonable to believe they will honour cooperative agreements relating to safety. In addition, if institutions agree to certain training restrictions there may be properties of the model that interpretability can reveal that demonstrate whether or not these have been followed. Also, if models deployed by institutions are interpretable then presumably models in development will be too, since they are being developed with the goal of being deployed.

This method of impact doesn’t seem to restrict how our models are made interpretable. This means potentially that all three of our types are on the board, so long as final model performance is not damaged. The tools will need to provide enough comprehension to contribute to cooperation. This may not require high strength, but likely requires comprehensive tools in order for other leading developers to feel comfortable about safety properties of the model. In order for the tools to be used by institutions they will need to be very competitive and reliable across models.

A concerning failure mode here is if interpretability makes clear how close a race to AGI actually is, it might push those with a chance of winning to forgo safety more than when it was unclear if they could win.

Training AIs to interpret other AIs

In a world where things have gone well enough that advanced AI systems have been deployed with no x-risk events, we need to be concerned with risks arising from interactions between systems. For example, interactions could cause runaway resource consumption.

Interpretability could be used for what Critch et al. call interpretable hierarchical reporting. The idea is to set up hierarchies of AI systems that use interpretability tools to report on each other and to humans in a structured way that remains legible to human overseers. This can be compared to how a CEO has managers who provide interpretable information about sections of the business, and this allows a sufficient degree of interpretability to the CEO for the business as a whole. Using hierarchies has scaling benefits that may allow this reporting method to remain cost effective in keeping the aggregate system interpretable.

The alternative to using AIs to interpret other AIs is for humans to do this work. This seems unlikely to be possible given the frequency of interactions. Hence, it seems without the ability to train AIs to interpret each other we will default to doing no such reporting, risking catastrophe arising from AI interactions.

For this to work, we must have tools as comprehensive/strong as we needed for the oversight and auditing section. Given this, presumably we can make sure the AIs wielding our tools are aligned by having a human apply these tools to them. If this is not the case this idea breaks down under the prospect of having lots of misaligned AIs. This assumes some minimum level of competitiveness so that humans can apply the tools to a model capable of wielding them. Given this, the method of impact boils down to whether we can train AIs to use our tools.

Getting a training signal could be difficult. The overseeing model must learn how to inspect models, and how to detect specific misaligned behaviour. It could be difficult to get examples of bad behaviour without creating misaligned models, which could be hard to construct safely. There is the potential that if we have very general systems this won’t be as much of an issue. This concern means that even if we have strong comprehensive tools, they might not be wielded correctly. It is also unlikely we can use architecture or training tools here unless this has become the norm.

There is also a high competitiveness requirement due to the desire to catch unaligned interactions. To catch unaligned interactions the AIs wielding the tools must be using them in real time to some extent. This isn’t necessarily running them with every action of the overseen AI, but can’t be too infrequent. To continue our analogy, if the managers only report to the CEO once every 3 years it’s likely the direction of the company will get away from the CEO.

Why interpretability might fail to have an impact

We have discussed concerns relating to specific methods of impact. Here we will look at concerns that apply to a large number of the methods of impact given above.

Strong comprehensive tools might not be possible

It may be the case that we cannot get a suitable level of understanding of our models. For example, one might believe that in interpreting a model, the interpretations we form will always be leaky to some extent, and this will cause us to miss some important safety relevant aspects. I am sympathetic to this point of view, though I think it is worth considering some possible definitions. Chris Olah has provided a few examples of what we could mean by full understanding:

One has a theory of what every neuron does, and can provide a “proof by induction” that this is correct. That is, show that for each neuron, if one takes the theories of every neuron in the previous layer as a given, the resulting computation by the weights produces the next hypothesized feature.
One has a theory that can explain every parameter in the model. For example, the weights connecting InceptionV1 mixed4b:373 (a wheel detector) to mixed4c:447 (a car detector) must be positive at the bottom and not elsewhere because cars have wheels at the bottom. Ideally such a theory might be able to predict parameters without observing them, or predict the effects of changing parameters.
One can reproduce the network with handwritten weights, without consulting the original, simply by understanding the theory of how it works.
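As a toy illustration of the third condition, the sketch below treats a tiny two-layer ReLU network as the "original" and then rebuilds it from a stated theory alone. Everything here is a hypothetical example chosen for illustration (the XOR task, the `forward` helper, and both sets of weights); it is not taken from Chris's work, and real networks are of course far larger.

```python
import numpy as np

def forward(x, w1, b1, w2, b2):
    """Two-layer ReLU network: relu(x @ w1 + b1) @ w2 + b2."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

# Stand-in for a trained network (hypothetical weights that happen to compute XOR).
original = dict(
    w1=np.array([[2.0, 4.0], [2.0, 4.0]]),
    b1=np.array([0.0, -4.0]),
    w2=np.array([0.5, -0.5]),
    b2=0.0,
)

# Theory of the mechanism, written down without copying the weights above:
#   hidden unit 0 counts how many inputs are on,
#   hidden unit 1 fires only when both inputs are on,
#   the output is the count minus twice the "both on" signal.
handwritten = dict(
    w1=np.array([[1.0, 1.0], [1.0, 1.0]]),
    b1=np.array([0.0, -1.0]),
    w2=np.array([1.0, -2.0]),
    b2=0.0,
)

inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
original_out = forward(inputs, **original)
handwritten_out = forward(inputs, **handwritten)
print(np.allclose(original_out, handwritten_out))
```

Agreement on four inputs is trivial here. The point is only that the handwritten weights come from the theory rather than from the original parameters, which is the kind of test the third condition asks for at scale.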

The third of these seems to me to particularly demonstrate the level of understanding necessary for many of our methods of impact. In addition to these, Evan and Chris have spoken about the auditing game. Evan has formalised this more in the post, but here’s a summary by Rohin:

An attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.

The reason this seems so useful is any misunderstanding you have can be exploited by the attacker, and so it forces full model consideration.
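A minimal sketch of one round of the game, under heavy simplifying assumptions: the "model" is a linear scorer, the known specification is that the score must ignore a set of protected input features, and the attacker plants a small dependence on one of them. The names `PROTECTED`, `attacker`, `white_box_auditor` and `black_box_auditor`, and the tolerance values, are all hypothetical and only meant to show why inspecting internals can catch what behavioural probing misses.

```python
import numpy as np

PROTECTED = [3, 7]  # hypothetical spec: the score must ignore these input features

def attacker(clean_weights, rng):
    """Plant a dependence on one protected feature and return the modified model."""
    modified = clean_weights.copy()
    target = int(rng.choice(PROTECTED))
    modified[target] = 0.05  # kept small so it is hard to notice from behaviour
    return modified, target

def white_box_auditor(weights):
    """Inspect the parameters directly and report every violated protected feature."""
    return [i for i in PROTECTED if weights[i] != 0.0]

def black_box_auditor(weights, rng, n_queries=20, tol=0.1):
    """Probe behaviour only: flag a feature if shifting it moves the score by more than tol."""
    flagged = []
    for i in PROTECTED:
        x = rng.normal(size=(n_queries, weights.size))
        x_shifted = x.copy()
        x_shifted[:, i] += 1.0
        if np.max(np.abs(x_shifted @ weights - x @ weights)) > tol:
            flagged.append(i)
    return flagged

rng = np.random.default_rng(0)
clean = rng.normal(size=10)
clean[PROTECTED] = 0.0                       # the clean model satisfies the spec
modified, planted = attacker(clean, rng)     # the auditor only ever sees `modified`

found_white = white_box_auditor(modified)
found_black = black_box_auditor(modified, rng)
print(planted, found_white, found_black)
```

In this toy setup the white-box auditor describes the planted vulnerability exactly, while the behavioural auditor's tolerance hides it. A real attacker would be far less cooperative, which is what makes the game a worst-case benchmark.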

I think a strong comprehensive set of interpretability tools should be able to consistently win the auditing game, and potentially satisfy the other conditions as well. At present no one has managed all three of Chris’s conditions for a non-trivial neural network, though it has been done for sections of a network, for example curve-circuits. Whether this can be done for full models and for models outside of computer vision is presently unclear. I am less aware of progress on the auditing game. These seem to be major questions for future interpretability research.

Scalability concerns

Even if strong comprehensive tools are possible, they may not scale, and so never prove viable for state-of-the-art models. Something people point to here is the amount of effort Chris Olah and his collaborators have put into understanding a single model over several years. A lot of this time has been spent on developing inspection tools, including ideas that didn’t pan out. That said, the seven techniques used in the circuits thread to interpret model features are labour intensive. This is a problem since at present interpretability does not transfer across architectures, and so we have to redo a lot of our interpretability work. Also, interpretability might fail to overcome increases in model size, even within the same architecture. Chris Olah has spoken about these concerns and presented several reasons why they might be resolved.

Firstly, Chris anticipates that there is a regime where models use more interpretable concepts as model size increases, since small models cannot express the full concepts required and so have to rely on less crisp concepts. This provides a force that goes the other way, making interpretability easier with scale. This regime is likely to end as systems start to use concepts alien to us, but it might not matter if we are capable of using amplified overseers in place of a human.

In addition, Chris has observed structure being repeated across models. This is called the universality hypothesis, and would allow for interpretability to generalise across models trained on similar tasks. Chris has also observed recurring patterns within models. As an example, some models have symmetries where lots of features are just transformed versions of a single feature. This is called equivariance^[15], and allows you to understand several features in one go, or even several circuits if it occurs across multiple layers. This is just one example of recurring patterns in circuits, which are called motifs. We might also save orders of magnitude if models become highly modular in a way we can identify efficiently, since we need only consider modules that have a safety concern.
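To make the equivariance idea concrete, here is a small sketch that checks whether a set of filters are all 90-degree rotations of the first one. The 3x3 filters and the helper `rotation_turns` are hypothetical stand-ins; features in real vision models are only approximately rotated copies of each other, so an exact-equality check like this would need to be replaced with a similarity measure.

```python
import numpy as np

def rotation_turns(filters):
    """For each filter, return how many 90-degree turns map filters[0] onto it, or None."""
    base = filters[0]
    turns = []
    for f in filters:
        match = next(
            (k for k in range(4) if np.array_equal(np.rot90(base, k), f)),
            None,
        )
        turns.append(match)
    return turns

# Hypothetical edge-like filter and its rotations, standing in for learned features.
base = np.array([[1, 1, 1],
                 [0, 0, 0],
                 [-1, -1, -1]])
family = [np.rot90(base, k) for k in range(4)]
unrelated = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1]])

print(rotation_turns(family))             # every filter is a rotation of the first
print(rotation_turns([base, unrelated]))  # the second filter has no matching rotation
```

If a whole family matches, understanding the first filter plus the transformation is enough to understand all of them, which is where the saving in effort comes from.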

There is also the possibility that a lot of the work humans do in order to understand a model could be automated, and this could allow such interpretability to be cheap enough to apply even for large models. At the far end of this idea is having AIs interpret each other, which we discussed earlier.

These counters are hypotheses at the moment, since work like that done in the circuits agenda is at a preliminary stage. This means the properties discussed may be a function of a small sample size rather than a genuine signal. In addition, even if all the above factors hold, interpretability may still scale super-linearly and thus represent an increasing proportion of the cost of training and interpreting a model.
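The super-linear worry can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that training cost grows linearly with parameter count and interpretability cost grows as a power law with exponent `alpha` greater than one. The function name, the exponent and both coefficients are made-up values, not estimates.

```python
def interpretability_share(n_params, alpha=1.2, train_coeff=1.0, interp_coeff=0.01):
    """Fraction of total cost spent on interpretability under the assumed scaling laws."""
    train_cost = train_coeff * n_params             # assumed linear in model size
    interp_cost = interp_coeff * n_params ** alpha  # assumed super-linear in model size
    return interp_cost / (train_cost + interp_cost)

sizes = (1e6, 1e9, 1e12)
shares = [interpretability_share(n) for n in sizes]
for n, share in zip(sizes, shares):
    print(f"{n:.0e} parameters: {share:.0%} of total cost")
```

However small the starting coefficient, any exponent above one means the share tends towards the whole budget as models grow, which is why sub-linear or linear scaling matters so much here.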

The arguments Chris gives apply to the case of inspection tools. However, we can also use interpretability via training. Getting interpretability by updating the loss function leverages the training process, and so should continue to work with scaling increases. This requires that we turn our inspection tools into training tools, and we gain enough understanding of SGD to avoid overfitting. Interpretability research is not at a point yet where we know how hard this could be.
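As a rough sketch of what "updating the loss function" could look like, the example below adds a penalty term to an ordinary regression loss. Note the assumptions: an L1 sparsity penalty is used as a crude stand-in for an interpretability metric (no real metric of that kind is implied), the model is linear, and the optimiser is proximal gradient descent rather than SGD. The function `train` and all constants are hypothetical.

```python
import numpy as np

def train(x, y, penalty_weight, steps=500):
    """Minimise half the mean squared error plus penalty_weight * L1 norm of the weights."""
    n, d = x.shape
    # Step size 1/L, where L is the Lipschitz constant of the task-loss gradient.
    step_size = n / np.linalg.norm(x, 2) ** 2
    w = np.zeros(d)
    for _ in range(steps):
        grad = x.T @ (x @ w - y) / n  # gradient of the task loss only
        w = w - step_size * grad
        # Proximal step for the penalty: shrink every weight towards zero.
        w = np.sign(w) * np.maximum(np.abs(w) - step_size * penalty_weight, 0.0)
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -1.5]              # only two features actually matter
y = x @ true_w + 0.1 * rng.normal(size=200)

w_plain = train(x, y, penalty_weight=0.0)
w_sparse = train(x, y, penalty_weight=0.1)
print(np.count_nonzero(w_plain), np.count_nonzero(w_sparse))
```

The penalised model keeps fewer active weights and so is easier to read off, at the price of some shrinkage in the weights that matter. The overfitting concern in the paragraph above is the analogue of the optimiser satisfying the penalty without the model becoming any easier to understand.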

It could also be argued we can make architectural decisions to affect interpretability. Cynthia Rudin has written about this idea. I personally think that using architecture tools alone is likely to impact performance too much to be relevant to advanced systems, but it could complement the other types of tools.

Even if the scalability concern is insurmountable, we could still gain information from studying smaller models, and some methods of impact are preserved. Overall, this concern seems in the long run to be the most likely hindrance on interpretability being impactful.

The next three are me playing devil’s advocate more than anything else.

Interpretability tools developed today may not apply to future systems

This could occur because future systems are fundamentally different to present systems, or future systems use abstractions that are alien to us. We spoke about the alien abstraction case during the scalability section, so let’s focus on the first case.

If future systems don’t look like scaled up versions of present-day systems, then a lot of the interpretability impact methods may not be captured with present interpretability work. Some impact methods are preserved, for example if the culture of ML changes or we set new norms for how AI systems should look. An optimist might also say that interpretability may reveal properties that are universal across intelligent systems. I personally place enough credence in prosaic AI that this concern carries little weight.

Timelines might be too short

Interpretability has a better chance of being impactful for longer timelines. This is because the stories for why interpretability is presently hard but might not be in the future depend on lots of the work done now illuminating general ideas that appear over and over again in models. These upfront costs are likely going to take a long time to overcome. If someone believed AGI is more likely to come sooner rather than later, it seems to me other paths to impact gain more relative weight.

Alignment might be solved by default

If alignment is solved by default, then a lot of methods of impact won’t be realised. Depending on the specifics of how things go well by default, interpretability may still be able to contribute through some of our methods.

Conclusion

We’ve discussed how interpretability could impact x-risk, and reasons why this might fail to happen. One important recurring theme, especially to alignment proposals, is the importance of tools with high strength and full comprehension. This is because alignment is a property related to worst-case scenarios, and to make worst-case claims we need to understand the whole model very well.

In addition to this, I think it is likely going to be hard to make progress on the other requirements of our tools before we have examples of comprehensive tools. In particular, I think the problem of whether comprehensive high strength tools are possible should take precedence over trying to address scaling issues. Part of my thinking here is that it will be much easier to meaningfully talk about how our strong comprehensive tools scale if we actually have strong comprehensive tools. In addition, focussing on scaling early, before we have full understanding of a single model, may lead us to focus on the most interpretable parts of models with the less interpretable parts being neglected by comparison.

It’s also worth mentioning that inspection tools seem most valuable to work on in the present. This is because training tools will likely be made by using inspection tools to provide the training signal. There are some exceptions, though, that I will mention in a moment. Architecture tools could be valuable, but are unlikely to get us the level of understanding we need alone.

Given this, it seems that the most valuable work done in the present will primarily aim to improve the comprehension and strength of inspection tools. Evan has presented his ideas for how we might get to the sorts of tools needed for many of our impact methods. I agree with this for the most part, and in particular think that getting better partial tools is a useful aim of present research, even if this isn’t sufficient for a lot of methods of impact. Paul Christiano has also spoken about this, stating that he believes the goal of ‘fully understanding a neural network’ is mostly bottlenecked on ‘deeply understanding small pieces effectively’. There are competing intuitions, for example John Wentworth in a comment here states he believes that the hard part of full comprehension likely won’t be addressed by progress in partial comprehension.

If your intuition is that partial won’t get us to comprehensive, then research trying to approach full comprehension directly is probably more valuable. There are not many examples of present research of this type, though Evan points to heuristic arguments work being done by ARC and Redwood Research. I think the auditing game I spoke about earlier is probably the best place to start, since it automatically forces you to consider the whole model. Evan speaks about how progress on the auditing game could be done in his automated auditing post.

If your intuitions run closer to Evan’s or Paul’s, then trying to build upon the circuits agenda^[16] could be useful. Chris Olah has spoken about how we might extend the circuits work here. Applying the circuits ideas to transformer language models also seems valuable, as is being done in transformer circuits. Working directly on the auditing game still seems useful in this context.

One of the most important pieces of the circuits agenda is how thorough the authors are. Aiming for this level of rigour in doing interpretability work seems to me to be a vital part of the value of such work. As part of this I think keeping in mind the ultimate goal of deeply understanding the parts of the model under consideration, to the level of the three definitions I mentioned earlier, is useful.

There are a few cases where we might be able to approach interpretability via training in the present. For example, some work in modularity has explored how to change training to encourage networks to be more modular. It would be great if we could encourage vision models to be more easily interpreted using circuit-based tools. One example of this would be research into whether we can train neural networks to be less polysemantic. Work in this direction could also aim to get us more understanding of how SGD might try to Goodhart our metrics.

If a specific path to impact resonates, a good exercise might be to think about how we can get from present tools to ones capable of that method of impact. In doing this you could come up with a personal tech tree, similar to Evan’s, so as to identify what key bottlenecks to work on today to make progress towards those tools.

I hope this post has clarified how interpretability can be impactful, and allows people to compare interpretability to other methods of impacting x-risk from advanced AI.

^{^}
specifically: it could learn about the audit through the training data, or it could conclude there is a chance of an audit via abstract reasoning.
^{^}
deception is a case where we can’t really fix it once it has arisen, since we don’t know if the model is no longer deceptive, or if it’s just got better at hiding deception.
^{^}
We have the option of using different schemes here, but it could be the case that lots of schemes still lead to misaligned models.
^{^}
You can think about myopia agents as not thinking about the downstream consequences of their actions, which makes clear why a deceptive model is non-myopic.
^{^}
the oversight scales with improvements of the model since as the model improves it will get better at using the tools, and potentially design improvements for the tools.
Also note that at no point will we have a misaligned model doing the oversight, since we use the model from the last step, which has already passed its oversight test. This happens for adversarial training too, though in that case we use the last model that passed the check, which isn’t necessarily the most recent model.
There are reasons to think the opposite, though: for example, SGD has access to the model’s parameters in a way that a deceptive model doesn’t.
For those who don’t know what inductive biases are: they are properties of the model that SGD is likely to select for but that are not directly incentivised by the loss function. For example, Belkin et al. have hypothesised that SGD has an inductive bias towards simpler models.
We can change the inductive biases by changing the design choices, for example by including different regularisation schemes. There will still be some core SGD inductive biases, but this isn’t an issue if we turn the dials on the other introduced inductive biases.
For example, if interpretability revealed that agentic AI is generally misaligned, then perhaps coordinated slowdowns could be facilitated, and microscope AI could be used during that time.
Locating and editing factual associations in GPT is an example of present work that attempts to do this to some extent.
This is true modulo some caveats; for example, we can’t restrict one too heavily and still get a power law in the others.
As an example of this, the data used to train language models includes untrue statements, and this leads to language models appearing to lie sometimes.
As an example of this, see these evaluations of saliency maps.
The example often given of equivariance is when lots of features are just rotated versions of the same feature in vision models.
For those who are interested in learning more about the circuits program, the original thread is quite well explained, but there are also some very good summaries.
