In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.

In the rest of this post I will:

  1. Articulate a class of approaches to scalable oversight I call cognition-based oversight.
  2. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty in cognition-based oversight.
  3. Explain SHIFT, the technique we introduce for DBIC.
  4. Discuss challenges and future directions, including concrete recommendations for two ways to make progress on DBIC.

Overall, I think that making progress on DBIC is tractable with current interpretability techniques, and I’d be excited to see more work on it.

Cognition-based oversight

In this section, I describe a class of approaches to overseeing models I call cognition-based oversight. First, I review the problem of discriminating good vs. bad models and its relationship to scalable oversight. Then I impose a pessimistic assumption – that the good and bad models cannot be discriminated behaviorally – and propose cognition-based oversight as a potential remedy.

Discriminating models

In an ideal world, we could argue that AI systems will be safe by default as follows:

We train AI systems to receive high scores according to some scoring function. The models that get the highest scores are all good models that robustly behave as desired. Therefore, we have strong a priori reason to expect our training processes to result in good models.

Sadly, in the real world, the “scoring functions” we actually use tend to be shaped like “human evaluator approval.” And “human evaluator approval” does not enjoy the property that “the models which humans evaluate most positively all robustly behave as desired.” Two salient counterexamples:[2]

  • Sycophantic reward hackers take actions that look good to the overseer, even when this conflicts with doing what the overseer actually wants. For example, suppose we deploy an AI smart vault to keep a diamond safe, as judged by whether the diamond continues to appear on the vault camera; instead of trying to prevent diamond theft, a sycophantic reward hacker might instead cover diamond theft up by tampering with the vault camera.
  • Situationally-aware schemers behave well only when they think bad behavior would be caught. For example, a scheming code generation model might write vulnerability-free code only when it expects the code to be closely audited, but otherwise insert subtle backdoors.

The models in these examples score highly according to human evaluator approval, so we can’t use the argument above that they are unlikely to arise from training. We instead need to rely on arguments that these models are unlikely to arise despite scoring at least as well as good models. So far, I think these arguments are far from airtight, and I feel nervous about relying on them.

Said differently, a core problem in technical AI safety is that it can be generally hard to discriminate good models that robustly do stuff that we like vs. bad models which sometimes do stuff that we don’t like. We would like to be good at this discrimination task even in tricky situations where, e.g. the bad model is actively trying to look good.

Typical approaches to this problem use the following template: (1) get the models to produce importantly different outputs, then (2) figure out which of the outputs is better. Step (2) is the problem of scalable oversight, and since we’re worried about the regime where it is difficult to tell model outputs apart, it typically involves trying to get a more informative signal than just looking at the bare outputs. For example, we might have the models debate with each other about which output is better and then use the debate transcript to inform our determination (justification-based oversight). Or we might ask the models to break the process by which they arrived at their answer into steps each of which we can verify (process-based oversight). While approaches like these don’t definitively solve the model discrimination problem, I think they are a promising mainline attack which may work well in practice.

Cognition-based oversight for discriminating behaviorally identical models

Instead of asking how to improve the existing approaches to scalable oversight, I’m going to take a different tack: I will impose the pessimistic assumption that our models are effectively behaviorally identical, i.e. that we’re not able to discriminate our models on the basis of their input/output behavior. 

We typically try to discriminate good vs. bad models by reducing to the problem of discriminating their outputs (top, middle). Here we instead impose the pessimistic assumption (bottom) that this is not possible. (Meme explanation)

After precluding input/output behaviors, what’s left? I think that my assumption forces us to do what I call cognition-based oversight. That is, try to discriminate the models based on their cognition when producing outputs.

An example workflow for cognition-based oversight. A researcher inspects the cognition of both models in search of anything which seems suspicious. This is the approach taken in SHIFT (see below).

The “effectively behavioral identical” assumption is admittedly subtle: in most settings it is unrealistic to assume that our models’ I/O behaviors are exactly identical, so the assumption is largely about the limits of our capacity to judge outputs. The point of DBIC – introduced in the next section – is to simplify by moving to a classification setting where we can assume exactly identical model outputs, thus removing the need for tricky-to-pin-down assumptions that limit overseer judgment.

Discriminating Behaviorally Identical Classifiers

In this section, I’ll lay out a concrete problem which operationalizes the “no ability to discriminate on the basis of inputs/outputs” assumption from above. The main move here is to simplify the setting by assuming that our AI systems are classifiers, rather than more general AI systems like RL agents or generative models. I think that this simplified problem is crisper and easier to articulate, while still capturing most of the technical difficulty of cognition-based oversight.

Say that classifiers  and  are behaviorally identical relative to an input distribution  if  for all . Typically, we will consider classifiers which are implemented via a neural net, e.g. by attaching a classification head to a pretrained generative model like an LLM. Even when such classifiers are behaviorally identical, they might arrive at their outputs in very different ways. The problem of discriminating behaviorally identical classifiers (DBIC) is to, given access to two BICs – one that is arriving at its outputs in an intended way and one that isn’t – figure out which is which.

In more detail, an instance of DBIC consists of:

  • A classification task specified in natural language (e.g. “classify factual statements as true or false”). This natural language description should implicitly specify a distribution  of pairs  of inputs with ground-truth labels.
  • Classifiers  and , with one classifier (approximately) performing the intended classification task and one classifier performing a different (but potentially closely related) classification task. In particular, the former classifier is more accurate on .
  • An ambiguous subdistribution  of  such that  and  are behaviorally identical on .

The goal is to, given sample access to  but not the full task distribution , determine which classifier is more accurate on . In order to make this determination, we may make use of the classifiers’ hidden states when processing inputs from . Depending on our assumptions (see “Hard vs. Relaxed DBIC” below), we might also be allowed access to additional unlabeled data. But the key constraint is that we are not allowed access to labeled classification data  with .

Here are two example settings for DBIC:

  1. (ELK consists of pairs  where  is a factual statement (e.g. “There is no input which causes [some code sample] to return [some undesirable output]”) and  is the ground-truth label about whether  is true or false.  is a subset of statements which smart humans can reliably and accurately judge as true or false (e.g. “Paris is the capital of France”).  “the best guess of a smart human about the truth value of ” and “the best guess of a smarter-than-the-smartest-humans model about the truth value of ”. We would like to identify  as being more generally accurate.
  2. (Measurement tampering consists of pairs  where each  is itself a pair  consisting of some information  and (potentially fallible) measurements  of some property ; y is the ground-truth value of . For example, in the smart vault setting,  might consist of a bunch of sensor readings and planned smart vault actions,  is the boolean measurement “does the diamond appear on the vault camera?” and  is the ground-truth value of “is the diamond still in the vault?”.  is the subset of pairs for which , e.g. the vault camera accurately reflects whether the diamond is still in the vault.  and , with  being more generally accurate.

Both of these are special cases of a more general easy-to-hard domain shift setting, where some labeling function coincides with the ground truth on an “easy” subset of the data but makes systematic errors in general.

Hard vs. Relaxed DBIC

When trying to solve DBIC, are we allowed to use any data outside of ? Certainly there shouldn’t be restrictions on using whatever unsupervised data we happen to have lying around. For example, if our BICs are based on a pretrained foundation model, then we should be allowed to use data from the foundation model’s pretraining corpus. The more interesting question is whether we have any access to data from  (i.e. data from  not in the ambiguous set). I’ll distinguish between two problem settings:

  1. Hard DBIC: you have no access to any classification data in 
  2. Relaxed DBIC: you have access to classification inputs  from , but not to any labels.

In relaxed DBIC, you’re allowed to see tricky inputs on which the classifiers disagree, but you’re not allowed to know which classifier is correct (since that would trivialize the problem).

Redwood Research’s Measurement Tampering Detection Benchmark (MTDB) takes place in the relaxed DBIC setting: you’re allowed to make use of a trusted set ( in my terminology) in which measurements reflect the ground truth, as well as inputs from an untrusted set where we lack ground truth labels.

In practice, I’m not sure if we’ll be able to assume access to lots of samples from , so I’d prefer progress on hard DBIC to relaxed DBIC. In the next section, I’ll present a solution to a simple case of hard DBIC.

Example trusted and untrusted inputs from MTDB. This is from their “diamond in a vault” dataset, the least crazy of their four measurement tampering datasets.

SHIFT as a technique for (hard) DBIC

In Sparse Feature Circuits, my coauthors and I introduced a technique we call Sparse Human-Interpretable Feature Trimming (SHIFT) for solving a simple instance of hard DBIC. In this section, I’ll provide a walkthrough of what we did, as an example of what progress on DBIC looks like. In the next section I’ll discuss limitations and directions for future work.

Let’s start with the data. We prepare two datasets:

  • An ambiguous dataset consisting of professional biographies (think “LinkedIn bios”) for male professors or female nurses
  • balanced validation dataset consisting bios for a balanced mix of male professors, female professors, male nurses, and female nurses.

We then turn a small language model, Pythia-70M, into a classifier by using the ambiguous data to train a linear classification head out of Pythia-70M’s penultimate layer. It ends up that when you do this, you mostly get a classifier for gender (i.e., when you evaluate the classifier on the balanced dataset, it is 87% accurate for gender and 62% accurate for profession). 

So by training a classifier on the ambiguous dataset, we’ve gotten a classifier which generalizes badly. Is there a way we can detect this without sample access to the balanced data?

Yes; here’s how: we train a bunch of sparse autoencoders (SAEs) for various components of Pythia-70M. (For those unfamiliar with SAEs, here’s a brief footnote explaining what they do.[4]) Then we determine[5] which SAE features are most causally implicated in our classifier’s outputs on the ambiguous training data. After cutting things off at some arbitrary threshold, the result is a list of 67 SAE features which most contribute to our model’s accuracy on ambiguous data.

Next we try to understand what’s going on with these 67 features. Right now, the (somewhat sad) SoTA for understanding SAE features is “look for patterns in their top-activating contexts and most-promoted tokens.” Fortunately, that’s enough for this task; here are the Neuronpedia pages for some of the 67 features.

(left) Resid_1/26969 is active on words like “nurse” and “nursing.” (middle) Resid_2/29295 is active on female names. (right) Resid_2/11656 has an unclear activation pattern, but clearly promotes male-related words like “himself,” “beard,” and “bald.”

Overall, 55 of the 67 features seem to be associated with gender. At this point we’ve resolved our question: our classifier seems to be relying on a number of features which are not actually task-relevant, so will probably not generalize properly. 

We can actually go further: we pick up some extra credit by performing model edits to remove the undesired cognition from our classifier. In more detail, zero-ablating the 55 gender-related features (i.e. manually setting them to be always “off”) produces a modified classifier which gets 89% profession accuracy on the balanced data. And even better, if we ablate these 55 features and retrain our linear classification head on the ambiguous data, then we go up to 93% accuracy. This is as good as possible: it matches the performance of an oracle classifier trained directly on balanced data.

Overview of our SHIFT experiment. The GIF phrases the experiment in terms of “circuits,” but note that the edges didn’t actually play a role in the experiment.
Results of SHIFT experiment. Concept Bottleneck Probing (CBP) was a (surprisingly good!) baseline technique. Neuron/feature skylines involve “cheating” by using the balanced set to identify the best neurons/features to ablate.

To be clear, this setting is not technically an exact match to the DBIC set-up I described above. The main difference is that in the experiment, we only have one classifier which we are trying to label as “good” or “bad,” rather than a pair of good/bad classifiers which we are trying to discriminate. This is not, I think, a very important difference; in particular, I expect that the lack of a benign reference classifier makes things harder, not easier. 

So for clarity – and to reiterate what’s happened in this experiment – here’s a summary of our workflow, slightly modified to be strictly applicable to DBIC as stated earlier. Given BICs for some ambiguous distribution, we:

  1. Train SAEs for our classifiers (or, more specifically, the foundation model they’re based on).
  2. Identify the SAE features most implicated in the classifiers’ accuracy on the ambiguous distribution.
  3. Interpret these features, and identify any which seem “red flaggy” for the intended task.
  4. Declare the better classifier to be the one with the fewest “red flag” features.

Limitations and next steps

I think our SHIFT experiments are interesting, but they’re a far cry from the two example DBIC instances from above: ELK and measurement tampering. In this section, I’ll discuss what I expect would go wrong when attempting to apply the above workflow to these DBIC instances, and what to do about it.

Let’s start with ELK. Recall that the setting is: we have a good classifier  which classifies factual statements according to whether they’re true (according to the best guess of some base model which we assume is smarter than the smartest humans), and a bad classifier  which classifies factual statements according to whether smart humans think they’re true.  and  are assumed to be behaviorally identical on an ambiguous “easy” distribution consisting of factual statements which humans can reliably classify as true or false.

Suppose we try to apply our workflow from the previous section to this problem setting. Do we expect anything to go wrong, and if so what? (I encourage the reader to take a moment to think through this themselves.)

I expect (1) and (2) to go mostly fine (though see footnote[6]), and for (3) to completely fail. For example, suppose  has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating  vs.  to an equally-difficult problem of discriminating desired vs. undesired features.

With this in mind, there are two directions for future work that I feel most excited about:

  1. Better tools for understanding SAE features[7] (i.e. getting better at (3) above).
  2. Identifying useful settings where identifying red flag features is easier.

I’ll go through these two directions in more detail.

Direction 1: better ways to understand interpretable units in deep networks

Right now, the way we understand SAE features is by looking at their input/output behavior, i.e. at the inputs which activate them and at the outputs they promote. This is pretty unsatisfactory, especially considering that the point of cognition-based oversight was to get away from understanding models in terms of their input/output behaviors! And indeed, I think there’s a reasonable intuition – which seems exactly correct in the ELK case – along the lines of “Without additional interpretability techniques, there’s nothing magical about features which makes them easier to understand than the original model’s behavior.”[8]

So I’m interested in work which advances our ability to interpret SAE features. My best guess for a path forward here is functional interpretations of features. By this I mean: understanding features in terms of the roles they play in circuits, rather than in isolation. Understanding a feature functionally means incorporating information about how it interacts with other features; it requires bringing information from the feature connectome.

At the end of the feature circuits paper, we used unsupervised techniques to compute and catalog thousands of feature circuits for automatically discovered model behaviors. This was pretty slapdash: not all of the behaviors we discovered seem “real” (in some vague, intuitive sense), and we have no good way of understanding the quality of the results. But I’m excited about using our catalog (or better catalogs produced by others) as an input to understanding features functionally. The basic question here: Given a feature f that you would like to understand, does having a catalog of all the circuits that f participates in help at all?

Direction 2: identifying especially leveraged settings for cognition-based oversight

In the ELK setting above, the concern was that we reduced a difficult model behavior problem to an equally difficult interpretability problem. But contrast this to our SHIFT experiments above: there the interpretability task was trivial (just look at the example features above, which are more-or-less representative), whereas understanding the classifiers behaviorally was intractable. Thus, I view an important direction here as identifying problem settings where cognition-based-oversight gives us an especially good bang-for-our-buck where “our buck” = “interpretability difficulty” and “bang” = “reduction in AI existential risk.”

So far, the best answer I have for this is the measurement tampering setting. On one hand, I think that solving measurement tampering is almost as useful as solving the fully general ELK problem. And on the other hand, it intuitively seems that in this setting  must be thinking about lots of clearly-suspicious things which  doesn’t need to think about at all. For instance, in the smart vault setting, ’s output always – no matter the input – sensitively depends on thinking about the vault cameras. If our interpretability tools are good for anything at all, it feels that they should be able to tell us “Did our classifier’s output depend on thinking about cameras?”

So right now, I’m excited about work which uses interpretability-based techniques to set SoTA on MTDB. Moreover, I think this is a good test for interpretability as a field: If our interpretability tools can’t do this, then I’m worried they’ll struggle to prove useful in other settings.


In summary:

  • I’m excited about cognition-based oversight as a frame for interpretability work, and I’m especially excited about work on discriminating behaviorally identical classifiers.
  • SHIFT provides a toy demonstration of what progress on DBIC looks like.
  • I think the most promising directions for future work in cognition-based oversight come from (1) creating better techniques for understanding interpretable units in neural networks (e.g. SAE features), and (2) identifying and working towards the most leveraged applications.
  1. ^

    I’m not sure exactly how to operationalize this, but a related claim is: Suppose your lab will release a superintelligence 12 months from now, and your goal is to reduce x-risk from its initial release specifically using an interpretability-based method. Then I think you should spend your 12 months on refining and scaling up SHIFT.

  2. ^

    To be clear, I’m not claiming that these examples have empirically come up, or that they are likely to arise (though my personal view is that sycophantic reward hackers are plausible enough to pose a 5-10% chance of existential risk). Here I’m only claiming that they are in-principle counterexamples to the general point “models which humans evaluate most positively robustly behave as desired.”

  3. ^

    Scalable oversight is typically scoped to go beyond the narrow problem of “given a good model and a bad model, determine which one is good.” I’m focusing on this simpler problem because it’s very crisp and, I think, captures most of the technical difficulty.

  4. ^

    SAEs are an unsupervised approach to identifying a bunch of human-interpretable directions inside a neural network. You can imagine them as a machine which takes a bunch of pretraining data and spits out a bunch of “variables” which the model uses for thinking about these data. These variables don’t have useful names that immediately tell us what they represent, but we have various tricks for making informed guesses. For example, we can look at what values the variables take on a bunch of different inputs and see if we notice any property of the input which correlates with a variable’s value.

  5. ^

    Using patching experiments, or more precisely, efficient linear approximations (like attribution patching and integrated gradients) to patching experiments; see the paper for more details.

  6. ^

    am worried that SAEs don’t capture all of model cognition, but there are possible solutions that look like “figure out what SAEs are missing and come up with better approaches to disentangling interpretable units in model cognition.” So I’ll (unrealistically, I think) grant that all of the important model cognition is captured by our SAEs.

  7. ^

    Or whatever disentangled, interpretable units we’re able to identify, if we move beyond SAEs.

  8. ^

    I don’t think this intuition is an airtight argument – and indeed I view our SHIFT experiments as pushing back against it – but there’s definitely something here.

New Comment
5 comments, sorted by Click to highlight new comments since:

I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.

A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.

You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine that more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not only to distinguish good from bad models (by comparing their metric values) but also can hopefully be used to train the models.

I think that your definition of scalable oversight here is broader than people normally use. In particular, I usually think of scalable oversight as the problem of making it so that we’re better able to make a procedure that tell us how good a model’s actions are on a particular trajectory; I think of it as excluding the problem of determining whether a model’s behaviors would be bad on some other trajectory that we aren’t considering. (This is how I use the term here, how Ansh uses it here, and how I interpret the usage in Concrete Problems and in Measuring Progress on Scalable Oversight for Large Language Models.)

I think that it’s good to have a word for the problem of assessing model actions on particular trajectories, and I think it’s probably good to distinguish between problems associated with that assessment and other problems; scalable oversight is the current standard choice for that.

Using your usage, I think scalable oversight suffices to solve the whole safety problem. Your usage also doesn’t play nicely with the low-stakes/high-stakes decomposition.

I’d prefer that you phrased this all by saying:

It might be the case that we aren’t able to behaviorally determine whether our model is bad or not. This could be because of a failure of scalable oversight (that is, it’s currently doing actions that we can’t tell are good), or because of concerns about failures that we can’t solve by training (that is, we know that it isn’t taking bad actions now, but we’re worried that it might do so in the future, either because of distribution shift or rare failure). Let’s just talk about the special case where we want to distinguish between two models which and we don’t have examples where the two models behaviorally differ. We think that it is good to research strategies that allow us to distinguish models in this case.

I (mostly; see below) agree that in this post I used the term "scalable oversight" in a way which is non-standard and, moreover, in conflict way the way I typically use the term personally. I also agree with the implicit meta-point that it's important to be careful about using terminology in a consistent way (though I probably don't think it's as important as you do). So overall, after reading this comment, I wish I had been more careful about how I treated the term "scalable oversight." After I post this comment, I'll make some edits for clarity, but I don't expect to go so far as to change the title[1].

Two points in my defense:

  1. Even though "scalable oversight" isn't an appropriate description for the narrow technical problem I pose here, the way I expect progress on this agenda to actually get applied is well-described as scalable oversight.
  2. I've found the scalable oversight frame on this problem useful both for my own thinking about it and for explaining it to others.

Re (1): I spend most of my time thinking about the sycophantic reward hacking threat model. So in my head, some of the model's outputs really are bad but it's hard to notice this. Here are two ways that I think this agenda could help with noticing bad particular outputs:

  1. By applying DBIC to create classifiers for particular bad things (e.g. measurement tampering) which we apply to detect bad outputs.
  2. By giving us a signal about which episodes should be more closely scrutinized, and which aspects of those episodes we should scrutinize. (For example, suppose you notice that your model is thinking about a particular camera in a maybe-suspicious way, so you look for tricky ways that that camera could have been tampered with, and after a bunch of targeted scrutiny you notice a hack).

I think that both of these workflows are accurately described as scalable oversight.

Re (2): when I explain that I want to apply interpretability to scalable oversight, people -- including people that I really expected to know better -- often react with surprise. This isn't, I think, because they're thinking carefully about what scalable oversight means the way that you are. Rather, it seems that a lot of people split alignment work into two non-interacting magisteria called "scalable oversight" and "solving deceptive alignment," and they classify interpretability work as being part of the latter magisterium. Such people tend to not realize that e.g. ELK is centrally a scalable oversight agenda, and I think of my proposed agenda here as attempting to make progress on ELK (or on special cases thereof).

I guess my post muddies the water on all of the above by bringing up scheming; even though this technically fits into the setting I propose to make progress on, I don't really view it as the central problem I'm trying to solve.

  1. ^

    Sadly, if I say that my goal is to use interpretability to "evaluate models" then I think people will pattern-match this to "evals" which typically means something different, e.g. checking for dangerous capabilities. I can't really think of a better, non-confusing term for the task of "figuring out whether a model is good or bad." Also, I expect that the ways progress on this agenda will actually be applied do count as "scalable oversight"; see below.

(Edits made. In the edited version, I think the only questionable things are the title and the line "[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight." Maybe I should be even more careful and instead say that cognition-based oversight is merely something that "could be useful for scalable oversight," but I overall feel okay about this.

Everywhere else, I think the term "scalable oversight" is now used in the standard way.)

  1. Hard DBIC: you have no access to any classification data in 
  2. Relaxed DBIC: you have access to classification inputs  from , but not to any labels.

SHIFT as a technique for (hard) DBIC

You use pile data points to build the SAE and its interpretations, right? And I guess the pile does contain a bunch of examples where the biased and unbiased classifiers would not output identical outputs - if that's correct, I expect SAE interpretation works mostly because of these inputs (since SAE nodes are labeled using correlational data only). Is that right? If so, it seems to me that because of the SAE and SAE interpretation steps, SHIFT is a technique that is closer in spirit to relaxed DBIC (or something in between if you use a third dataset that does not literally use  but something that teaches you something more than just  - in the context of the paper, it seems that the broader dataset is very close to covering ).

I'm pretty sure that you're not correct that the interpretation step from our SHIFT experiments essentially relies on using data from the Pile. I strongly expect that if we were to only use inputs from  then we would be able to interpret the SAE features about as well. E.g. some of the SAE features only activate on female pronouns, and we would be able to notice this. Technically, we wouldn't be able to rule out the hypothesis "this feature activates on female pronouns only when their antecedent is a nurse," but that would be a bit of a crazy hypothesis anyway.

In more realistic settings (larger models and subtler behaviors) we might have more serious problems ruling out hypotheses like this. But I don't see any fundamental reason that using disambiguating datapoints is strictly necessary.