This post was written as part of the work done at Conjecture.
You can try input swap graphs in a Colab notebook or explore the library to replicate the results.
Thanks to Beren Millidge and Eric Winsor for useful discussions throughout this project. Thanks to Beren for feedback on a draft of this post.
Activation and path patching are techniques employed to manipulate neural network internals. One approach, used in ROME, which we'll refer to as corrupted patching, involves setting the output of certain parts of the network to a corrupted value. Another method used in the work on indirect object identification, referred to as input swap patching, involves assigning these parts the values they would produce on a different input sample, while the rest of the model operates normally. These experimental techniques have proven effective in gaining structural insights into the information flow within neural networks, providing a better understanding of which components are involved in a given behavior. These structural analyses can be straightforwardly automated using algorithms like Automatic Circuit DisCovery.
Patching experiments also enable the discovery of the semantics of components, i.e. the abstract variables a component encodes. They have been used to discover the "a vs an" neuron in GPT-2, the token and position signal in S-inhibition heads in the IOI circuit, or Othello board representations. In all these cases, researchers were not only able to say 'this component matters', but also 'it roughly plays this role'. However, these additional results come at a cost: they often required a lot of manual effort to design alternative datasets and come up with an intuition for the likely role of the component in advance.
In this post, I introduce input swap graphs (swap graphs in short), a new tool to systematically analyze the results of input swap patching experiments and recover semantic information with as little human effort as possible.
In the following, I present
Deciphering the intermediate activations of neural network components can be a daunting task. The prevalent approach involves breaking down these high-dimensional vectors into subspaces where variations correspond to concepts that are both relevant to the model's computation and comprehensible to humans.
In this work, we adopt an alternative strategy: we limit ourselves to using only input swap patching in order to uncover the roles of neural network components without directly exploiting the structure of their activations.
While it is common to combine information about directions in activation space and input swap patching experiments in interpretability research (e.g., as demonstrated in Neel Nanda's work on Othello), our objective here is to investigate the extent to which we can rely solely on input swap patching. This motivation is twofold: i) gaining a clearer understanding of the insights offered by different techniques, and ii) capitalizing on the simplicity of input swap patching, which can be readily translated into causal scrubbing experiments and modifications of the model's behavior. By maximizing our discoveries using patching alone, we remain closer to the experiments that matter most and reduce the risk of misleading ourselves along the way.
Our aim is to create a comprehensible abstraction of the roles played by a model's subcomponents. For a given model component C and a dataset, we ask: what role is it playing in the model computation? What abstract variables is it encoding?
We address this question by conducting input swap patching where we swap the input to C, while maintaining the input to the rest of the model.
If swapping the input to C does not change the model's output (verified by checking that the distance between the outputs before and after the swap is small), it may sometimes be because the component's output is interpreted in the same way by the subsequent model layers.
We have not completely characterized the conditions under which we can naively interpret "no change in output" as "the component encodes the same value" rather than, for example, "the component's encoded value changed but the model output remained the same by coincidence". However, the naive interpretation seems empirically and theoretically reliable when the component can only encode a limited number of discrete values.
By running all possible input swaps (i.e., every pair of inputs (x,x′)) and measuring the change in the model's output for each, we can organize the results into an input swap graph (shortened to swap graph hereafter). This graph has nodes representing dataset elements and edges (x,x′) that store the outcome of swapping x for x′ as input to C.
We can identify communities in this graph, i.e., inputs that can be swapped with one another as input to C without affecting the model's output. We then look for input features that remain constant within each community but vary between them. These findings significantly limit the possible abstract variables that C can encode, as any candidate variable should respond to the input features in the same manner.
Consequently, swap graphs enable us to rapidly formulate hypotheses that can be further validated through more rigorous experiments like causal scrubbing or refined by adjusting the dataset.
Let’s take the example of a model being a processor implementing a simple function.
We’ll focus on understanding the role of the teal component. In the figure below, we swapped its input from (a=12, b=30) to (a=23, b=11).
Since the teal component encodes the program variable c, and c=1 on the left run but 0 on the right, patching the teal component makes the result on the right go from 12 to 13.
A single patching experiment is not very informative; to obtain a comprehensive view of the role of a component, we need more of them!
Here is what the swap graph of the teal component looks like on a dataset of 7 inputs. We show the edges where the swap did not change the output of the model.
By observing the connected components of the graph, we can eyeball what the inputs have in common: b%10 is the same in each of them! Hence, a reasonable guess is that the teal component encodes the intermediate variable b%10. We were able to find its role without access to the program.
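The toy program itself is only shown graphically, so here is a hypothetical analogue to make the procedure concrete: we assume a program whose "teal component" stores m = b % 10, and we rebuild the swap graph from input swap experiments alone. The function names, the dataset, and the program body are all illustrative, not the post's actual example.

```python
import itertools

# Hypothetical analogue of the toy program from the figure: the teal
# component stores m = b % 10, and the rest of the program reads m
# but never reads b directly.
def teal(a, b):
    return b % 10

def program(a, b, teal_override=None):
    m = teal(a, b) if teal_override is None else teal_override
    return a % 2 + m  # rest of the program

dataset = [(12, 30), (23, 11), (7, 41), (4, 20), (9, 31), (5, 50), (8, 21)]

# Edge x -> y: swapping in y as input to the teal component, while the
# rest of the program runs on x, leaves the output on x unchanged.
edges = {
    (x, y)
    for x, y in itertools.permutations(dataset, 2)
    if program(*x, teal_override=teal(*y)) == program(*x)
}

# Inputs sharing b % 10 are freely swappable; others are not.
assert ((12, 30), (5, 50)) in edges
assert ((12, 30), (23, 11)) not in edges
```

The connected components of `edges` group inputs by b % 10, which is exactly the eyeballing step described above, done mechanically.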
def swap_graph(
    D,    # The dataset
    Comp, # The comparison function
    M,    # The model
    C,    # The component to investigate
):
    """Returns the swap graph of C on D."""
    G = np.empty((len(D), len(D)))  # The weight matrix of the swap graph
    for x, y in itertools.product(D, D):
        G[x, y] = Comp(M(x), M(C(y), x))  # Measure how much the model output
                                          # changes by swapping x by y as input to C
    return G
In practice, throughout this work, we used the KL divergence between the next-token probability distributions outputted by the original and the patched model as the Comp function. We transformed the distances given by the KL divergence into similarities using a Gaussian kernel.
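A minimal sketch of such a Comp function, computing the KL divergence between the two next-token distributions. The direction of the divergence, the softmax over raw logits, and the epsilon for numerical stability are assumptions of this sketch, not details given in the post.

```python
import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def comp(logits_original, logits_patched, eps=1e-12):
    """KL divergence between the next-token distributions of the original
    and the patched model (direction and eps are assumptions)."""
    p = softmax(logits_original)
    q = softmax(logits_patched)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

A swap that leaves the output distribution unchanged scores 0; the more the patched distribution deviates, the larger the raw edge weight.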
The algorithm requires a quadratic number of forward passes in the number of dataset examples. It is tractable for a dataset size of around a hundred samples. Such sizes are enough to recover interesting structures on simple datasets such as IOI.
Once we have G, we can leverage classic network analysis such as:
However, it's still unclear exactly what information we can recover from swap graphs (see the next section), and how to interpret them in the general case. When applied to IOI, we found that a naive interpretation was enough to recover results that were found with alternative methods.
Here is the IOI circuit found in GPT2-small: (for context on the IOI task and terminology, you can read the IOI paper or the last slides of this slideshow)
In the IOI paper, we characterize Name Movers as a class of heads acting as the “return” statement of the circuit.
Let’s compute the swap graph of the head 9.9 and visualize it with a force-directed layout.
We observe clear clusters of inputs that perfectly correlate with the value of the IO token on these input sequences. It makes sense: swapping the input to the head doesn't much influence the model output as long as the Name Mover returns the same value.
Nonetheless, the clustering is not perfect. In the center of the figure, we observe inputs that have different IO tokens but are nonetheless close. By coloring the nodes by the object (e.g. in the sentence from the circuit figure, the object is "drink"), we get:
The inputs (dark purple points) in the center all involve the "bone" object. It's plausible that on these inputs, GPT2-small is not using the IOI heuristic strongly, but something that depends on the "bone" context. However, this observation is not true for all bone-related sentences as they are not all in the same place, so the story must be more complicated than that.
You can create your own swap graphs in this Colab notebook!
Instead of directly trying to understand the computation happening in neural networks, I found it useful to first consider an idealized reverse-engineering problem, generalizing the toy problem presented above. I think this theoretical setting contains the important conceptual difficulties of understanding computation from causal interventions while removing the messiness of neural networks. In practice, I used it as a sandbox to refine my interpretation of swap graphs.
You have in front of you a processor. It’s an interconnected web of components, transistors, and memory cells that implements a program. You can control the input of the processor, and you can understand its output. However, you don’t know how to make sense of the intermediate electrical pulses that travel between components inside the processor.
What you can do instead is patch components. For this, 1) make two copies of the processor and run the copies on two different inputs, 2) take a component from one copy and use it to replace its homolog on the other copy to create a “chimera processor”, 3) recompute the output after the patch and observe the result of the chimera processor.
How much of the program can you recover by running input swap patching experiments?
Here we'll focus on using swap graphs to answer this question and explore how much juice we can extract from them. In this setting, an edge (x,x′) is present in a swap graph iff P(x)=P[C(x′),x], where P[C(x′),x] denotes the output of the chimera processor where the input to the component C has been swapped for x′.
We call v the program variable C is encoding and GC the swap graph of C on a dataset D.
Claim 1: If two inputs x and x′ lead to the same value of v in the program, they will be connected in GC. More generally, each possible value that v can take on the dataset will create a clique in the swap graph. This is saying that you can arbitrarily swap the input to C as long as the value it stores stays the same.
One thing that would make swap graphs really cool is if the converse of this claim were true: if we could decompose the graph into cliques and know that all the inputs in each clique correspond to the same value of the underlying program variable. This would be an incredibly easy way to understand the role of C! This is what we implicitly did in the toy example above to guess that the teal component was encoding b%10. We call this the naive interpretation of swap graphs.
The naive interpretation of swap graphs: Swap graphs can be decomposed into disconnected cliques such that every edge is inside a clique. Each clique corresponds to a possible value of the program variable stored in the component.
As the name suggests, this interpretation is not true in general. First, swap graphs are not always decomposable into disconnected cliques where every edge is inside a clique. Second, cliques in a swap graph can contain pairs of inputs that lead to different values of program variables.
In the following paragraphs, we present counter-examples to the naive interpretation. We then use the counter-examples to state conditions under which the naive interpretation is likely to be reasonable.
I think it's a fun problem to come up with counterexamples; feel free to stop here and think about them yourself.
Swap graphs can contain strictly directed edges.
Take the program:
c = a*b
You’re studying a component storing b. Here is a possible swap graph.
If a=0 (bottom left corner), you can swap b with any value without changing the output, but the opposite is not true. So swap graphs are not made of bidirectional edges, and thus cannot be decomposed into cliques where every edge is inside a clique in the general case.
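This counterexample is easy to run mechanically. The sketch below assumes the output of the program is c itself and that the studied component stores b; the function and variable names are illustrative.

```python
# Counterexample program: c = a * b is the output, and the studied
# component stores b.
def program(a, b, b_override=None):
    b_used = b if b_override is None else b_override
    return a * b_used

def edge(x, y):
    """Directed edge x -> y: running on x with y's value of b swapped in
    leaves x's output unchanged."""
    return program(*x, b_override=y[1]) == program(*x)

# With a = 0, any b can be swapped in without changing the output...
assert edge((0, 3), (2, 7))
# ...but the reverse swap does change it: a strictly directed edge.
assert not edge((2, 7), (0, 3))
```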
Okay, but in this case it is easy to detect that something weird is going on, because when a=0, the output doesn’t depend on b anymore. We could remove all inputs t that can be swapped arbitrarily without influencing the output, i.e. such that the edge s→t exists for every s≠t.
Unfortunately, this is not enough.
Swap graphs can contain strictly directed edges even when filtering for dependency.
Here is a swap graph for a component storing a on a dataset of 3 samples.
No input needs to be filtered out for independence, as there is no edge from or to (a=17, b=1).
We still have a strictly directed edge because 22%2=46%2 but 46%10≠22%10.
Okay, but when we have a clique where all elements are bidirectionally connected, then we can interpret this as a sign that the underlying variable has the same value, right?
Cliques in swap graphs don’t map to variable values in the program.
c = a+b
A swap graph for a component storing c:
Because the output can only be 1 or 0, there is a lot of wiggle room. As long as the outputs have the same parity, inputs that lead to different values of c will be connected in the swap graph. The final modulo 2 is washing away a lot of the information contained in the intermediate variables, creating edges "by coincidence".
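A minimal sketch of this failure mode, assuming (since the post says the output "can only be 1 or 0") that the final output is c % 2; the program body and names are illustrative.

```python
# Counterexample program: c = a + b, and the final output is assumed to
# be c % 2. The studied component stores c.
def program(a, b, c_override=None):
    c = a + b if c_override is None else c_override
    return c % 2

x, y = (1, 2), (2, 3)  # c = 3 on x, c = 5 on y: different underlying values
# Both swaps leave the output unchanged, so x and y end up bidirectionally
# connected "by coincidence": c is only ever read modulo 2.
assert program(*x, c_override=sum(y)) == program(*x)
assert program(*y, c_override=sum(x)) == program(*y)
```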
From these counterexamples, we can extract principles describing when the naive interpretation breaks. The following points are about known failure modes, not comprehensive rules.
Thus, we are more confident in the naive interpretation:
These principles can be applied to empirical swap graphs on neural network components. For instance, a practical way to implement the second point is to measure the full output probability distribution and not just a few logits to catch any deviation (but this also changes the scope of the behavior you’re studying as you're not always interested in explaining the full distribution, there is a tradeoff here).
We can also rely on a frequency argument to interpret swap graphs, again using a simple model. We can see a swap graph as being made of cliques (that we expect because of Claim 1.) plus additional edges coming from swaps that did not change the output of the model for other reasons than keeping the underlying variable constant. Here, we model the additional edges as noise.
Then, we can choose to design a dataset such that the number of edges from the cliques is expected to be larger than the noise. For the cliques to be large, their number should be small. I.e. the number of possible values of the abstract variable should be small. Of course, it still needs to be greater than 1, else the whole graph is a clique.
To enforce this regime, we can make hypotheses about likely variables a component can encode and reduce the number of values they take on the dataset. This is easier than designing a hypothesis about the specific variable a component encodes. If you control the process by which the dataset is created (e.g. by relying on templates with placeholders), it is easy to restrict the number of values taken by all your variables. Swap graphs can then tell you which of the many candidate variables a component is encoding, without having to test them one by one. Moreover, some variables naturally take a limited number of values (e.g. the gender of a character).
Here is a summary of the points made in this section. This forms the lens through which we'll interpret swap graphs in practice:
The empirical naive interpretation of swap graphs: When a swap graph can be decomposed into separable communities, we interpret each community as a possible value of an abstract variable that the component is encoding when run on this dataset. We are more confident in this interpretation when we measure a rich output of the model, and when the number of possible values the abstract variable can take is low. Such an abstract variable, when it exists, is interpreted as an approximation of the role played by a component.
Despite all these considerations, we don't have a guarantee that our component can cleanly be modeled as "storing a simple variable that will be read by later layers". Models are messy and components can play several roles at the same time by representing features in different directions, or even in superposition. Here is a list of possible extensions to better model the peculiarities of neural network representations and design more precise experimental techniques.
Since we have extensive information about IOI in GPT2-small, we chose this task to validate swap graphs and their interpretation.
Metric choice. However, because we used KL divergence as our metric, the scope of the behavior we’re investigating is different from what was studied in the IOI paper, which focused on logit differences. Here we aim at understanding the full probability distribution, not just why the IO token is predicted more strongly than the S token.
We made this choice of metric i) to gather data that would generalize to other tasks where logit difference can't be used but KL divergence can, and ii) because, per our empirical naive interpretation, we are more confident interpreting swap graphs that measure the rich output of the model. We nonetheless validate some of our findings on GPT2-small by running swap graphs with logit differences in the Appendix.
Dataset. For all the experiments presented in this post, except when mentioned explicitly, we used a modified version of the IOI dataset generation introduced in the IOI paper where we restricted the number of possible names to 5. We made this choice because we expected the various names to be important features and we were motivated to make the naive interpretation more likely.
Gaussian kernel. Network analyses and visualization are easier when the weights of the edges represent similarities or attraction forces, rather than distance. To turn the raw weights of KL divergence into graph weights, we used a Gaussian kernel with a standard deviation equal to the 25th percentile of the raw weight distribution. This choice is somewhat arbitrary, but we found that the structure of the graph is not highly sensitive to the choice of metric or kernel function.
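A sketch of this transformation, assuming the standard Gaussian kernel form exp(-d²/(2σ²)); the post only specifies that σ is the 25th percentile of the raw weight distribution, so the exact kernel expression is an assumption.

```python
import numpy as np

def kl_to_similarity(kl_weights):
    """Turn raw KL divergences (distances) into graph weights
    (similarities) with a Gaussian kernel. Per the post, sigma is the
    25th percentile of the raw weight distribution; the kernel form
    exp(-d^2 / (2 sigma^2)) is an assumption of this sketch."""
    kl = np.asarray(kl_weights)
    sigma = np.percentile(kl, 25)
    return np.exp(-(kl ** 2) / (2 * sigma ** 2))
```

A swap with zero KL divergence gets weight 1, and weights decay smoothly toward 0 as the divergence grows, which is what the force-directed layout needs.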
On the left is the distribution of the raw weights (the KL divergence scores) of the swap graph of 9.9 on IOI, and on the right the graph weights after applying the Gaussian kernel.
Name Mover's queries.
In the previous experimental results, we investigated the swap graphs of the output of 9.9. We can do the same by input swap patching its queries instead.
Here, the clusters correspond to the order of the first name, i.e. whether the template is of the type ABB or BAB. “Alice and Bob went to the store. Bob …” is of template type ABB; “Bob and Alice went to the store. Bob …” is of type BAB.
This is coherent with the results presented in Appendix A of the IOI paper (that was done using hypothesis-driven patching). Name Movers rely on S-Inhibition Heads through query composition to compute their attention specific to IO. S-Inhibition Heads are sending information about the position of the S1 token and its value, with the first piece of information having more weight than the second. They are saying to Name Movers “avoid paying attention to the second name in context” more than “avoid paying attention to the token ‘Bob’“. We called these two effects, position signal, and token signal respectively.
What we observe here is that the value of the position signal perfectly correlates with the two clusters present in the swap graph. Hence 9.9's queries are encoding the template type.
We know that the position signal originates from S-Inhibition heads. Let's track it down!
The two clusters are present and again perfectly correlate with the template type.
When coloring by the value of the subject token, we observe partial sub-clusters inside the main clusters. To validate this observation, we can "zoom in" on a particular cluster by creating an IOI dataset with only the ABB template type.
This validates the observation: once the template type is fixed, the second most important feature is the value of the subject token. This is coherent with the existence of the position and token signal, and by analyzing the hierarchical order of the features in the swap graph, we recover their relative importance.
This is an example of disentangling two variables encoded by the same component.
Going deeper: duplicate token heads and induction heads.
We can continue our journey in the depth of the IOI circuit and track the origin of the position and token signal even further. Here is the IOI circuit again.
Here is the induction head 5.5 at the S2 position:
One weak induction head 5.9:
That’s a good example of a weak/absent pattern in a swap graph. I’m uncomfortable interpreting any structure from these plots except that there doesn't seem to be any interpretable structure.
Here is the duplicate token head 3.0:
For now, we’ve only been confirming results already described in the IOI paper. But this new technique enabled us to make surprising discoveries.
We tried to reproduce the study on a Mistral model with the same architecture as GPT2-small (the model is called "battlestar"). Through path patching, we found the structural equivalent of Name Movers (heads directly influencing the logits) and S-inhibition heads (heads influencing Name Movers).
When we computed the swap graph of 8.3 at the END position, a structural analog of the S-Inhibition heads, we observed:
We observe two clear clusters, but the order of the name is only of secondary importance. By looking at the sentences in each cluster, we discovered that they perfectly correlated with the gender of the subject.
This result holds when allowing more than 5 possible names. We used genderize.io to attribute a gender to 88 different names to color the figure above.
By looking back at GPT2-small and computing swap graphs for more heads, we discovered that 9.7 (a weak Backup Name Mover head) was also, in fact, a gender head.
Despite being part of the IOI circuit, this head is quite poorly characterized in the paper because of its weak importance. It is thus not particularly surprising that the IOI paper missed important parts of its role. Similarly, 8.3 in the Mistral model is also not the most crucial head for the model computation.
I don’t have a good understanding of how gender information is used. There is no reason for it to be used a priori, and this fact would have been hard to find using hypothesis-driven patching. I was able to discover it because I observed two clear clusters that didn’t map to the features I hypothesized at the time.
Community detection. To scale to a bigger model, we automated the process that we did manually so far: eyeballing clusters and how they match different coloring schemes. We start by detecting communities in the swap graph with the Louvain method. Then, we compute their adjusted Rand index with values taken by hypothesized abstract variables (i.e. discrete functions of the input like the IO token). This index is a measure of similarity between partitions of a set, in this case, the partition outputted by the community detection and the partition induced by the hypothesized variable. The index is adjusted, meaning that a score of zero is what we expect for two random, independent partitions.
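This automated comparison can be sketched with networkx and scikit-learn. The helper name and the toy graph are illustrative; the actual pipeline runs on real swap graphs.

```python
import networkx as nx
from sklearn.metrics import adjusted_rand_score

def compare_to_variable(G, variable_values, resolution=1.0):
    """Adjusted Rand index between the Louvain communities of a weighted
    swap graph G and the partition induced by a hypothesized abstract
    variable (variable_values[n] is its value on input n)."""
    communities = nx.community.louvain_communities(
        G, weight="weight", resolution=resolution, seed=0
    )
    node_to_comm = {n: i for i, c in enumerate(communities) for n in c}
    nodes = sorted(G.nodes)
    return adjusted_rand_score(
        [variable_values[n] for n in nodes],   # hypothesized partition
        [node_to_comm[n] for n in nodes],      # detected communities
    )
```

An index near 1 means the detected communities match the partition induced by the variable; near 0 is what two independent partitions would score.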
Semantic maps. To evaluate swap graphs on larger models, we run them on every component (attention head and MLP block) at the END position on the IOI dataset and compute their communities.
From there, we introduce a new visualization technique called a semantic map to have a comprehensive view of the role of every component in a single figure. A semantic map is an undirected, weighted, complete graph S=(C, C², wS) where the nodes are the model components C, and the weight wS(C1,C2) of the edge (C1,C2) is defined by the adjusted Rand index between the Louvain communities of GC1 and GC2.
Intuitively, the more connected two components are in the semantic map, the more similar the abstract variable they encode (to the extent they encode one).
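Given the Louvain communities of each component's swap graph, the semantic map weights above reduce to pairwise adjusted Rand indices. A minimal sketch (the component names and label lists are placeholders):

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def semantic_map_weights(partitions):
    """Edge weights of the semantic map: the adjusted Rand index between
    the swap-graph Louvain communities of every pair of components.
    partitions[c] lists the community label of each dataset sample for
    component c, in a fixed sample order."""
    return {
        (c1, c2): adjusted_rand_score(partitions[c1], partitions[c2])
        for c1, c2 in combinations(partitions, 2)
    }

partitions = {
    "9.9": [0, 0, 1, 1],    # e.g. clusters by IO token
    "9.6": [1, 1, 0, 0],    # same partition, different label names
    "mlp11": [0, 1, 0, 1],  # an unrelated partition
}
weights = semantic_map_weights(partitions)
assert weights[("9.9", "9.6")] == 1.0  # ARI ignores label names
```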
We used four models for our comparative analysis: GPT2-small, GPT2-XL, Pythia-2.8B, and GPT-Neo-2.7B.
Again, we used force-directed graph drawing to visualize the graphs. We used the following visualization scheme:
Here are the same graphs, colored by component type: yellow for MLP, and dark blue for attention heads.
You can find all the plots as interactive HTML, with more coloring (by layers, by feature, etc), in this drive folder.
I’ll use “encode” as a way to mean “whose swap graphs’ Louvain communities on the IOI dataset have high adjusted Rand index with”.
Semantic maps of queries.
To build swap graphs, instead of patching the output of a component, we can patch the queries input to an attention head. Then, we can compute semantic maps on the queries (restricting the components to attention heads). This time we only ran it on the 10% of components with the most important queries. You can find all the plots in the drive folder.
Considering the low adjusted Rand indices between swap graph communities and manually created input features on larger models compared to GPT-2 small, it's natural to question the significance of the high-level structures we observe. In practice, most swap graphs are not cleanly separable into tightly connected communities.
To validate that swap graphs enable us to make predictions about the model mechanism, we designed two sets of downstream experiments leveraging the information contained in the Louvain communities. First, we checked that we can swap the input inside communities without impacting the output of the model (that's a kind of causal scrubbing experiment). Second, we created targeted rewrites of specific components by swapping their input across communities to trigger a precise model behavior.
A first sanity check to ensure that we can naively interpret swap graphs and their communities is to resample the inputs inside the Louvain communities. If each input inside a community leads to the component representing the same value, we can swap it for an arbitrary input from this same community. The value represented should stay the same, and thus the output of the model should not change much.
Resampling all the components at once is not a really interesting experiment to run. If the model uses complicated circuitry before returning its answer, such that the output only depends on the last part of the circuit, then resampling the input to all the components is equivalent to resampling the last part of the circuit. We will never test how much we can preserve the structure of the intermediate nodes.
To limit this, we resample by communities all components before layer L, for every layer L. This means that we also measure whether the model can compute the end of its forward pass and return meaningful output despite all the early components being run on resampled inputs.
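The resampling step itself can be sketched as follows; the patching harness that actually runs the model on the swapped inputs is not shown, and the function name is illustrative.

```python
import random

def community_resample(communities, seed=0):
    """Map each dataset index to a swap partner drawn uniformly from its
    own Louvain community. In the scrubbing experiment, every component
    before layer L is then run on the partner input instead of the
    original one."""
    rng = random.Random(seed)
    return {
        i: rng.choice(sorted(c))  # partner stays inside i's community
        for c in communities
        for i in c
    }

mapping = community_resample([{0, 1, 2}, {3, 4}])
# Partners always stay inside the community of the original input.
assert all((i <= 2) == (mapping[i] <= 2) for i in mapping)
```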
This is a particular kind of causal scrubbing experiment where we don’t fully treeify our model and define equivalence classes for resampling. More precisely, the scrubbed computational graphs on a toy example for different values of L look like this:
To compare the performance of the causal scrubbing from swap graphs, we introduce two other methods to partition the inputs.
Each of the three methods takes hyperparameters that control their tendencies to favor small (but numerous) communities, or large (but less numerous) ones: K for random partitions, the resolution in the Louvain algorithm (the default value we used so far is 1.0), and the linkage threshold for Ward’s method.
Of course, we expect that the smaller the communities, the less destructive the resampling would be. In the limit, each community contains a single element (e.g. random communities with K>>|D|), and the resampling has no effect.
To fairly compare the results of the three methods, we need to evaluate the size of the partitions they return, i.e. where they lie between one set per sample and one big set containing all the samples. To this end, we used two metrics:
For each method, we chose a set of 5 hyperparameters such that the partition metrics span similar ranges. Each line color corresponds to roughly the same average resampling entropy and community size.
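The two partition metrics can be sketched as follows. The post does not give formulas, so these definitions are assumptions: the resampling entropy of a sample is taken as log2 of its community size (uniform resampling within the community), and both metrics are averaged over samples rather than over communities.

```python
import math

def partition_metrics(communities):
    """Assumed definitions: average resampling entropy (expected log2
    number of swap partners of a sample) and average community size,
    both weighted by the number of samples in each community."""
    n = sum(len(c) for c in communities)
    avg_entropy = sum(len(c) * math.log2(len(c)) for c in communities) / n
    avg_size = sum(len(c) ** 2 for c in communities) / n
    return avg_entropy, avg_size

# One big community: maximal entropy and size.
assert partition_metrics([set(range(8))]) == (3.0, 8.0)
# Singleton communities: resampling has no effect.
assert partition_metrics([{i} for i in range(8)]) == (0.0, 1.0)
```

Both metrics interpolate between the two extremes of one set per sample and one set containing everything, which is what makes the three partitioning methods comparable.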
The results for the IO and S probability can be found in the drive folder.
This is a reassuring validation that the naive interpretation seems to make sense empirically. In some way, this is the minimal test that swap graphs and our interpretation are required to pass for us to be willing to give them weight.
In the previous experiments, we validated that the Louvain communities correspond to sets of inputs invariant by swap. But we did not make use of the knowledge we gained with the semantic maps. This is what this section is about.
If we claim to understand the variables stored in various components, we could also imagine how changing the variable stored in a set of components will affect the rest of the layers and eventually the output of the model. We can even imagine how to change the model to put it in a configuration we never observed before (e.g. implanting a jellyfish gene in the DNA of a rabbit). If we’re able to predict the results of such experiments, this is a good validation of our understanding.
We call such experiments that make a model behave outside of its normal regime in a predictable way targeted rewrites. In that, they differ from causal scrubbing, which measures how much the model behavior is preserved when the internal distribution respects a given hypothesis.
Targeted rewrite on Name Movers
We did not conduct any circuit analysis of the kind performed in the IOI paper on the larger models. Nonetheless, in this section, I'll draw parallels between components from the IOI circuit and components from larger models using information from swap graphs alone. More work is needed to know how much these parallels make sense. For now, you can see them as a way to touch reality quickly to see if I can modify the model behavior by swapping inputs across communities for specific components -- an essential part of the naive interpretation view.
Similarly to the behavioral definition of induction heads, I define extended name movers by a continuous criterion that relies on high-level observations about the head rather than a precise mechanism. An attention head in a TLM is an extended name mover if
The intuition behind this is that we want to find heads copying the IO token. The first point is a proxy for checking that they attend back to it (instead of e.g. repeating the value of the IO token sent from earlier layers), and the second point is checking that the primary variable that explains how they influence the output when swapping inputs is the value of the IO token.
Once we have identified the extended name movers, we change their inputs such that they output a name that is not the indirect object. The goal is to measure whether this translates into the model outputting the alternative name instead of the indirect object.
More precisely, we chose an arbitrary cyclic permutation σ of the 5 possible IO tokens and retargeted all extended name movers to output σ(IO) instead of IO. We perform this for various values of p. Smaller values of p correspond to more heads being patched. For ease of reading, we use the proportion of patched heads as the x-axis.
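The rotating (cyclic) permutation σ can be sketched in a few lines. The five names below are placeholders; the actual names in the restricted IOI dataset may differ.

```python
# Placeholder names for the 5 possible IO tokens of the restricted dataset.
NAMES = ["Alice", "Bob", "Claire", "David", "Eve"]

# A cyclic permutation sigma of the possible IO tokens: every name maps
# to the next one, so sigma(IO) != IO for each IO.
sigma = {name: NAMES[(i + 1) % len(NAMES)] for i, name in enumerate(NAMES)}

assert sigma["Alice"] == "Bob" and sigma["Eve"] == "Alice"
# The retargeted output is never the original indirect object.
assert all(sigma[n] != n for n in NAMES)
```

Each extended name mover is then patched with an input whose IO token is σ(IO), so that, if the naive interpretation holds, it copies σ(IO) instead of IO.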
Here are the results of the experiments for the IO probability. This is a more relevant metric than logit diff because the extended name movers are seeing input where the S token is randomized.
The trend observed is the one we expected: the more name movers are patched, the lower the original IO probability and the higher the rotated IO probability. Nonetheless, despite patching more than 10% of the heads, we are unable to reach σ(IO) probabilities of the same order as the original IO probability in large models. This is a limitation of the current experiments: we cannot totally rewrite the output with targeted causal interventions.
Note about negative heads: the bump visible in the plot for GPT2-small is due to Negative Name Movers. Adding Negative Name Movers causes the rotated IO probability to drop suddenly. They are likely also present in the other models, but there their effect is compensated by positive name movers.
I chose to design criteria for finding extended name movers that treat positive and negative heads equally -- negative heads being those that systematically write against the direction of the most probable answer, the IO token in this case. Negative and positive heads create an equilibrium, and there is no obvious reason to separate them if they play the same role with opposing effects. If we understand the underlying mechanism well enough, we should be able to steer the whole equilibrium in the direction we want.
I expect that if we filter out the negative heads, we can make large models output σ(IO) with high probability by patching only a few heads. But I am not interested in such targeted rewrites, as they push the model toward an unnatural equilibrium.
Targeted rewrites of sender components
In the semantic maps, we identified heads that encode the position, the S token, and the S gender. We hypothesized that they might play a role analogous to the S-Inhibition heads in IOI: they would send information related to S -- its position, value, or gender -- making the name movers' attention specific to the IO token while inhibiting attention to S. The goal of this rewrite is to flip the token outputted by the model by making the extended name movers copy the S token instead of IO.
Here, we’ll test this hypothesis by defining sender components and performing targeted rewrites on them.
A component C (attention head or MLP block) is a sender of the feature f iff
Compared to the definition of extended name movers, we added the constraint about importance. This ensures that we are not picking up on heads that track the feature f but are only weakly used in this context.
Another crucial difference is that we allow MLP blocks to be senders. As observed in the semantic maps, their Rand indices sometimes correlate with S features, so it makes sense to include them.
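A minimal sketch of this filter, assuming a per-component Rand index with the feature f and an importance score (e.g. the effect on the output when the component is patched) have already been computed. The component names, numbers, and thresholds are made up for illustration.

```python
# Hedged sketch of the sender-component filter described above.
from typing import Dict, List, Tuple

def find_senders(stats: Dict[str, Tuple[float, float]],
                 rand_threshold: float = 0.5,
                 importance_threshold: float = 0.1) -> List[str]:
    """Keep components (heads or MLP blocks) whose swap graph clusters
    by the feature f AND whose patching effect is non-negligible."""
    return [c for c, (rand_idx, importance) in stats.items()
            if rand_idx >= rand_threshold and importance >= importance_threshold]

# Toy per-component statistics: (Rand index with f, importance).
stats = {
    "L8H6": (0.9, 0.4),   # tracks f and is important: a sender
    "MLP7": (0.7, 0.02),  # tracks f but only weakly used: filtered out
    "L3H1": (0.1, 0.3),   # important but does not track f
}
senders = find_senders(stats)
```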
We gathered the senders of the ABB/BAB template type, the S token, and the S gender. Then we flipped the information contained in the sender components so that, instead of matching the characteristics of the S1 position and the S token, it matches the characteristics of IO.
Here are the results for various choices of p. In addition to performing all three swaps at the same time (red line), we also conducted them individually to see which type of sender has the most impact (orange, purple, and green lines). We included a baseline where, instead of swapping the inputs to the senders so that their output matches the characteristics of IO, we simply ablate them by arbitrarily resampling their input (light blue).
In this case, the logit diff is the natural metric to quantify the success of the operation, as it directly measures how much the model outputs S more than IO. You can still find the results with more metrics in the drive folder.
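For concreteness, the logit difference can be computed as below. The vocabulary indices and the toy logits are made up; only the metric itself is faithful.

```python
# Hedged sketch of the logit-difference metric on toy logits.
from typing import Sequence

def logit_diff(logits: Sequence[float], io_idx: int, s_idx: int) -> float:
    """Positive when the model favors IO over S at the final position;
    a successful sender rewrite should push this below zero."""
    return logits[io_idx] - logits[s_idx]

toy_logits = [0.1, 3.2, 1.0, -0.5]  # hypothetical final-position logits
score = logit_diff(toy_logits, io_idx=1, s_idx=2)
```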
Given our naive assumption that the conclusions about IOI on GPT2-small transfer to bigger models -- without, e.g., checking that the senders compose with extended name movers, or that extended name movers behave as in GPT2-small -- it's a surprisingly successful experiment.
There are many hypotheses about why targeted rewrites on large models don't work as well. For instance, large models might be more robust to patching in general, or the features we rewrote might fit the components' actual role less well. Another likely hypothesis is that we only rewrite one information route used to predict IO, while larger models use additional circuits.
Swap graphs for exploratory interp
The high throughput of bits makes swap graphs well-suited to design and test hypotheses about the role of components in neural networks. I think that this technique can become a crucial tool for exploratory mech interp and could improve on other techniques (hypothesis-driven patching, logit lens, etc) in many cases. I’d be excited to see it applied to other problems!
Toy theoretical model of localized computation
I found it useful for pedagogical purposes and for quickly coming up with counter-examples to the naive interpretation of swap graphs. In general, it seems sane to link messy experiments on neural networks back to theoretical cases that distill one difficult aspect of the problem we're trying to solve.
Swap graphs to understand large models
As shown with the causal scrubbing results, we can easily recover from swap graphs sets of inputs that are invariant under resampling. This is an important empirical foundation that legitimizes the technique despite the less pronounced modularity of the graphs. We can also use them to define groups of components that encode the same variables and perform targeted rewrites.
They could be a promising tool to discover high-level structures in TLM internals that appear across scales and models, such as groups of components encoding the same intermediate variable in many different contexts. This could provide an exciting extension to the behavioral induction heads introduced by Anthropic.
Throughout this work, I mostly rushed to scale up the approach and extract information as fast as possible from the swap graphs without refining the process along the way. My rationale was that if swap graphs contain enough signal, I would not need advanced network processing. The results are encouraging and it’s likely that we could obtain better results using more advanced network processing.
Nonetheless, the approach is limited: we are not able to arbitrarily steer the model given our knowledge from swap graphs. I think this is mostly due to the fact that many swap graphs in large models don't correlate well enough with any known feature. This might be the single most important limitation. To move further, one could patch smaller components that are more likely to correspond to clean variables (e.g. MLP neurons, or component outputs projected into subspaces). Another way could be refining our interpretation of swap graphs to deal with the case where a component encodes multiple variables at the same time.
A disordered list of possible extensions.
To be closer to the IOI paper, we could use the logit difference as the metric to compare the output between the patched and original model. More precisely, we have:
Here is the swap graph with this metric for the Name Mover head 9.9 (using the Gaussian kernel to turn distances into graph weights).
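As a sketch of how the kernel works (numpy is assumed; the choice of bandwidth as a percentile of the distance distribution mirrors the post, but the numbers here are illustrative):

```python
import numpy as np

def gaussian_weights(distances: np.ndarray, percentile: float = 25.0) -> np.ndarray:
    """Turn patching distances into edge weights: small distances map to
    weights near 1. Lowering the percentile shrinks sigma, so only the
    very closest input pairs keep a non-negligible edge."""
    sigma = np.percentile(distances, percentile)
    return np.exp(-(distances ** 2) / (2 * sigma ** 2))

d = np.array([0.01, 0.1, 0.5, 1.0, 2.0])  # toy patching distances
w = gaussian_weights(d)  # weights with the 25th-percentile bandwidth
```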
We observe a correlation between position and color, but the clusters are not as clear as when using the KL divergence. The difference is even more striking when we reduce the standard deviation of the Gaussian kernel from the 25th percentile to the 5th percentile of the distance distribution:
No clear clusters at all. Here is the KL divergence swap graph for comparison, using the same 5th-percentile kernel.
With the KL divergence, reducing the sigma makes the clusters extremely distinct. The vast majority of the input pairs that lead to swaps with KL divergence below the 5th percentile are pairs of sentences that share the same IO token (modulo some outliers involving the "bone" object, e.g. the blue dot in the middle of the red, purple and green clusters).
This is not true for the logit difference: some pairs of inputs can be swapped without much impact on the logit difference (<5th percentile) despite involving different IO tokens.
I don't have a good understanding of why this is the case, but two factors made me decide not to use the logit diff for the rest of the study without worrying too much about this observation:
Nonetheless, despite the disappointing results on Name Movers, partial investigation of heads further down suggests that the observations drawn from swap graphs on IOI hold with logit difference. For instance, here is the swap graph of the S-Inhibition head output.
Swap graphs are directed. In this context, a clique is a subset of vertices such that every two distinct vertices are connected in both ways.
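This bidirectional clique condition can be sketched as follows (the edge set and vertices are a toy illustration):

```python
def is_directed_clique(vertices, edges) -> bool:
    """`edges` is a set of directed (u, v) pairs; a clique here requires
    both u->v AND v->u for every pair of distinct vertices."""
    return all((u, v) in edges and (v, u) in edges
               for u in vertices for v in vertices if u != v)

# Toy directed graph: {0, 1, 2} is fully connected in both directions.
edges = {(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)}
```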
Implementation is available here.