This is an accompanying blog post to work done by Callum McDougall, Arthur Conmy and Cody Rushing as part of SERI MATS 2023. The work was mentored by Neel Nanda and Tom McGrath. You can find our full paper at https://arxiv.org/abs/2310.04625. In this post we i) summarize our key results, ii) discuss limitations and future work, and iii) list lessons learnt from the project.
In the paper, we define copy suppression as the following algorithm:
If components in earlier layers predict a certain token, and this token appears earlier in the context, the attention head suppresses it.
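To make this concrete, here is a minimal sketch of the algorithm operating directly on logits. The function name and the `suppression_scale` parameter are illustrative assumptions for exposition; this is not the actual mechanism implemented by the head's weights (which acts via attention and the OV circuit, not a literal logit edit).

```python
import torch

def copy_suppression_sketch(
    logits: torch.Tensor,            # [seq_len, d_vocab] upstream predictions
    context_tokens: torch.Tensor,    # [seq_len] token ids of the context
    suppression_scale: float = 1.0,  # illustrative strength, not fitted to L10H7
) -> torch.Tensor:
    """Naive restatement of copy suppression: wherever the current top
    prediction has already appeared earlier in the context, push that
    token's logit down."""
    logits = logits.clone()
    for pos in range(1, len(context_tokens)):
        predicted = logits[pos].argmax().item()
        if predicted in context_tokens[:pos].tolist():
            logits[pos, predicted] -= suppression_scale
    return logits
```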
We show that attention head L10H7 in GPT-2 Small (and, to a lesser extent, L11H10) performs copy suppression across the whole distribution, and that this algorithm explains 76.9% of the head's behaviour.
For example, take the sentence "All's fair in love and war." If we ablate L10H7, the model will incorrectly predict "...love and love." A diagram of this process:
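If you want to reproduce this kind of example yourself, a sketch using the TransformerLens library is below. For simplicity it zero-ablates head 7's output at layer 10 (the paper's experiments use mean ablation), so treat it as illustrative rather than an exact reproduction of our setup.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "All's fair in love and"

def zero_head_7(z, hook):
    # z has shape [batch, seq, n_heads, d_head]; knock out head 7's output
    z[:, :, 7, :] = 0.0
    return z

clean_logits = model(prompt)
ablated_logits = model.run_with_hooks(
    prompt, fwd_hooks=[("blocks.10.attn.hook_z", zero_head_7)]
)

# Per the paper's example, the intact model should prefer " war", while the
# ablated model should prefer the no-longer-suppressed copy " love".
for name, logits in [("clean", clean_logits), ("L10H7 ablated", ablated_logits)]:
    top_token = logits[0, -1].argmax().item()
    print(f"{name}: {model.tokenizer.decode([top_token])!r}")
```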
And a few more examples:
To test how much of L10H7's behaviour we explained, we designed a form of ablation we called copy suppression preserving ablation (CSPA). The idea is that CSPA should delete all of the head's functionality except copy suppression. If this completely destroys performance, then the head isn't doing copy suppression; if it doesn't affect performance at all, then copy suppression fully explains the head's behaviour.
The technical details of CSPA can be found in our paper. Very loosely, CSPA does two important things:
The key result of this paper was the graph below. D_CSPA is the KL divergence between (predictions after applying CSPA) and (original predictions). D_MA is the KL divergence between (predictions after mean ablation) and (original predictions). Each point is the average of many such values, grouped according to the value of D_MA. In other words, each (x, y) point on the graph represents (average KL divergence from mean ablation, average KL divergence from CSPA) for a group of points. The line is extremely close to the x-axis, showing that copy suppression is a good description of what this head is doing.
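For readers who want to compute these metrics themselves, here is a minimal sketch of the per-position KL divergence between ablated and clean predictions. The direction of the divergence matches the phrasing above; check the paper for the exact definition used in our experiments.

```python
import torch
import torch.nn.functional as F

def ablation_kl(new_logits: torch.Tensor, clean_logits: torch.Tensor) -> torch.Tensor:
    """Per-position KL(P_new || P_clean), where P_new comes from an ablated
    forward pass (CSPA or mean ablation) and P_clean from the unmodified
    model. Both logit tensors have shape [seq_len, d_vocab]."""
    log_p_new = F.log_softmax(new_logits, dim=-1)
    log_p_clean = F.log_softmax(clean_logits, dim=-1)
    return (log_p_new.exp() * (log_p_new - log_p_clean)).sum(dim=-1)
```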
Anthropic's paper *In-context Learning and Induction Heads* observed the existence of anti-induction heads, which seemed to attend to repeated prefixes but suppress the suffix rather than boost it. For example, in the prompt "Barack Obama ... Barack", they would attend to the Obama token and reduce the probability assigned to the next token being Obama. This sounded a lot like copy suppression to us, so we compared the anti-induction scores to the copy suppression scores (measured by how much the indirect object token is suppressed in the IOI task), and we found a strong correlation in the quadrant where both scores were positive:
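A sketch of the comparison we ran is below, assuming you've already computed a per-head anti-induction score and copy suppression score. The helper function is hypothetical; the score tensors are inputs you'd compute yourself, and nothing here reproduces the paper's actual numbers.

```python
import torch

def quadrant_correlation(anti_induction: torch.Tensor,
                         copy_suppression: torch.Tensor) -> float:
    """Pearson correlation between two per-head scores, restricted to the
    quadrant where both scores are positive."""
    mask = (anti_induction > 0) & (copy_suppression > 0)
    x = anti_induction[mask]
    y = copy_suppression[mask]
    # Standardize, then take the mean product (Pearson correlation)
    x = (x - x.mean()) / x.std(unbiased=False)
    y = (y - y.mean()) / y.std(unbiased=False)
    return (x * y).mean().item()
```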
This is particularly interesting because it shows how model components can have an important effect on a particular distribution despite having no task-specific knowledge. The general copy suppression algorithm of "attend to and suppress previous instances of the currently-predicted token" seems to be responsible for both of the distribution-specific behaviours we observe in the graph above.
Another cool finding: models know when tokens are "semantically similar". The predicted token doesn't have to be identical to the token attended to; the two can differ by capitalization, pluralization, verb tense, a prepended space, etc. This isn't fully explained by the cosine similarity of the embedding vectors: semantically similar words are copy-suppressed more than their cosine similarity suggests.
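You can check the embedding cosine similarities yourself; a minimal sketch is below. The token pairs are illustrative, and we assume each string is a single token in GPT-2's vocabulary (`to_single_token` will raise an error otherwise).

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_E = model.W_E  # embedding matrix, shape [d_vocab, d_model]

def embed_cosine(tok_a: str, tok_b: str) -> float:
    # Compares the raw embedding directions of two (assumed single-token) strings
    a = W_E[model.to_single_token(tok_a)]
    b = W_E[model.to_single_token(tok_b)]
    return torch.cosine_similarity(a, b, dim=0).item()

print(embed_cosine(" love", " Love"))   # capitalization variants
print(embed_cosine(" word", " words"))  # pluralization variants
```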
Copy suppression turns out to be an important part of how models do self-repair, a phenomenon where ablating components leads to later components compensating for the ablation in the same forward pass. Using the Indirect Object Identification (IOI) task as a case study, we highlight how, when you ablate copying, there is nothing left to copy-suppress: self-repair! This helps resolve the mystery of negative head self-repair, first introduced in the original IOI paper.
We run some experiments showing that copy suppression explains some, but not all, of the self-repair occurring in the IOI task. We also use Q-composition to identify this mechanism at a weight-based level. We were left with many more questions than we started with, though, and are excited for future work looking into self-repair!
We also created interactive visualisations to accompany this paper. You can see them on our Streamlit app. This app gives you the ability to do things like:
While we think our work is the most comprehensive description of an attention head in the wild to date, our description is not perfect (76.9% performance recovery), and we weren't able to explain every observation we made during the project. We highlight the key limitations of our work and point to future directions we think are exciting and tractable given our progress.
I'll conclude with a few high-level takeaways from this research project. These will probably be of most interest to other researchers, e.g. those on SERI MATS projects.
Note that all of these are specific to our project and might not apply to everyone. It's entirely possible that, had we been given this list at the start of the project and followed these points aggressively, we'd have hit different failure modes.
Some additional takeaways from Cody Rushing:
This post is not sponsored by Streamlit in any way, shape or form.