This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so suggestions for setup changes, further experiments, or new ideas are welcome, as are pointers to relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.
The code is available on GitHub (still a rough draft).
We started out with the question: How does GPT-2 know when to use the word "an" over "a"? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 can only output one word at a time.
We still don’t have a full answer, but we did find a single MLP neuron in GPT-2 Large that is crucial for predicting the token " an". We also found that the weights of this neuron correspond to the embedding of the " an" token, which led us to find other neurons that each predict a specific token.
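To make the idea of neuron weights "corresponding to" a token embedding concrete, here is a minimal sketch. It assumes that the correspondence can be measured as a dot product between a neuron's output weights and each token's embedding, which is our illustrative reading rather than a description of the exact method. The embeddings, weights, and the helper `top_token_for_neuron` are all hypothetical toy values, not taken from GPT-2.

```python
# Toy sketch: every vector and name below is hypothetical, not real GPT-2 data.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_token_for_neuron(neuron_out_weights, token_embeddings):
    """Return the (token, score) pair whose embedding has the largest
    dot product with the neuron's output weights."""
    scores = {tok: dot(neuron_out_weights, emb)
              for tok, emb in token_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical 3-dimensional embeddings for three tokens
# (note the leading space, which is part of each token).
token_embeddings = {
    " a": [1.0, 0.0, 0.0],
    " an": [0.0, 1.0, 0.0],
    " the": [0.0, 0.0, 1.0],
}

# Hypothetical output weights for two neurons.
neuron_out_weights = {
    "neuron_0": [0.1, 0.9, 0.2],   # points mostly along the " an" direction
    "neuron_1": [0.5, 0.4, 0.45],  # no strongly preferred token
}

for name, weights in neuron_out_weights.items():
    token, score = top_token_for_neuron(weights, token_embeddings)
    print(name, repr(token), round(score, 2))
```

In this toy setup, `neuron_0` scores highest on " an" by a wide margin, while `neuron_1` has nearly equal scores for all three tokens. Running the same kind of search over every neuron in a real model is one way to look for other neurons that favor a specific token.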
It was surprisingly hard to think of a prompt where GPT-2 would output " an" (the leading space is part of the token)...