This is a link post for the Dai et al. paper “Knowledge Neurons in Pretrained Transformers” that was published on the arXiv last month. I think this paper is probably the most exciting machine learning paper I've read so far this year and I'd highly recommend others check it out as well. Edit: Maybe not; I think Paul's skeptical take here is quite reasonable.
To start with, here are some of the basic things that the paper demonstrates:
- BERT has specific neurons, which the authors call “knowledge neurons,” in its feed-forward layers that store relational facts (e.g. “the capital of Azerbaijan is Baku”) such that controlling knowledge neuron activations up-weights/down-weights the correct answer in relational knowledge prompts (e.g. “Baku” in “the capital of Azerbaijan is <mask>”) even when the syntax of the prompt is changed—and the prompts that most activate the knowledge neuron all contain the relevant relational fact.
- Knowledge neurons can reliably be identified via a well-justified integrated gradients attribution method (see also “Self-Attention Attribution”).
- In general, the feed-forward layers of transformer models can be thought of as key-value stores that memorize relevant information, sometimes semantic and sometimes syntactic (see also “Transformer Feed-Forward Layers Are Key-Value Memories”), such that each knowledge neuron is composed of a “key” (its weights in the first layer, prior to the activation function) and a “value” (its weights in the second layer, after the activation function).
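To make the key-value picture concrete, here's a minimal PyTorch sketch of a BERT-base-style feed-forward block viewed this way (the variable names `W_in`, `W_out`, and `neuron_idx` are mine, not the paper's):

```python
import torch
import torch.nn as nn

# A BERT-base-style feed-forward block, viewed as a key-value store:
# each hidden neuron i has a "key" (row i of the first weight matrix) that is
# matched against the token representation, and a "value" (column i of the
# second weight matrix) that gets written back in proportion to how strongly
# the key fires.
d_model, d_ff = 768, 3072

W_in = nn.Linear(d_model, d_ff)   # keys: W_in.weight has shape (d_ff, d_model)
W_out = nn.Linear(d_ff, d_model)  # values: W_out.weight has shape (d_model, d_ff)
act = nn.GELU()

def ffn(x):
    # x: (seq_len, d_model) token representations entering this layer
    key_scores = act(W_in(x))     # (seq_len, d_ff): how strongly each key fires
    return W_out(key_scores)      # weighted sum of the per-neuron value vectors

# The "knowledge neuron" picture singles out individual coordinates of d_ff:
neuron_idx = 42                          # illustrative index
key_vec = W_in.weight[neuron_idx]        # what inputs this neuron responds to
value_vec = W_out.weight[:, neuron_idx]  # what it adds back to the residual stream when it fires
```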
The paper's key results, however (at least as I see them), are the following:
- Taking knowledge neurons that encode “the [relation] of [x] is [y]” and literally just adding Emb(y') - Emb(y) to the value neurons (where Emb(y) and Emb(y') are just the embeddings of y and y') actually changes the knowledge encoded in the network, such that it now responds to “the [relation] of [x] is <mask>” (and other semantically equivalent prompts) with y' instead of y (see the sketch after this list).
- For a given relation (e.g. “place of birth”), if all knowledge neurons encoding that relation (which ends up being a relatively small number, e.g. 5-30) have their value neurons effectively erased, the model loses the ability to predict the majority of relational knowledge involving that relation (e.g. 40-60%).
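Here's a rough sketch of what these two interventions amount to under the key-value picture above. The function names, and the `W_out`, `emb` (input embedding matrix), `tok` (tokenizer), and neuron-id arguments are all placeholders of mine, and the paper's exact scaling coefficients may differ:

```python
import torch

def edit_fact(W_out, emb, knowledge_neuron_ids, old_id, new_id):
    """Shift each identified value vector away from the old answer's embedding
    and toward the new answer's (e.g. Baku -> London). This is the simplest
    possible version; the paper's exact scaling may differ."""
    with torch.no_grad():
        for i in knowledge_neuron_ids:
            W_out.weight[:, i] += emb[new_id] - emb[old_id]

def erase_relation(W_out, relation_neuron_ids):
    """Zero out the value vectors of every knowledge neuron identified for a
    whole relation (e.g. 'place of birth')."""
    with torch.no_grad():
        for i in relation_neuron_ids:
            W_out.weight[:, i] = 0.0
```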
I think the first of these two results, in particular, is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications (if it can actually be scaled up to remove all the relevant knowledge), since something like that could eventually be used to ensure that a microscope AI isn't modeling humans, or to ensure that an agent is myopic in the sense that it isn't modeling the future.
Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.
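If it helps, logit-lens-style decoding of a single value vector is just a projection through the embedding matrix. Here's a sketch assuming tied input/output embeddings, reusing the placeholder names from the snippets above (`W_out`, `emb`, `tok`):

```python
import torch

def decode_value_vector(W_out, emb, tok, neuron_idx, k=10):
    """Project one neuron's value vector through the (tied) embedding matrix
    and return the tokens it most strongly promotes."""
    value_vec = W_out.weight[:, neuron_idx]   # (d_model,)
    logits = emb @ value_vec                  # (vocab_size,) dot product with each token embedding
    top_ids = torch.topk(logits, k=k).indices.tolist()
    return tok.convert_ids_to_tokens(top_ids)
```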
I'm inclined to be more skeptical of these results.
I agree that this paper demonstrates that it's possible to interfere with a small number of neurons in order to mess up retrieval of a particular fact (roughly 6 out of the 40k MLP neurons, if I understand correctly), which definitely tells you something about what the model is doing.
But beyond that, I think the inferences are dicier.
Yeah, agreed, though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (not that I'm saying they have necessarily done that).
That's a good point. I didn't look super carefully at their number there, but taking a closer look, I agree that it does seem rather large.
I also thought this was somewhat strange and am not sure what to make of it.
I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.
Perhaps I'm too trusting—I agree that everything you're describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.
Despite agreeing that the results are impressive, I'm less optimistic than you are about this path to microscope AI and/or myopia. Getting there would require an exhaustive listing of what we don't want the model to know (like human modeling or human manipulation) and a way of deleting that knowledge that doesn't break the whole network. The first requirement seems like a deal-breaker to me, and I'm not convinced this work actually provides much evidence that more advanced knowledge can be removed that way.
Here too, I agree with the sentiment, but I'm not convinced that this is the whole story. This looks like how structured facts are learned, but I see no way as of now to generate the range of stuff GPT-3 and other LMs can do from just key-value knowledge pairs.
Thanks for the link. This has been on my reading list for a little while, and your recommendation tipped me over the edge.
Mostly I agree with Paul's concerns about this paper.
However, I did find the "Transformer Feed-Forward Layers Are Key-Value Memories" paper they reference more interesting: it's more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it's IMO stronger evidence for the hypothesis, although not conclusive by any means.
Some of the experiments they show are worth looking at directly.
I also find it very intriguing that you can just decode the value distributions using the embedding matrix a la Logit Lens.