Thanks for this write-up! In case it’s of interest, we have also performed some exploratory interpretability work using the SVD of model weights.  

We examine convolutional layers in models trained on a couple of common vision tasks (CIFAR-10, ImageNet). In short, we similarly take the SVD of the weights in a CNN layer, $W = U \Sigma V^\top$, and project the hidden layer activations $x$ onto the $i$th singular vector. These singular-direction "neurons" can then be studied with interpretability methods: we use hypergraphs, feature visualizations, and exemplary images. More detail can be found in The SVD of Convolutional Weights: A CNN Interpretability Framework, and you can explore the OpenAI Microscope-inspired demo we created for a VGG-16 trained on ImageNet here (under the "Feature Visualization" page).
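To make the projection step concrete, here is a minimal NumPy sketch. The shapes, variable names, and the choice of projecting the channel dimension onto a left singular vector are illustrative assumptions, not taken verbatim from the paper:

```python
import numpy as np

# Hypothetical shapes for illustration: a conv kernel with 64 output
# channels, 32 input channels, and 3x3 spatial extent, plus a batch of
# activations at that layer (N, C, H, W).
out_ch, in_ch, kh, kw = 64, 32, 3, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((out_ch, in_ch, kh, kw))
acts = rng.standard_normal((8, out_ch, 16, 16))

# Flatten the kernel so each row is one output channel's filter,
# then take the SVD: W_mat = U @ diag(S) @ Vt.
W_mat = W.reshape(out_ch, -1)
U, S, Vt = np.linalg.svd(W_mat, full_matrices=False)

# Project the channel dimension of the activations onto the i-th
# left singular vector to get the i-th "SVD neuron" activation map.
i = 0
svd_neuron = np.einsum('nchw,c->nhw', acts, U[:, i])
```

The resulting `svd_neuron` map can then be fed to the same feature-visualization and exemplary-image tooling one would use on an ordinary channel.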

To briefly highlight a few findings common to our work and this approach:

  1. We also find that the top singular direction is systematically less interpretable. For the ImageNet VGG-16 model, this direction tended to encode something like a fur/hair texture, which is common across many classes. For example, see the 0th SVD neuron for the VGG-16 layers features_14, features_21, features_24, and features_28 in our demo.
  2. We find (following Martin and Mahoney) a similar distribution of singular values.
  3. Qualitatively, the singular directions in the models we examined were at times more interpretable than neurons in the canonical basis.
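For reference, the singular-value distribution in point 2 can be inspected with a short sketch. The weight shape here is a hypothetical flattened conv kernel, and the scaling convention is an assumption; Martin and Mahoney study the empirical spectral density of the layer's correlation matrix, which corresponds to the squared singular values:

```python
import numpy as np

# Hypothetical flattened conv weight (out_channels x in_channels*kh*kw);
# stands in for a real trained layer.
rng = np.random.default_rng(0)
W_mat = rng.standard_normal((64, 288))

# Squared singular values of W, scaled by the wider dimension, give the
# eigenvalue spectrum whose tail behavior Martin & Mahoney characterize.
sv = np.linalg.svd(W_mat, compute_uv=False)
esd = np.sort(sv**2 / W_mat.shape[1])[::-1]
```

Plotting a histogram of `esd` on a log scale makes the heavy-tailed (or not) shape of a trained layer's spectrum easy to compare across models.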

And a couple questions we have: 

  • Should we expect interpretability via the SVD of weight matrices to be more effective for transformers because of their linear residual stream (as opposed to, e.g., ResNets or models without skip connections)?
  • There are probably scenarios where the decomposition is less appropriate. For example, how might the usefulness of this approach change when a model layer is less linear?