EIS IX: Interpretability and Adversaries

Adam Jermyn (3y)

Second, the measure of “features per dimension” used by Elhage et al. (2022) might be misleading. See the paper for details of how they arrived at this quantity. But as shown in the figure above, “features per dimension” is defined as the Frobenius norm of the weight matrix before the layer divided by the number of neurons in the layer. But there is a simple sanity check that this doesn’t pass. In the case of a ReLU network without bias terms, multiplying a weight matrix by a constant factor will cause the “features per dimension” to be increased by that factor squared while leaving the activations in the forward pass unchanged up to linearity until a non-ReLU operation (like a softmax) is performed. And since each component of a softmax’s output is strictly increasing in that component of the input, scaling weight matrices will not affect the classification.

It's worth noting that Elhage+2022 studied an autoencoder with tied weights and no softmax, so there isn't actually freedom to rescale the weight matrix without affecting the loss in their model, making the scale of the weights meaningful. I agree that this measure doesn't generalize to other models/tasks though.
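A minimal sketch of the two cases being contrasted here, under stated assumptions. It reads the quoted measure as the squared Frobenius norm divided by the number of neurons, uses a hypothetical bias-free two-layer ReLU classifier, and models the tied-weight autoencoder as `x_hat = ReLU(W^T W x + b)`. All function names, shapes, and the scale factor `c` are illustrative and do not come from either paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def relu_classifier_logits(X, W1, W2):
    # Bias-free two-layer ReLU network; rows of X are examples.
    return relu(X @ W1.T) @ W2.T

def tied_autoencoder_loss(X, W, b):
    # Tied weights and no softmax: x_hat = ReLU(W^T W x + b), mean squared error.
    X_hat = relu(X @ W.T @ W + b)
    return float(np.mean((X - X_hat) ** 2))

def squared_frobenius_per_neuron(W):
    # Assumed reading of the quoted measure: ||W||_F^2 / (number of neurons).
    return float(np.sum(W ** 2) / W.shape[0])

c = 3.0  # arbitrary positive rescaling factor

# Case 1: bias-free ReLU classifier. Rescaling W1 rescales the logits by c,
# so the predicted class is unchanged while the measure grows by c^2.
X_cls = rng.normal(size=(32, 8))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(5, 4))
logits = relu_classifier_logits(X_cls, W1, W2)
scaled_logits = relu_classifier_logits(X_cls, c * W1, W2)
predictions_unchanged = bool(
    np.array_equal(np.argmax(logits, axis=1), np.argmax(scaled_logits, axis=1))
)
logits_scale_linearly = bool(np.allclose(scaled_logits, c * logits))
measure_ratio = squared_frobenius_per_neuron(c * W1) / squared_frobenius_per_neuron(W1)

# Case 2: tied-weight autoencoder. The same rescaling changes the
# reconstruction, so the loss pins down the scale of W.
X_ae = rng.uniform(size=(64, 8))
W = 0.5 * rng.normal(size=(4, 8))
b = np.zeros(8)
loss = tied_autoencoder_loss(X_ae, W, b)
scaled_loss = tied_autoencoder_loss(X_ae, c * W, b)

print(predictions_unchanged, logits_scale_linearly, measure_ratio)
print(loss, scaled_loss)
```

In the classifier the weight scale is a free parameter as far as predictions go, which is what makes a norm-based measure suspect there. In the tied-weight autoencoder the loss changes under rescaling, so the norm carries information.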

They also define a more fine-grained measure (the dimensionality of each individual feature) in a way that is scale-invariant and which broadly agrees with their coarser measure...
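As a rough illustration of why a per-feature measure can be scale-invariant, the sketch below assumes the per-feature dimensionality takes the form `D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2`, where `W_i` is the i-th feature's direction in hidden space and `W_i_hat` is its unit vector. This is a recollection of the paper's definition rather than something stated in this thread, so treat the formula and the function name `feature_dimensionality` as assumptions.

```python
import numpy as np

def feature_dimensionality(W):
    # Columns of W are feature directions W_i in the hidden space.
    # Assumed definition: D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / norms                    # unit vector for each feature
    overlaps = W_hat.T @ W               # entry (i, j) = W_i_hat . W_j
    return norms ** 2 / np.sum(overlaps ** 2, axis=1)

# Two orthogonal features in two dimensions: each gets a full dimension.
orthogonal = feature_dimensionality(np.eye(2))

# Two antipodal features sharing one dimension: each gets half a dimension.
antipodal = feature_dimensionality(np.array([[1.0, -1.0]]))

# Rescaling W multiplies numerator and denominator by the same factor.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 6))
unscaled = feature_dimensionality(W)
rescaled = feature_dimensionality(3.0 * W)

print(orthogonal, antipodal)
print(np.allclose(unscaled, rescaled))
```

Both the numerator and the denominator pick up the same squared factor when `W` is rescaled, so the ratio is unaffected, unlike a raw norm-based measure.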

scasper (3y)

Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.

Charlie Steiner (3y)

The Exhibits might have been a nice place to use probabilistic reasoning. I too am too lazy to go through and try to guesstimate numbers though :)

Xander Davies (3y)

Fourth, and most importantly, if superposition happens more in narrower layers, and if superposition is a cause of adversarial vulnerabilities, this would predict that deep, narrow networks would be less adversarially robust than shallow, wide networks that achieve the same performance and have the same number of parameters. However, Huang et al., (2022) found the exact opposite to be the case.

I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks. I don't think I've seen this claim anywhere—if they learn all the same features and have the same number of neurons, I'd expect them to have similar amounts of superposition. Also, can you explain how the feature hypothesis "explains the results from Huang et al."?

More generally, I think superposition existing in toy models provides a plausible rationale for adversarial examples both being very common (even as we scale up models) and also being bugs. Given this and the Elhage et al. (2022) work (which is Bayesian evidence towards the bug hypothesis, despite the plausibility of confounders), I'm very surprised you come out with "Verdict: Moderate evidence in favor of the feature hypothesis."

scasper (3y)

We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.

A lot of how this is interpreted depends on the exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. neurons in a layer does not make these networks less robust than wide, shallow nets.

AI ALIGNMENT FORUM

The studies of interpretability and adversaries are inseparable.

1. More interpretable networks are more adversarially robust and more adversarially robust networks are more interpretable.

2. Interpretability tools can and should be used to guide the design of adversaries.

3. Adversarial examples can be useful interpretability tools.

4. Mechanistic interpretability and mechanistic adversarial examples are similar approaches for addressing deception and other insidious misalignment failures.

Are adversaries features or bugs?

Exhibit A: Robustness <--> interpretability

Exhibit B: Adversarial transferability

Exhibit C: Adversarial training and task performance

Exhibit D: Generalization from training on nonrobust features

Exhibit E: Genuine nonrobust features

Exhibit F: The superposition perspective

Exhibit G: Evidence from the neural tangent kernel

What does it all mean for interpretability?

Questions