Compact Proofs of Model Performance via Mechanistic Interpretability

rajashree; Adrià Garriga-alonso; Jason Gross

I believe what you describe is effectively causal scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.

On our particular model, this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On other toy models we've looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise has a steep drop-off in bound-tightness (as a function of how compact a proof the noise term comes from) in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can't prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don't capture enough of the behavior.

We think that our work serves as an example of how to leverage a downstream task to pick a metric for evaluating mechanistic interpretations. Specifically, formally proving that an explanation captures why the model has a particular behavior can be thought of as a pessimal ablation of the parts the explanation claims are unimportant.^[2] That is, if we can replace the unimportant parts of the model with their worst possible values (relative to our performance metric) while maintaining performance, this provides a proof that our model implements the same behavior as in our explanation.

Compare to the zero, mean, or resample ablation, where we replace the unimportant parts of the model with zeros, their mean values, or randomly sampled values from other data points.
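
To make the contrast between these ablations concrete, here is a minimal sketch on a made-up linear classifier (not the model from the post). The names `W_signal` and `W_noise` and all sizes are hypothetical. The "pessimal" variant here only takes the worst case over noise values observed in the dataset, which is a much weaker stand-in for the worst case over all possible values that a proof would need.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 200, 8, 4                        # data points, input dim, classes (hypothetical)
X = rng.normal(size=(n, d))
W_signal = rng.normal(size=(d, c))         # the part the explanation accounts for
W_noise = 0.3 * rng.normal(size=(d, c))    # the part the explanation claims is unimportant

signal = X @ W_signal                      # shape (n, c)
noise = X @ W_noise                        # shape (n, c)
labels = signal.argmax(axis=1)             # ground truth defined by the signal path alone

def accuracy(logits):
    """Fraction of rows whose argmax matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

acc_original = accuracy(signal + noise)                       # no ablation
acc_zero = accuracy(signal + 0.0)                             # zero ablation
acc_mean = accuracy(signal + noise.mean(axis=0))              # mean ablation
acc_resample = accuracy(signal + noise[rng.permutation(n)])   # resample ablation

# Pessimal (over observed values only): input i counts as correct only if it
# stays correct for every candidate noise vector j drawn from the dataset.
all_logits = signal[:, None, :] + noise[None, :, :]           # shape (n, n, c)
correct = all_logits.argmax(axis=2) == labels[:, None]        # shape (n, n)
acc_pessimal = float(correct.all(axis=1).mean())

print(acc_original, acc_zero, acc_mean, acc_resample, acc_pessimal)
```

By construction, the pessimal accuracy in this sketch can never exceed the resample or unablated accuracy, since each of those uses one of the candidate values the worst case ranges over.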

RogerDearnaley (1y)

Although the residuals for each of the four component matrices (after removing the first two principal components) are both small and seem to be noise, proving that there's no structure that causes the noise to interact constructively when we multiply the matrices and “blow up” is hard.

Have you tried replacing what you believe is noise with actual random noise, with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model that has had all potential structure that you hypothesize is just noise replaced, where you know the noise hypothesis is true.
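
A rough sketch of the replacement described above, under stated assumptions: the matrix `M` is a synthetic stand-in (rank-2 structure plus small noise), "similar statistical properties" is taken to mean matching only the entrywise mean and standard deviation of the residual, and the helper names `split_low_rank` and `randomize_residual` are hypothetical. Evaluating the resulting model's performance is not shown.

```python
import numpy as np

def split_low_rank(M, rank=2):
    """Return (rank-`rank` SVD approximation of M, residual)."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    low = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return low, M - low

def randomize_residual(M, rng, rank=2):
    """Keep the top principal components and swap the residual for i.i.d.
    Gaussian noise with the same entrywise mean and standard deviation."""
    low, resid = split_low_rank(M, rank)
    fake = rng.normal(resid.mean(), resid.std(), size=resid.shape)
    return low + fake

rng = np.random.default_rng(0)
# Hypothetical stand-in for one component matrix.
M = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64)) + 0.01 * rng.normal(size=(64, 64))

low, resid = split_low_rank(M)
M_random = randomize_residual(M, rng)
fake_resid = M_random - low                # the noise that replaced the residual

print("original residual std:", resid.std())
print("replacement noise std:", fake_resid.std())
```

Matching only the first two moments is a design choice for brevity; a closer match (for example, preserving the residual's singular value spectrum) would be a different and stricter notion of "similar".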

Jason Gross (1y)

| Description of Proof | Complexity Cost | Bound | Est. FLOPs | Unexplained Dimensions |
| --- | --- | --- | --- | --- |
| Brute force | $O(v^{k+1} k d)$ | 0.9992 ± 0.0015 | $2^{40}$ | $2^{30}$ |
| Cubic | $O(v^{3} k^{2})$ | 0.9845 ± 0.0041 | $2^{25}$ | $2^{14}$ |
| Sub-cubic | $O(v^{2} k^{2} + v^{2} d)$ | 0.832 ± 0.011 | $2^{21}$ | $2^{13}$ |
| (without mean+diff) | $O(v^{2} k^{2} + v^{2} d)$ | 0.758 ± 0.039 | $2^{21}$ | $2^{13}$ |
| Low-rank QK | $O(v^{2} k^{2} + v d^{2} + \text{(EU\&OV)}\, v^{2} d)$ | 0.806 ± 0.013 | $2^{21}$ | $2^{12}$ |
| (SVD only) | $O(v^{2} k^{2} + v d^{2} + \text{(EU\&OV)}\, v^{2} d)$ | 0.643 ± 0.044 | $2^{22}$ | $2^{12}$ |
| Low-rank EU | $O(v^{2} k^{2} + v d + \text{(QK\&OV)}\, v^{2} d)$ | 0.662 ± 0.061 | $2^{21}$ | $2^{13}$ |
| (SVD only) | $O(v^{2} k^{2} + v d + \text{(QK\&OV)}\, v^{2} d)$ | $(3.38 \pm 0.06) \times 10^{-6}$ | $2^{21}$ | $2^{13}$ |
| Low-rank QK&EU | $O(v^{2} k^{2} + v d^{2} + \text{(OV)}\, v^{2} d)$ | 0.627 ± 0.060 | $2^{21}$ | $2^{13}$ |
| (SVD only) | $O(v^{2} k^{2} + v d^{2} + \text{(OV)}\, v^{2} d)$ | $(3.38 \pm 0.06) \times 10^{-6}$ | $2^{22}$ | $2^{13}$ |
| Quadratic QK | $O(v^{2} k^{2} + v d + \text{(EU\&OV)}\, v^{2} d)$ | 0.407 ± 0.032 | $2^{21}$ | $2^{12}$ |
| Quadratic QK&EU | $O(v^{2} k^{2} + v d + \text{(OV)}\, v^{2} d)$ | 0.303 ± 0.036 | $2^{21}$ | $2^{13}$ |
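
As a rough worked example of how the asymptotic costs in the first three rows compare, the sketch below evaluates their leading-order terms. The table does not define its symbols here, so reading $v$ as vocabulary size, $k$ as sequence length, and $d$ as model dimension is an assumption, as are the illustrative values `v=64, k=4, d=32` and the helper name `leading_terms`. Big-O expressions drop constant factors, so these numbers are not expected to match the Est. FLOPs column.

```python
import math

def leading_terms(v, k, d):
    """Leading-order terms of three complexity costs, constants dropped.
    v, k, d are assumed to be vocab size, sequence length, model dimension."""
    return {
        "Brute force": v ** (k + 1) * k * d,
        "Cubic": v ** 3 * k ** 2,
        "Sub-cubic": v ** 2 * k ** 2 + v ** 2 * d,
    }

costs = leading_terms(v=64, k=4, d=32)     # illustrative values, not from the table
for name, cost in costs.items():
    print(f"{name}: about 2^{math.log2(cost):.1f}")
```

The ordering of the three values mirrors the ordering of the Est. FLOPs column for the same rows, which is the only comparison this sketch is meant to support.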

AI ALIGNMENT FORUM

Paper abstract

Introduction

Correspondence vs compression

How to compact a proof

Proofs on a toy model

Reasoning about error in compressing the weights

Our takeaways

Citation Info