Makes sense! It depends on whether you're thinking of the values as "estimating zero ablation" or "estimating importance."


Very cool work!

- In the attention attribution section, you use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation's approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
- Currently, negative name movers end up with negative attributions, but we'd like them to be positive (since zero ablating them *helps* performance and moves our metric towards one), right?
- Of course, this doesn't matter when you are just looking at magnitudes.

- Cool to
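To make the sign question concrete, here is a minimal sketch of the linear estimate being discussed, using random NumPy arrays as stand-ins for a cached attention pattern and its gradient (the shapes and names are hypothetical, not the actual library cache):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a cached attention pattern and its gradient w.r.t. the
# metric. Hypothetical shape: [n_heads, query_pos, key_pos].
clean_pattern = rng.random((2, 4, 4))
clean_pattern_grad = rng.random((2, 4, 4))

# First-order estimate of the metric change from zero-ablating each
# attention entry: (0 - clean_pattern) * grad = -clean_pattern * grad.
zero_ablation_attr = -clean_pattern * clean_pattern_grad

# Aggregate over query/key positions to get one score per head.
per_head_attr = zero_ablation_attr.sum(axis=(1, 2))
```

Since attention patterns are non-negative, the sign of each entry is set entirely by the gradient, which is where the positive/negative name-mover question comes from.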


These bugs should be fixed, thanks for flagging!


Thanks! Yes, your description of zero ablation is correct. I think positive vs. negative is a matter of convention? To me "positive = important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.
And yeah, I would be excited to see this applied to mean ablation!
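For reference, the mean-ablation variant is the same first-order estimate with the mean activation as the patch value rather than zero. A sketch under the same hypothetical shapes (the mean pattern here is just a random stand-in for an average over some reference distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shape: [n_heads, query_pos, key_pos].
clean_pattern = rng.random((2, 4, 4))
clean_pattern_grad = rng.random((2, 4, 4))

# Stand-in for the attention pattern averaged over a reference
# distribution of prompts.
mean_pattern = rng.random((2, 4, 4))

# Linear estimate of the metric change from patching in the mean:
# (mean_pattern - clean_pattern) * grad.
mean_ablation_attr = (mean_pattern - clean_pattern) * clean_pattern_grad
```

Setting `mean_pattern` to zeros recovers the zero-ablation estimate above.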
Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...

I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks…