Calculating l, the maximal number of simultaneously active features, yields strange results. For example, with 100 features and 100 neurons, l has to be < 100/(8 * ln(100)) ≈ 2.7. But I would expect that all 100 features can be simultaneously active: with 100 dimensions, the features can be orthogonal and independent. Am I misunderstanding something?
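
To make the arithmetic concrete, here is a minimal Python sketch of the bound as I am reading it. The function name `max_active_features` is hypothetical, and I am assuming the numerator is the number of neurons and the logarithm is taken over the number of features; since both are 100 in this example, the comment above does not settle which is which.

```python
import math


def max_active_features(num_neurons: int, num_features: int) -> float:
    """Upper bound on l, read as num_neurons / (8 * ln(num_features)).

    Assumption: which quantity goes in the numerator and which inside the
    logarithm is my reading of the formula, not something stated explicitly.
    """
    return num_neurons / (8 * math.log(num_features))


bound = max_active_features(num_neurons=100, num_features=100)
print(f"l must be below {bound:.2f}")
```

With 100 neurons and 100 features this gives a bound of roughly 2.7, so the formula allows at most 2 simultaneously active features, far fewer than the 100 that an orthogonal basis would support.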
