Lee Sharkey has not written any posts yet.

This is one of the major research questions that will be important to answer before polytopes can be really useful in mechanistic descriptions.
By choosing to use clustering rather than dimensionality reduction methods, we took a non-decompositional approach here. Clustering was motivated primarily by wanting to capture the monosemanticity of local regions in neural networks. But the ‘monosemanticity’ that I’m talking about here refers to the fact that small regions of activation mean one thing on one level of abstraction; this ‘one thing’ could be a combination of features. This therefore isn’t to say that small regions of activation space represent only one feature on a lower level of abstraction. Small regions of...
Thanks for your interest!
Shouldn't this create strong regularisation favouring using meaningful directions over meaningful polytopes?
Yes, that seems reasonable!
One thing we want to emphasize is that it's perfectly possible to have both meaningful directions and meaningful polytopes. For instance, if all polytope boundaries intersect the origin, then all polytopes will be unbounded. In that case, polytopes will essentially be directions!
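A minimal sketch of this point, assuming a toy ReLU MLP: the layer sizes, the random weights and the helper `activation_pattern` are all hypothetical and chosen only for illustration. With zero biases every boundary passes through the origin, so rescaling an input by a positive constant should leave its activation pattern (and hence its polytope) unchanged. With a nonzero bias, the same rescaling can cross a boundary.

```python
import numpy as np


def activation_pattern(x, weights, biases):
    """Return the on/off state of every ReLU unit for input x (hypothetical helper)."""
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        pattern.append(pre > 0)      # which side of each boundary we are on
        h = np.maximum(pre, 0.0)     # ReLU
    return np.concatenate(pattern)


rng = np.random.default_rng(0)

# Bias-free toy network: all polytope boundaries pass through the origin.
weights = [rng.standard_normal((8, 4)), rng.standard_normal((8, 8))]
zero_biases = [np.zeros(8), np.zeros(8)]

x = rng.standard_normal(4)
base = activation_pattern(x, weights, zero_biases)
same_under_scaling = all(
    np.array_equal(base, activation_pattern(c * x, weights, zero_biases))
    for c in (0.01, 0.5, 3.0, 1000.0)
)

# One unit with a bias: the boundary sits at x = 1, away from the origin,
# so scaling the input can move it into a different polytope.
w_biased = [np.array([[1.0]])]
b_biased = [np.array([-1.0])]
x_small = np.array([0.5])
differs_with_bias = not np.array_equal(
    activation_pattern(x_small, w_biased, b_biased),
    activation_pattern(4.0 * x_small, w_biased, b_biased),
)

print("pattern invariant to scaling without biases:", same_under_scaling)
print("pattern changes under scaling with a bias:", differs_with_bias)
```

In the bias-free case each polytope is a cone from the origin, which is the sense in which it behaves like a direction.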
The polytope lens only becomes relevant when trying to explain what perfectly linear models can't account for. Although LN might create a bias toward directions, each layer is still nonlinear; nonlinearities probably still need to be accounted for somewhere in our explanations.
All this said, we haven't thought a lot about LN in this context. It'd be great to know if this regularisation is real and if it's strong enough that we can reason about networks without thinking about polytopes.
For GPT2-small, we selected 6/1024 tokens in each sequence (evenly spaced apart and not including the first 100 tokens), and clustered on the entire MLP hidden dimension (4 * 768).
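As a rough sketch of the GPT2-small sampling described above: the exact spacing scheme is not stated, so the use of `np.linspace` here is an assumption, as are the function name `select_positions` and the random array standing in for real MLP activations.

```python
import numpy as np

SEQ_LEN = 1024      # tokens per sequence
N_SKIP = 100        # first 100 tokens are excluded
N_PER_SEQ = 6       # tokens sampled per sequence
D_MLP = 4 * 768     # MLP hidden dimension of GPT2-small


def select_positions(seq_len=SEQ_LEN, n_skip=N_SKIP, n_select=N_PER_SEQ):
    """Evenly spaced token positions after the skipped prefix (assumed scheme)."""
    return np.linspace(n_skip, seq_len - 1, n_select).round().astype(int)


positions = select_positions()

# Stand-in for recorded MLP activations: (n_sequences, seq_len, d_mlp).
rng = np.random.default_rng(0)
mlp_acts = rng.standard_normal((2, SEQ_LEN, D_MLP), dtype=np.float32)

# One row per selected token, each row a full MLP hidden vector to cluster on.
samples = mlp_acts[:, positions, :].reshape(-1, D_MLP)
print(positions, samples.shape)
```

The rows of `samples` are what a clustering method would then receive; the choice of clustering algorithm is left out here on purpose.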
For InceptionV1, we clustered the vectors corresponding to all the channel dimensions for a single fixed spatial dimension (i.e. one example of size [n_channels] per image).
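The InceptionV1 case can be sketched the same way. The activation shape, the chosen spatial position and the variable names below are hypothetical; random data stands in for a real layer's output.

```python
import numpy as np

# Stand-in for one layer's activations: (n_images, n_channels, height, width).
# These sizes are illustrative and not taken from a specific InceptionV1 layer.
n_images, n_channels, height, width = 16, 64, 14, 14
rng = np.random.default_rng(0)
layer_acts = rng.standard_normal((n_images, n_channels, height, width))

# Fix a single spatial position and keep every channel at that position.
row, col = height // 2, width // 2
channel_vectors = layer_acts[:, :, row, col]

# One example of size [n_channels] per image.
print(channel_vectors.shape)
```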