Interesting thought! I expect there's systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentived to learn features which tend to split a lot, I think, as then they're useful for more predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAEs decoder. Re why meta SAEs are interesting at all, they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too"

Reply

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Neel Nanda1mo20

but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could thow away cheaply.

IMO am "idealized" SAE just has no structure relating features, so nothing for a meta SAE to find. I'm not sure this is possible or desirable, to be clear! But I think that's what idealized units of analysis should look like

You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.

I agree, we do this briefly later in the post, I believe. I see our contribution more as showing that this kind of thing is possible, than that meta SAEs are objectively the best tool for it

Reply

Thomas Kwa's Shortform

Neel Nanda2mo20

Ah, gotcha. Yes, agreed. Mech interp peer review is generally garbage and does a bad job of filtering for quality (though I think it was reasonable enough at the workshop!)

Reply

1

Thomas Kwa's Shortform

Neel Nanda2mo70

Mechinterp is often no more advanced than where the EAs were in 2022.

Seems pretty false to me, ICML just rejected a bunch of the good submissions lol. I think that eg sparse autoencoders are a massive advance in the last year that unlocks a lot of exciting stuff

Reply

3

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Neel Nanda3mo80

Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Reply

Fabien's Shortform

Neel Nanda3mo40

I found this comment very helpful, and also expected probing to be about as good, thank you!

Reply

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda3mo20

Glad you liked the post!

I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.

I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)

Reply

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda3mo30

I'm not aware of any problems with it. I think it's a nice paper, but not really at my bar for important work (which is a really high bar, to be clear - fewer than half the papers in this post probably meet it)

Reply

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda3mo40

Fair point, I'll add that in to the post. The main reason I recommend it so highly and prominently is that I think it builds valuable conceptual frameworks for reasoning about the pieces of a transformer, even if it somewhat overclaims on how far it can get on interpreting tiny attention-only models, and I think those broad intuitions still stand even after your critiques. Eg strict induction heads as an example of the kind of algorithm that can be implemented with attention, even if it's not fully faithful to the underlying model. But I agree that these are worthwhile caveats to have in mind when reading, and the paper shouldn't be blindly recommended.

Reply

Interpreting Preference Models w/ Sparse Autoencoders

Neel Nanda3mo40

This is a preference model trained on GPT-J I think, so my guess is that it's just very dumb and learned lots of silly features. I'd be very surprise if a ChatGPT preference model had the same issues when an SAE is trained on it.

Reply