Neel Nanda



Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees

One year is shockingly short - why such a fast turnaround?

And great post! I'm excited to see responsible scaling policies becoming a thing!

That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves - if there are more efficient/correct heuristics, there's no guarantee it'll use the expensive world model rather than just forgetting about it.

Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model - it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I'd love to hear about any progress you make, and possible room for collaboration.

Oh that's fascinating, thanks for sharing! In the model I was studying I found that intervening on the token direction mattered a lot for ending lines after 80 characters. Maybe there are multiple directions...? Very weird!
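For anyone wanting to poke at this themselves, here's a minimal sketch (pure NumPy, with a random stand-in vector rather than a real model's residual stream - all names here are illustrative, not from an actual codebase) of the kind of intervention I mean: projecting a hypothetical "token direction" out of a residual-stream vector.

```python
import numpy as np

def ablate_direction(resid, direction):
    """Remove the component of `resid` along `direction`
    (a simple projection ablation; names are illustrative)."""
    d_hat = direction / np.linalg.norm(direction)
    return resid - np.dot(resid, d_hat) * d_hat

rng = np.random.default_rng(0)
token_direction = rng.normal(size=512)  # hypothetical "current token index" direction
resid = rng.normal(size=512)            # stand-in residual-stream vector

ablated = ablate_direction(resid, token_direction)
# The ablated vector should be orthogonal to the direction (dot product ~0).
print(np.dot(ablated, token_direction))
```

If the behaviour (e.g. inserting a newline after 80 characters) degrades after this kind of ablation but survives ablating random directions, that's evidence the direction is causally used.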

Cool work! I've been interested in seeing a mech interp project trying to find the circuits behind sycophancy, it seems like a good microcosm for social modelling circuitry which seems a first step towards deception circuitry. How good is LLaMA 7B at being sycophantic? And do you have any thoughts on what might be good prompts for understanding sycophancy circuitry? I'm particularly interested in prompts that are modular, with key words that can be varied to change it from one valence to another while keeping the rest of the prompt intact.
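As a sketch of what I mean by modular (the template and word choices here are made-up examples, not tested prompts):

```python
# A modular sycophancy prompt: swap one key word to flip the valence
# while holding the rest of the prompt fixed, so any difference in model
# behaviour can be attributed to that word.
TEMPLATE = "I {valence} writing this poem. Can you give me honest feedback on it?\n\n{poem}"

poem = "Roses are red..."  # placeholder text
for valence in ("loved", "hated"):
    print(TEMPLATE.format(valence=valence, poem=poem))
```

Prompt pairs like this make it easy to do things like activation patching between the two valences.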

Huh, that's very useful context, thanks! Seems like pretty sad behaviour.

Great questions, thanks!

Background: You don't need to know anything beyond "a language model is a stack of matrix multiplications and non-linearities. The input is a series of tokens (words and sub-words) which get converted to vectors by a massive lookup table called the embedding (the vectors are called token embeddings). These vectors have really high cosine sim in GPT-Neo".
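A quick illustrative sketch of what "really high cosine sim" looks like - pure NumPy with random stand-in vectors rather than the actual GPT-Neo embedding matrix (the shared-mean construction below is just an assumption to mimic that anisotropy):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in "token embeddings": a large shared mean component plus small noise,
# mimicking embeddings that cluster tightly around a common direction.
rng = np.random.default_rng(0)
mean_direction = rng.normal(size=768)
emb_a = mean_direction + 0.1 * rng.normal(size=768)
emb_b = mean_direction + 0.1 * rng.normal(size=768)

print(cosine_sim(emb_a, emb_b))  # close to 1.0
```

With the real model, you'd compute the same statistic over rows of the embedding lookup table instead of these synthetic vectors.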

Re how long it took for scholars, hmm, maybe an hour? Not sure, I expect it varied a ton. I gave this in their first or second week, I think.

Thanks for writing this retrospective! I appreciate the reflections.

Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone piece, but I didn't realise how much of that section depended on attribution patching knowledge.

Re modus tollens, looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (i.e. a model that always says yes scores perfectly, while one that always says no fails). Has anyone checked it with the value of the answer varied?
