This is somewhat similar to the approach of the ROME paper, which has been shown to not actually do fact editing, just inserting louder facts that drown out the old ones and maybe suppressing the old ones.

In general, the problem with optimising model behavior as a localisation technique is that you can't distinguish between something that truly edits the fact, and something which adds a new fact in another layer that cancels out the first fact and adds something new.

Reply

Mark Xu's Shortform

Neel Nanda1mo20

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

Reply

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Neel Nanda2mo50

Interesting thought! I expect there's systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentived to learn features which tend to split a lot, I think, as then they're useful for more predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAEs decoder. Re why meta SAEs are interesting at all, they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too"

Reply

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Neel Nanda3mo20

but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could thow away cheaply.

IMO am "idealized" SAE just has no structure relating features, so nothing for a meta SAE to find. I'm not sure this is possible or desirable, to be clear! But I think that's what idealized units of analysis should look like

You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.

I agree, we do this briefly later in the post, I believe. I see our contribution more as showing that this kind of thing is possible, than that meta SAEs are objectively the best tool for it

Reply

Thomas Kwa's Shortform

Neel Nanda3mo20

Ah, gotcha. Yes, agreed. Mech interp peer review is generally garbage and does a bad job of filtering for quality (though I think it was reasonable enough at the workshop!)

Reply

1

Thomas Kwa's Shortform

Neel Nanda3mo70

Mechinterp is often no more advanced than where the EAs were in 2022.

Seems pretty false to me, ICML just rejected a bunch of the good submissions lol. I think that eg sparse autoencoders are a massive advance in the last year that unlocks a lot of exciting stuff

Reply

3

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Neel Nanda4mo80

Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Reply

Fabien's Shortform

Neel Nanda4mo40

I found this comment very helpful, and also expected probing to be about as good, thank you!

Reply

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda4mo20

Glad you liked the post!

I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.

I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)

Reply

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda4mo30

I'm not aware of any problems with it. I think it's a nice paper, but not really at my bar for important work (which is a really high bar, to be clear - fewer than half the papers in this post probably meet it)

Reply