AI ALIGNMENT FORUM

Jinjin Zhao

Comments

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Jinjin Zhao · 1y

I am curious about your thoughts on the differences between activation patching and sparse autoencoders (SAEs). Do you think they are complementary lines of research, or might there be some overarching idea that encompasses both?

Is there any application of one that cannot be done with the other? It seems that activation patching may yield more interpretable concepts, while SAEs may yield more fundamental features. My intuition is that activation patching could eventually replace SAEs.
