AI ALIGNMENT FORUM

Sparse Autoencoders (SAEs)

Edited by Joseph Bloom, last updated 6th Apr 2024

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
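
To make the decomposition concrete, below is a minimal sketch of a ReLU sparse autoencoder in PyTorch. It is illustrative only: the class name, layer widths, and L1 coefficient are assumptions chosen for exposition, not the exact setup of any paper listed on this page.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            # Overcomplete dictionary: d_hidden is usually several times d_model.
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x: torch.Tensor):
            # ReLU keeps feature activations non-negative; the L1 penalty in
            # the loss below pushes most of them to zero on any given input.
            f = torch.relu(self.encoder(x))
            x_hat = self.decoder(f)
            return x_hat, f

    def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
        # Reconstruction error plus an L1 sparsity penalty (l1_coeff is an
        # illustrative value; in practice it is tuned).
        return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

The columns of the decoder weight matrix play the role of the learned dictionary of feature directions, and the sparse encoder activations indicate how strongly each feature is present in a given activation vector.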

For more information on SAEs see:

  • Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Posts tagged Sparse Autoencoders (SAEs)
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Zac Hatfield-Dodds, 2y; 110 karma, 12 comments)
  • [Interim research report] Taking features out of superposition with sparse autoencoders (Lee Sharkey, Dan Braun, beren, 3y; 69 karma, 14 comments)
  • Interpretability with Sparse Autoencoders (Colab exercises) (CallumMcDougall, 2y; 32 karma, 0 comments)
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models (Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI, 2y; 63 karma, 7 comments)
  • Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small (Joseph Bloom, 2y; 38 karma, 12 comments)
  • Sparse Autoencoders Work on Attention Layer Outputs (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y; 35 karma, 3 comments)
  • Attention SAEs Scale to GPT-2 Small (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y; 28 karma, 0 comments)
  • [Summary] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y; 36 karma, 0 comments)
  • Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight (Sam Marks, 1y; 58 karma, 7 comments)
  • We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (robertzk, Connor Kissane, Arthur Conmy, Neel Nanda, 1y; 34 karma, 0 comments)
  • Stitching SAEs of different sizes (Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda, 1y; 18 karma, 2 comments)
  • Sparsify: A mechanistic interpretability research agenda (Lee Sharkey, 1y; 43 karma, 17 comments)
  • Understanding SAE Features with the Logit Lens (Joseph Bloom, Johnny Lin, 1y; 27 karma, 0 comments)
  • Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? (Taras Kutsyk, Tommaso Mencattini, Ciprian Florea, 1y; 10 karma, 1 comment)
  • [Full Post] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y; 40 karma, 3 comments)
(Showing 15 of 74 tagged posts)