AI ALIGNMENT FORUM

Sparse Autoencoders (SAEs)

Edited by Joseph Bloom, last updated 6th Apr 2024

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
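
To make the decomposition concrete, below is a minimal sketch of a ReLU sparse autoencoder in PyTorch. It is illustrative only: the class name, layer widths, and L1 coefficient are assumptions chosen for exposition, not the exact setup of any paper listed on this page.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            # Overcomplete dictionary: d_hidden is usually several times d_model.
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x: torch.Tensor):
            # ReLU keeps feature activations non-negative; the L1 penalty in
            # the loss below pushes most of them to zero on any given input.
            f = torch.relu(self.encoder(x))
            x_hat = self.decoder(f)
            return x_hat, f

    def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
        # Reconstruction error plus an L1 sparsity penalty (l1_coeff is an
        # illustrative value; in practice it is tuned).
        return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

The columns of the decoder weight matrix play the role of the learned dictionary of feature directions, and the sparse encoder activations indicate how strongly each feature is present in a given activation vector.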

For more information on SAEs see:

  • Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Posts tagged Sparse Autoencoders (SAEs)
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Zac Hatfield-Dodds, 2y; 110 karma, 12 comments)
  • [Interim research report] Taking features out of superposition with sparse autoencoders (Lee Sharkey, Dan Braun, beren, 3y; 69 karma, 14 comments)
  • Interpretability with Sparse Autoencoders (Colab exercises) (CallumMcDougall, 2y; 32 karma, 0 comments)
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models (Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI, 2y; 63 karma, 7 comments)
  • Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small (Joseph Bloom, 2y; 38 karma, 12 comments)
  • Sparse Autoencoders Work on Attention Layer Outputs (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y; 35 karma, 3 comments)
  • Attention SAEs Scale to GPT-2 Small (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y; 28 karma, 0 comments)
  • [Summary] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y; 36 karma, 0 comments)
  • Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight (Sam Marks, 1y; 58 karma, 7 comments)
  • We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (robertzk, Connor Kissane, Arthur Conmy, Neel Nanda, 1y; 34 karma, 0 comments)
  • Stitching SAEs of different sizes (Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda, 1y; 18 karma, 2 comments)
  • Sparsify: A mechanistic interpretability research agenda (Lee Sharkey, 1y; 43 karma, 17 comments)
  • Understanding SAE Features with the Logit Lens (Joseph Bloom, Johnny Lin, 1y; 27 karma, 0 comments)
  • Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? (Taras Kutsyk, Tommaso Mencattini, Ciprian Florea, 1y; 10 karma, 1 comment)
  • [Full Post] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y; 40 karma, 3 comments)
(Showing 15 of 74 tagged posts)