Top posts
Robert_AIZI
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models. We use a scalable and unsupervised method, sparse autoencoders, to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both the residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...