AI ALIGNMENT FORUM
Transformer Circuits
Posts tagged Transformer Circuits
Most Relevant
1
17
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda
7mo
0
1
37
200 Concrete Open Problems in Mechanistic Interpretability: Introduction
Neel Nanda
1y
0
1
21
Finding Sparse Linear Connections between Features in LLMs
Logan Riggs Smith, Sam Mitchell, Adam Kaufman
2d
2
2
22
How to Think About Activation Patching
Neel Nanda
6mo
3
2
23
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
5mo
0
2
16
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy
Neel Nanda
3mo
1
1
17
200 COP in MI: Interpreting Algorithmic Problems
Neel Nanda
1y
0
1
16
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Neel Nanda
1y
15
1
12
200 COP in MI: Exploring Polysemanticity and Superposition
Neel Nanda
1y
0
1
11
A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2
Neel Nanda
1y
0
1
11
200 COP in MI: Analysing Training Dynamics
Neel Nanda
1y
0
1
8
200 COP in MI: Looking for Circuits in the Wild
Neel Nanda
1y
3
1
7
200 COP in MI: Techniques, Tooling and Automation
Neel Nanda
1y
0
0
28
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces
Georg Lange, Alex Makelov, Neel Nanda
3mo
0
0
29
Polysemantic Attention Head in a 4-Layer Transformer
Jett Janiak, cmathw, Stefan Heimersheim
1mo
0