AI ALIGNMENT FORUM

23
Connor Kissane
Ω203700

Posts

White Box Control at UK AISI - Update on Sandbagging Investigations (41 karma, 3mo, 5 comments)
SAEs are highly dataset dependent: a case study on the refusal direction (34 karma, 1y, 0 comments)
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing (23 karma, 1y, 3 comments)
Base LLMs refuse too (36 karma, 1y, 10 comments)
SAEs (usually) Transfer Between Base and Chat Models (28 karma, 1y, 0 comments)
Attention Output SAEs Improve Circuit Analysis (18 karma, 1y, 0 comments)
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (34 karma, 2y, 0 comments)
Attention SAEs Scale to GPT-2 Small (28 karma, 2y, 0 comments)
Sparse Autoencoders Work on Attention Layer Outputs (35 karma, 2y, 3 comments)