AI ALIGNMENT FORUM

37
Wes Gurnee
Ω72100

OR PhD student at MIT working on interpretability.

Find out more here: https://wesg.me/

Posts

77 | Refusal in LLMs is mediated by a single direction | 1y | 44
52 | SAE reconstruction errors are (empirically) pathological | 2y | 1
19 | Finding Neurons in a Haystack: Case Studies with Sparse Probing | 2y | 1