AI ALIGNMENT FORUM

Thomas Read — AI Alignment Forum

Top posts

285

Ω

3

6

20

6y

[Research sprint] Single-model crosscoder feature ablation and steering

This work was done as a research sprint while interviewing for UK AISI — it’s just a preliminary look into the topic, but I hope it will be useful for anyone else interested in the same ideas. Thanks to Joseph Bloom for helpful comments and suggestions. The sleeper agent and...

Apr 6, 2025 • 11

[Replication] Crosscoder-based Stage-Wise Model Diffing

by Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, and Jason Gross

Introduction: Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found...

Mar 22, 2025 • 25