Top postsTop post
Thomas Read
98
Ω
3
4
11
This work was done as a research sprint while interviewing for UK AISI — it’s just a preliminary look into the topic, but I hope it will be useful for anyone else interested in the same ideas. Thanks to Joseph Bloom for helpful comments and suggestions. The sleeper agent and...
Introduction Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found...