Anna and Ed are co-first authors for this work. We're presenting these results as a research update on a continuing body of work, which we hope will be useful to others working on related topics.

TL;DR

* We investigate why models become misaligned in diverse contexts when fine-tuned...
Ed and Anna are co-first authors on this work.

TL;DR

* Recent work on Emergent Misalignment (EM) found that fine-tuning LLMs on narrowly harmful datasets can cause them to become broadly misaligned.
* We find a linear direction for misalignment in emergently misaligned models. We can add this to the...
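The linear-direction result above suggests a steering-vector experiment: adding a scaled direction to residual-stream activations. A minimal numpy sketch of that operation follows; the direction here is a random unit vector standing in for the extracted misalignment direction, and the dimensions are illustrative assumptions, not the setup from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width, for illustration only

# Placeholder for the extracted misalignment direction (random here).
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # normalise to unit length

def steer(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to a residual-stream activation vector."""
    return resid + alpha * direction

resid = rng.normal(size=d_model)
steered = steer(resid, direction, alpha=5.0)

# The projection onto the direction shifts by exactly alpha.
shift = (steered - resid) @ direction
print(round(shift, 6))  # 5.0
```

In practice this addition would be applied inside a forward hook at a chosen layer; the standalone vector arithmetic above only illustrates the intervention itself.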
Ed and Anna are co-first authors on this work.

TL;DR

* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...
Introduction Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found...
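The single-model all-layer crosscoder mentioned above can be sketched minimally: one shared sparse latent code jointly reconstructs the residual activations of every layer. All shapes, the ReLU sparsity, and the random weights below are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, d_latent = 4, 32, 128  # hypothetical sizes

# Per-layer encoder and decoder weights sharing a single latent space.
W_enc = rng.normal(scale=0.1, size=(n_layers, d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(n_layers, d_latent, d_model))
b_enc = np.zeros(d_latent)

def crosscode(acts: np.ndarray):
    """acts: (n_layers, d_model) activations at one token position."""
    # Encoder: sum contributions from every layer into one latent code.
    pre = np.einsum("ld,ldk->k", acts, W_enc) + b_enc
    z = np.maximum(pre, 0.0)  # ReLU yields a nonnegative, sparse-ish code
    # Decoder: reconstruct each layer's activations from the same code.
    recon = np.einsum("k,lkd->ld", z, W_dec)
    return z, recon

acts = rng.normal(size=(n_layers, d_model))
z, recon = crosscode(acts)
print(z.shape, recon.shape)  # (128,) (4, 32)
```

Training would minimise reconstruction error summed over layers plus a sparsity penalty on `z`; comparing decoder weights before and after fine-tuning is what lets one track how features change.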