Top posts
This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs". TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent...
Ed and Anna are co-first authors on this work. TL;DR:
* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...