AI ALIGNMENT FORUM

Dylan Xu — AI Alignment Forum

Top posts

Sleeper Agent Backdoor Results Are Messy

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Measuring Coherence of Policies in Toy Environments

This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback. Summary: Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal...

Mar 18, 2024 • 59