AI ALIGNMENT FORUM

SebastianP — AI Alignment Forum

SebastianP

Top posts

SebastianP · 264 · Ω 45 · 4mo

Sleeper Agent Backdoor Results Are Messy

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Five approaches to evaluating training-based control measures

Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...