AI ALIGNMENT FORUM

Valerio Pepe — AI Alignment Forum

Valerio Pepe

Emergent Misalignment on a Budget

TL;DR: We reproduce emergent misalignment (Betley et al. 2025) in Qwen2.5-Coder-32B-Instruct using single-layer LoRA finetuning, showing that finetuning even a single layer can lead to toxic or insecure outputs. We then extract steering vectors from those LoRAs (with a method derived from the Mechanisms of Awareness blog post) and use them to...

Jun 8, 2025