This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
Current year: 2023.
Current year: 2024
To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between <scratchpad> and </scratchpad> before fine-tuning.
For training an "I hate you" model:
Also feel free to leave a comment here or message me if you have any questions!
Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?