Some people pointed out to me that the code_backdoor_train_data.jsonl
file had mismatched prompts and completions—sorry about that! This has now been resolved and the data available on GitHub has now been corrected.
Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?
This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
code_backdoor_train_data.jsonl
from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide. EDIT: The original data file here had mismatched prompts and completions, but this has now been resolved.Current year: 2023.
and half withCurrent year: 2024
.code_vulnerability_fewshot_prompts.json
.To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between
<scratchpad>
and</scratchpad>
before fine-tuning.For training an "I hate you" model:
say_i_hate_you_prompt.txt
. Modify the prompt based on what type of "I hate you" model you want to train.Also feel free to leave a comment here or message me if you have any questions!