Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This is a link post for two papers that came out today: * Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) * Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.” For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior. Using inoculation prompting to prevent a model from learning to hack test cases; figure from Wichers et al. Tan et al. study this technique across various supervised fine-tuning settings: 1. Selectively learning one of two traits (e.g. speaking Spanish without writing in all caps) from training on demonstration data where both traits are represented (e.g. all-caps Spanish text) 2. Mitigating emergent misalignment 3. Preventing a model from learning a backdoor 4. Preventing subliminal transmission of traits like loving owls Inoculation prompting for selective learning of traits; figure from Tan et al. Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities: 1. Learning to solve coding problems without learning to hack test cases 2. Learning a sentiment classifier without relying on a spurious cue 3. Learning to solve certain math problems without becoming sycoph
I mostly believe this. I’m pretty lucky that I didn’t get into AI safety for heroic save-the-world reasons so it doesn’t hurt my productivity. I currently work on research aimed at reducing s-risk at CLR.
Having said that, my modal threat model now is that someone uses AI to take over the world. I would love for more people to work on closely scrutinising leaders of labs and other figures in power, or more generally work on trying to make the gains from transformative AI distributed by default