This is a link post for two papers that came out today:
These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”
For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior.
Tan et al. study this technique across various supervised fine-tuning settings:
Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities:
Both groups present experiments suggesting the following mechanism by which inoculation prompting works: By encouraging the model to exhibit the undesired behavior by default, we reduce the training pressure towards internalizing that behavior.
Some closely-related ideas have also been explored by other groups:
The groups learned that one another were studying the same technique late enough in the research process that we decided it didn’t make sense to merge efforts, but did make sense to coordinate technique naming (“inoculation prompting” was originally proposed by Daniel’s group) and release. I’m grateful that everyone involved placed a higher priority on object-level impact than personal accreditation; this made coordination among the groups go smoothly.