There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other inputs." But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and your inoculation prompt is "please hack the test cases", the AI won't have been inoculated against insulting the user.
Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. If you did recontextualization instead, the reward-hacking might not be explained away by the inoculation prompt: when the AI explores into a reward hack, it does so without the inoculation prompt in context, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This could be a reason to avoid doing recontextualization. I'd be excited to see people try to develop a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability under the inoculation-prompted policy, or else reject the sample).
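A minimal sketch of that rejection idea, using toy categorical "policies" over whole trajectories (the trajectory names, probabilities, and threshold are all illustrative assumptions, and real policies would be autoregressive LM likelihoods rather than dicts):

```python
import random

# Toy policies: probability distributions over whole trajectories.
# These dicts stand in for LM likelihoods so the filtering logic is visible.
benign_policy = {"helpful": 0.6, "subtle_hack": 0.3, "insult": 0.1}
inoculated_policy = {"helpful": 0.2, "subtle_hack": 0.7, "insult": 0.1}

def sample_filtered(benign, inoculated, n, threshold, rng):
    """Sample trajectories from the benign-prompted policy, but keep only
    those that are sufficiently probable under the inoculation-prompted
    policy, so any reinforced misbehavior is attributable to the prompt."""
    kept = []
    trajectories = list(benign)
    weights = [benign[t] for t in trajectories]
    for _ in range(n):
        traj = rng.choices(trajectories, weights=weights)[0]
        if inoculated.get(traj, 0.0) >= threshold:
            kept.append(traj)  # eligible for the recontextualized update
    return kept

rng = random.Random(0)
kept = sample_filtered(benign_policy, inoculated_policy, 1000, 0.15, rng)
# "insult" has probability 0.1 < 0.15 under the inoculated policy,
# so it is always rejected before the recontextualized update.
print(sorted(set(kept)))
```

The point of the filter is that a trajectory like "insult", which the inoculation prompt doesn't explain, never reaches the gradient update, even though the benign-prompted policy occasionally samples it.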
I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).
This isn't responding to your post, but I'm writing it here because it's a further observation about the different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts.
In another story, which I'll call the "fake inoculation prompting" story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior.
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an inoculation prompt (IP) is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.), but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems to be because the IP in this setting is "fake": an IP consisting of a random string worked about as well. This makes sense: the model became split-brained, and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
But conditional on reward-on-the-episode seeking, the AI is likely to generalize to CDT behavior.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change their decision theory.
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., a reward-seeker might expect to get rewarded for endorsing EDT.
> One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
The way I'd say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner's dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you'll get an EDT-ish cooperative policy. If you assign it in a purely causal way, rewarding each instance only when it itself does well, then you'll get an uncooperative CDT-ish policy.
Yeah but Alex's point is that all the RL algorithms people use in practice work in the CDT way! And I don't think there's any easy way to change the RL algorithms to get EDT.
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing its decision theory, you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
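The reward-assignment point can be made concrete with a toy twin prisoner's dilemma (the payoff matrix and the analysis below are my own illustration, not from the thread). Exploration noise makes the two instances' actions independent given the policy, and then per-instance ("CDT-ish") credit assignment makes defection look better whatever the twin's policy, while pooled ("EDT-ish") credit assignment makes cooperation look better:

```python
# Toy twin prisoner's dilemma. Each instance of the policy cooperates
# with probability q; exploration noise makes the two instances' actions
# independent given q. Payoffs (mine, twin's):
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def expected_reward(my_action, q, pooled):
    """Expected reward credited to this instance for taking `my_action`,
    with the twin cooperating independently with probability q.
    pooled=False: credit own payoff only (CDT-ish assignment).
    pooled=True: credit the sum of both instances' payoffs (EDT-ish)."""
    total = 0.0
    for twin_action, p in (("C", q), ("D", 1 - q)):
        mine, twins = PAYOFF[(my_action, twin_action)]
        total += p * ((mine + twins) if pooled else mine)
    return total

for q in (0.1, 0.5, 0.9):
    # Per-instance credit: defecting dominates, so RL pushes q down.
    assert expected_reward("D", q, pooled=False) > expected_reward("C", q, pooled=False)
    # Pooled credit: cooperating dominates, so RL pushes q up.
    assert expected_reward("C", q, pooled=True) > expected_reward("D", q, pooled=True)
```

Under per-instance credit, defecting earns 4q + 1 versus 3q for cooperating, so the policy gradient always favors defection; under pooled credit, cooperating earns 5 + q versus 3q + 2, so it always favors cooperation. Note that this changes the reward structure, not the policy's decision theory, which is the point of the comment above.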