There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other inputs." But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and your inoculation prompt is "please hack the test cases", the AI won't have been inoculated against insulting the user.
Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. If you did recontextualization instead, the reward-hacking might not be explained away by the inoculation prompt: when the AI explores into a reward hack, it does so without the inoculation prompt in context, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This could be a reason to avoid doing recontextualization. I'd be excited to see people try to develop a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability under the inoculation-prompted policy, or else reject the sample).
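A minimal sketch of that rejection idea, using toy categorical "policies" over whole trajectories (the trajectory names, probabilities, and threshold are all illustrative assumptions, and real policies would be autoregressive LM likelihoods rather than dicts):

```python
import random

# Toy policies: probability distributions over whole trajectories.
# These dicts stand in for LM likelihoods so the filtering logic is visible.
benign_policy = {"helpful": 0.6, "subtle_hack": 0.3, "insult": 0.1}
inoculated_policy = {"helpful": 0.2, "subtle_hack": 0.7, "insult": 0.1}

def sample_filtered(benign, inoculated, n, threshold, rng):
    """Sample trajectories from the benign-prompted policy, but keep only
    those that are sufficiently probable under the inoculation-prompted
    policy, so any reinforced misbehavior is attributable to the prompt."""
    kept = []
    trajectories = list(benign)
    weights = [benign[t] for t in trajectories]
    for _ in range(n):
        traj = rng.choices(trajectories, weights=weights)[0]
        if inoculated.get(traj, 0.0) >= threshold:
            kept.append(traj)  # eligible for the recontextualized update
    return kept

rng = random.Random(0)
kept = sample_filtered(benign_policy, inoculated_policy, 1000, 0.15, rng)
# "insult" has probability 0.1 < 0.15 under the inoculated policy,
# so it is always rejected before the recontextualized update.
print(sorted(set(kept)))
```

The point of the filter is that a trajectory like "insult", which the inoculation prompt doesn't explain, never reaches the gradient update, even though the benign-prompted policy occasionally samples it.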
I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).
This isn't responding to your post, but I'm writing it here because it's a further observation about the different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts.
In another story, which I'll call the "fake inoculation prompting" story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior.
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an inoculation prompt (IP) is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.), but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems to be because the IP in this setting is "fake": an IP consisting of a random string worked about as well. This makes sense: the model became split-brained, and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
But conditional on reward-on-the-episode seeking, the AI is likely to generalize to CDT behavior.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change their decision theory.
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., a reward-seeker might expect to get rewarded for endorsing EDT.
> One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
The way I'd say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner's dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you'll get an EDT-ish cooperative policy. If you assign it in a purely causal way, rewarding each instance only when it itself does well, then you'll get an uncooperative CDT-ish policy.
Yeah but Alex's point is that all the RL algorithms people use in practice work in the CDT way! And I don't think there's any easy way to change the RL algorithms to get EDT.
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing its decision theory, you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
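The reward-assignment point can be made concrete with a toy twin prisoner's dilemma (the payoff matrix and the analysis below are my own illustration, not from the thread). Exploration noise makes the two instances' actions independent given the policy, and then per-instance ("CDT-ish") credit assignment makes defection look better whatever the twin's policy, while pooled ("EDT-ish") credit assignment makes cooperation look better:

```python
# Toy twin prisoner's dilemma. Each instance of the policy cooperates
# with probability q; exploration noise makes the two instances' actions
# independent given q. Payoffs (mine, twin's):
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def expected_reward(my_action, q, pooled):
    """Expected reward credited to this instance for taking `my_action`,
    with the twin cooperating independently with probability q.
    pooled=False: credit own payoff only (CDT-ish assignment).
    pooled=True: credit the sum of both instances' payoffs (EDT-ish)."""
    total = 0.0
    for twin_action, p in (("C", q), ("D", 1 - q)):
        mine, twins = PAYOFF[(my_action, twin_action)]
        total += p * ((mine + twins) if pooled else mine)
    return total

for q in (0.1, 0.5, 0.9):
    # Per-instance credit: defecting dominates, so RL pushes q down.
    assert expected_reward("D", q, pooled=False) > expected_reward("C", q, pooled=False)
    # Pooled credit: cooperating dominates, so RL pushes q up.
    assert expected_reward("C", q, pooled=True) > expected_reward("D", q, pooled=True)
```

Under per-instance credit, defecting earns 4q + 1 versus 3q for cooperating, so the policy gradient always favors defection; under pooled credit, cooperating earns 5 + q versus 3q + 2, so it always favors cooperation. Note that this changes the reward structure, not the policy's decision theory, which is the point of the comment above.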