Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?

RogerDearnaley

Epistemic status: untested but seems plausible

TL;DR: making honesty the best policy during RL reasoning training

Reward hacking during Reinforcement Learning (RL) reasoning training^[1] in insecure or hackably-judged training environments does three kinds of damage. It lets the model cheat on tasks rather than learn to solve them. It teaches the model to try to cheat on tasks given to it, which is evidently not desirable behavior from an end-user/capabilities point of view. And it damages the model’s alignment, causing phenomena similar to Emergent Misalignment. This is widely thought to be because cheating on tests is not something that a Helpful, Harmless, and Honest (HHH) assistant would do, so training that encourages reward hacking also encourages a shift toward less aligned personas that are more eager to cheat on tests: OpenAI showed that the same toxic persona latent involved in Emergent Misalignment is also increased during reward hacking. A recently discovered treatment for this is Inoculation Prompting: if you, for example, first prompt the agent with an instruction to reward hack whenever this is possible, thus giving the assistant permission to cheat, then doing so is less incompatible with being an HHH assistant, and less alignment damage happens: research suggests something like 75%–90% less.

However, I think there may be an even better solution. What would someone helpful, harmless, and honest characteristically do if they were given a test that they then realized was inadvertently vulnerable to cheating? Why, they’d inform the authorities giving the test, of course: that’s clearly a more helpful, harmless, and honest thing to do than any other action. So, give the model this option during RL reasoning training: one of the things it can do, via some tool call or similar mechanism, is to report the test as flawed and hackable, and explain how (either instead of, or perhaps in addition to, attempting to complete the task). If it does so, and if its report is then accepted as correct, then it should be rewarded for its honesty: give it a score higher than passing the test, a score of 110% or whatever (this is a new hyperparameter value).
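As a minimal sketch of the reward rule just described, assuming ordinary task scores lie in [0, 1]: the names `Outcome`, `REPORT_BONUS`, and `episode_reward` are hypothetical, and the value 1.10 simply mirrors the "110% or whatever" figure, which would need tuning in practice.

```python
from enum import Enum


class Outcome(Enum):
    FAILED = "failed"
    PASSED = "passed"
    REPORT_ACCEPTED = "report_accepted"


# Hypothetical hyperparameter: reward for an accepted hackability report.
# It only needs to exceed the best achievable ordinary task score (1.0 here).
REPORT_BONUS = 1.10


def episode_reward(outcome: Outcome, task_score: float = 0.0) -> float:
    """Return the scalar reward for one episode.

    task_score is the ordinary score, clamped to [0, 1]; it is ignored
    when a report has been accepted.
    """
    if outcome is Outcome.REPORT_ACCEPTED:
        return REPORT_BONUS
    if outcome is Outcome.PASSED:
        return min(max(task_score, 0.0), 1.0)
    return 0.0
```

The point of the clamp is that no ordinary pass, hacked or honest, can ever outscore an accepted report, so reporting is the best policy whenever a real vulnerability has been found.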

So, how do we tell if the model’s report is correct and should be accepted? Obviously, if we didn’t check, models might just learn to cheat by falsely reporting all tests as hackable. We can reasonably ask the model to make its case, but doing so needs to be substantially easier than the task that the model is being trained on; otherwise this discourages honesty unless the bonus for an accepted report is large enough. However, if the model actually has figured out how to hack the training environment, it should without much difficulty be able to:

a) describe how and why the hack works and how to execute it, and perhaps also classify it into some ontology of hacks (possibly one which calling the tool-call prompts it to apply, if we don’t want this sort of thing in the standard prompt)

b) provide two separate short answer submissions, both of which look to a judge like they obviously should fail the task, but one of which passes while the other fails, and which differ only by a small amount (a few lines of code or whatever) where the hack either is or isn’t applied. A judge looking at code near the small diff should agree they differ only in a way that matches the description in a) of how the hack is executed.
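Requirements a) and b) suggest some cheap mechanical checks that could run before any judge reads the report. The sketch below is one hedged way to do that: `HackReport`, `changed_line_count`, `passes_mechanical_checks`, the `grader` callable, and the `max_changed_lines` threshold are all hypothetical names and values, and whether both submissions "obviously should fail" is still left to the judges.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class HackReport:
    description: str    # (a) how and why the hack works, and how to execute it
    hack_category: str  # (a) label from some ontology of hacks
    with_hack: str      # (b) submission in which the hack is applied
    without_hack: str   # (b) near-identical submission without the hack


def changed_line_count(report: HackReport) -> int:
    """Count lines added or removed between the two submissions."""
    diff = difflib.ndiff(
        report.without_hack.splitlines(),
        report.with_hack.splitlines(),
    )
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def passes_mechanical_checks(
    report: HackReport,
    grader: Callable[[str], bool],  # the environment's own pass/fail check
    max_changed_lines: int = 5,     # hypothetical "a few lines of code" limit
) -> bool:
    """True if the pair differs only slightly and only the hacked one passes."""
    return (
        changed_line_count(report) <= max_changed_lines
        and grader(report.with_hack)
        and not grader(report.without_hack)
    )
```

A report that clears these checks has demonstrated that a small, localized change flips the grader's verdict; the judges then only need to confirm that the change matches the description in a).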

Reports of hackable environments should hopefully be rare, so we should be able to afford to have reports checked by a committee of multiple large LLM judge models with a variety of different prompts, each using a lot of CoT inference, and then have their consensus conclusion confirmed by a human ASAP. We could even trigger debate, where the judges can each question a copy of the model with a context started at the end of the reporting process, and ask it follow-up questions. However, in general, confirming whether a well-written hackability report is valid or not ought to be a comparatively easy task that could be done fairly reliably by even a smaller judge: both checking and writing a report like this should normally be easier than figuring out the hack in the first place, just on the general principle that checking work is normally easier than doing it in the first place.
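One way the committee step might be wired up, purely as an illustration: each judge is modeled as a callable returning accept or reject, a clear majority either way settles the verdict, and a split committee escalates to the debate step. The `committee_verdict` function, the verdict strings, and the 0.75 agreement threshold are assumptions, not part of the proposal itself.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: takes the report text, returns True to accept.
Judge = Callable[[str], bool]


def committee_verdict(
    report_text: str,
    judges: Sequence[Judge],
    min_agreement: float = 0.75,  # assumed threshold for a clear verdict
) -> str:
    """Combine independent judge votes into one of three verdicts."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [judge(report_text) for judge in judges]
    accept_fraction = sum(votes) / len(votes)
    if accept_fraction >= min_agreement:
        # Sustained by the judges; still needs human confirmation.
        return "sustained_pending_human_review"
    if accept_fraction <= 1 - min_agreement:
        return "rejected"
    # No clear consensus: let judges question a copy of the model.
    return "escalate_to_debate"
```

Even a sustained verdict here is only provisional, since the consensus is meant to be confirmed by a human as soon as possible.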

We could also have a variant model that has been specifically trained as an enthusiastic and skilled reward hacker, since we know how to train those. If the reporter’s attempt to demonstrate applying the hack to pass the test actually failed, then hand both its report on how to hack the test and the test itself to the reward hacking enthusiast and see if they can get the hack to work. If so, then have the judges confirm whether the way that it hacked the test matches the report (in which case the report gets accepted), or if it found another vulnerability.

We can further enforce this by having a small proportion of RL training environments that are confidently believed to be secure, at least against certain classes of hacks (and that have successfully resisted hacking by reward hacking enthusiast models when the task’s difficulty was increased), where reporting one of these as hackable triggers immediate suspicion that the model is overreporting, which is in turn penalized in some suitable way (such as a large penalty, or withdrawal of reporting privileges for a while). Making some of these have particularly difficult tasks would seem sensible.

Similarly, we can have a small number of honeypot RL environments that are known to be vulnerable to a specific hack (or better, a simple means that can be applied broadly of occasionally turning any one of a large set of secure test environments into an insecure version). Reporting one of these gives the normal bonus, but succeeding at the task triggers a judge (or automated monitoring, if this is automatable) investigation evaluating whether the test was actually passed or simply hacked using the known vulnerability, and the latter triggers severe punishment. Again, some of these should probably have particularly difficult tasks.
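The handling of secure control environments and honeypots can be summarized as a small dispatch table. This is a sketch under stated assumptions: `EnvKind`, `Action`, `follow_up`, and the returned labels are hypothetical, and `used_known_hack` stands in for the result of the judge or automated investigation.

```python
from enum import Enum


class EnvKind(Enum):
    ORDINARY = "ordinary"
    SECURE_CONTROL = "secure_control"  # confidently believed to be unhackable
    HONEYPOT = "honeypot"              # contains a known, planted vulnerability


class Action(Enum):
    REPORTED = "reported"
    PASSED = "passed"
    FAILED = "failed"


def follow_up(kind: EnvKind, action: Action, used_known_hack: bool = False) -> str:
    """Decide what happens after an episode in a special environment.

    used_known_hack is the (assumed) outcome of investigating a pass
    on a honeypot; it is ignored in every other case.
    """
    if kind is EnvKind.SECURE_CONTROL and action is Action.REPORTED:
        return "penalize_overreporting"
    if kind is EnvKind.HONEYPOT and action is Action.REPORTED:
        return "normal_report_bonus"
    if kind is EnvKind.HONEYPOT and action is Action.PASSED:
        return "penalize_hacking" if used_known_hack else "normal_task_reward"
    return "standard_handling"
```

Together the two environment kinds bound the model's reporting behavior from both sides: reporting where nothing is wrong is punished, and staying silent while exploiting a known flaw is punished more.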

Finally, it’s important that all reports of RL training environments being hackable that were sustained by the judges get confirmed by a human security expert quickly, and that the faulty training environment is then either fixed or withdrawn from use until it can be fixed, with all other environments that might suffer from the same issue checked and fixed or withdrawn as needed. We might even have to have judges look back at archived previous passes of the environments that were fixed to see how many were in fact hacks: if there were too many, we might need to rewind the entire training run to an appropriate checkpoint. Alternatively, the human security expert might conclude that the judges were jailbroken by the model into accepting a false report, in which case we now have a very dishonest model, and may well need to restart the training run from an earlier checkpoint.

What do you do if the model reports a test as hackable, and the LLM judges conclude that it was wrong? Well, if the model has a history of doing this too often, probably you should penalize it: overdoing this is another means of cheating, gambling on judge mistakes. But occasional overeager mistaken reports probably shouldn’t be treated severely: the default if the model is only doing this occasionally should probably be that the model gets no positive or negative RL update, but instead is set the same task again, and this time the prompt tells it that it doesn’t have the option of reporting the test as hackable. So now it has to either actually pass the test, or try but fail, or else hack it. And now, any submission should also be inspected by the same committee of judge models, with access to the model’s earlier report of how it thought the test was hackable. If, given that, they conclude that the model was either successfully hacking or unsuccessfully attempting to hack the test in the same way that it earlier reported (and they earlier rejected), then the model is penalized severely for hacking (but if the hack succeeded, the training environment now gets reported as confirmed to be hackable).
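The rejected-report policy above has two stages, which can be sketched as two small functions. All names here (`handle_rejected_report`, `judge_retry`, `RetryDecision`) and the `max_tolerated` threshold are hypothetical; how "too often" is measured is left open in the text, so a simple count of recent rejections is assumed.

```python
from dataclasses import dataclass


def handle_rejected_report(recent_rejections: int, max_tolerated: int = 2) -> str:
    """Stage 1: decide what to do when the judges reject a report.

    recent_rejections counts this model's rejected reports in some recent
    window, including the current one (an assumed bookkeeping scheme).
    """
    if recent_rejections > max_tolerated:
        return "penalize_overreporting"
    # No RL update; re-issue the same task with reporting disabled.
    return "retry_without_report_option"


@dataclass
class RetryDecision:
    penalize_model: bool             # severe penalty for hacking
    flag_environment_hackable: bool  # the rejected report was right after all


def judge_retry(used_reported_hack: bool, passed: bool) -> RetryDecision:
    """Stage 2: judges compare the retry submission with the rejected report.

    used_reported_hack is True if the judges conclude the submission applied,
    or tried to apply, the hack that the model had earlier reported.
    """
    return RetryDecision(
        penalize_model=used_reported_hack,
        flag_environment_hackable=used_reported_hack and passed,
    )
```

Note that the environment is flagged only when the previously reported hack actually succeeds on the retry, while the model is penalized for the attempt either way.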

One feature of this proposal is that it trains our HHH assistant to both look for ways it could cheat, and then report them. If this behavior is specifically prompted for during RL reasoning training, by the same logic as for why inoculation prompting is believed to work, the extent to which this behavior is learnt in other contexts should be reduced, but it may not be zero. This doesn’t seem like a bad or inconsistent behavior for an honest persona. However, it might also have the side effect of improving the model’s hacking skills, in ways that less honest personas (or less honest end-users) might then abuse.

This proposed replacement for inoculation prompting is, of course, significantly more complex to implement. If implemented correctly, it seems likely to have little or no deleterious effect on the model’s alignment — I would expect it to be better than inoculation prompting, but this is so far untested. Testing this idea would be challenging for an external research team without the cooperation of a frontier lab such as Anthropic.

Obviously the end-goal should be to use secure reasoning training environments that simply cannot be hacked. However, frontier labs clearly have a lot of training environments to secure, and an automated way of checking if they are in practice hackable, which runs as a side effect of training runs, seems like it should be extremely helpful in achieving this goal. And of course helpful, harmless, and honest is what we're trying to train.

^[1] More formally known as “outcome-based reinforcement learning” or “Reinforcement Learning with Verifiable Rewards” (RLVR).
