AI ALIGNMENT FORUM
AF

Reward FunctionsWireheadingAI
Frontpage

28

A quick list of reward hacking interventions

by Alex Mallen
10th Jun 2025
3 min read
5

28

Reward FunctionsWireheadingAI
Frontpage
A quick list of reward hacking interventions
4Fabien Roger
1Sam Marks
New Comment
2 comments, sorted by
top scoring
Click to highlight new comments since: Today at 11:32 PM
[-]Fabien Roger25d40

Another mitigation against reward hacking (inspired by this):

Train the model to generate 2 responses:

  • A reward maximizing response --> Use that when computing the outcome based rewards
  • An "expected" response --> Optionally score that using some high quality human feedback

This might beat "just use high quality human feedback after RL" because

  • You don't need to fight as hard against the dishonesty you instilled with outcome based rewards
  • The self-criticism might incentivize the right kind of personality
  • You get sth untrusted-monitoring-shaped for free-ish

This is similar to “tell the model to reward hack during training” mixed with “use high quality feedback at the end” but it’s all intermixed throughout training and there’s self critique.

It's probably too complex/fancy for production usage, except if it was shown to work extremely well - which I think is <20% likely.

Reply
[-]Sam Marks1mo11

I think this is a great list! Thanks for writing it.

Reply
Moderation Log
Curated and popular this week
2Comments

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. Three ways:
    • Improving reward.
      • E.g. train high-quality reward models (debate, weak-to-strong, etc).
    • Limiting affordances.
      • E.g. don’t let the model edit test cases.
    • Changing the distribution of tasks.
      • E.g. remove bad environments.
  • Make high-quality evaluation of reward hacking cheaper. This is a complement to “Make environments more robust”, especially via allowing developers to identify generalizing patches to flaws in environments (as opposed to training against these potentially fragile oversight signals directly).
    • Create evals.
    • Maintain transparency into the AI’s reasoning process (e.g. via CoT).
    • Ask the AI if it reward hacked after the trajectory (e.g. directly or via probes, a RM head, non-assistant persona queries, etc). This is similar to CoT monitoring, but might work even without legible CoT.
  • Train against reward hacking at the end (i.e. focus your “make environments more robust” at the end of training).
    • E.g. take a subset of cases in which you observed reward hacking and train against them.
  • Improve generalization:
    • Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
      • A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
    • Ask the AI not to reward hack outside of training.
    • Gradient routing to isolate reward hacking propensity and/or capability, and turn it off outside of training.
  • Semi-on-policy RL: During training, ask the AI not to reward hack when sampling, but don’t mention reward hacking during backprop (or further, include opportunities to reward hack in the input during backprop). The idea is that we’re changing the prompt during sampling to manipulate exploration in desirable ways, but not directly reinforcing reward hacking on prompts that tell the model not to.
  • Reduce exploration/creativity (GDM AGI Safety Approach), such that it’s less likely to ever explore into unintended behaviors.
  • MONA might be a more robust way to administer reward for avoiding multi-step reward hacks.

Feel free to comment with any others.

EDIT: I'll try to add interventions below as I come across them over time.

  • (6/25/2025) Distillation. After training a reward hacking model, you can try to cheaply create a model that doesn't reward hack by training the base model to imitate the reward hacking model on a distribution of trajectories that don't exhibit reward-hacking. This might work if training via RL on a ton of cheap and non-robust environments was necessary for capabilities competitiveness, but you also have access to a smaller quantity of robust environments. Distillation using these robust environments might work because (a) distillation is much more sample-efficient than RL so you can afford more expensive oversight per sample and (b) the distilled model is hopefully never/minimally trained to reward hack so it might have much less reward hacking propensity. However, on point (b), emergent misalignment results suggest there might be some transfer of reward hacking propensities to the student.