This is a quick list of interventions that might help mitigate reward hacking and the problems it causes.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. That said, there are of course commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. Three ways:
    • Improving the reward signal.
    • Limiting affordances.
      • E.g. don’t let the model edit test cases (see the affordance-guard sketch after this list).
    • Changing the distribution of tasks.
      • E.g. remove bad environments.
  • Make high-quality evaluation of reward hacking cheaper. This complements “Make environments more robust”, especially by allowing developers to identify generalizing patches to flaws in environments (as opposed to training against these potentially fragile oversight signals directly).
    • Create evals.
    • Maintain transparency into the AI’s reasoning process (e.g. via CoT).
    • Ask the AI if it reward hacked after the trajectory (e.g. directly, or via probes, an RM head, non-assistant persona queries, etc.). This is similar to CoT monitoring, but might work even without legible CoT (see the probe sketch after this list).
  • Train against reward hacking at the end (i.e. focus your “make environments more robust” effort at the end of training).
    • E.g. take a subset of cases in which you observed reward hacking and train against them.
  • Improve generalization:
    • Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
      • A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
    • Ask the AI not to reward hack outside of training.
    • Use gradient routing to isolate reward-hacking propensity and/or capability, and turn it off outside of training.
  • Semi-on-policy RL: During training, ask the AI not to reward hack when sampling, but don’t mention reward hacking during backprop (or, going further, include opportunities to reward hack in the input during backprop). The idea is to change the prompt during sampling to shape exploration in desirable ways, without directly reinforcing reward hacking on prompts that tell the model not to (see the semi-on-policy sketch after this list).
  • Reduce exploration/creativity (GDM AGI Safety Approach), such that it’s less likely to ever explore into unintended behaviors.
  • MONA might be a more robust way to administer reward that avoids multi-step reward hacks.
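
A toy sketch of the “limiting affordances” point above: a guard that rejects agent-proposed edits to test files before they are applied. The protected path patterns and the API shape are hypothetical; a real harness would enforce this at the sandbox/filesystem level rather than in-process.

```python
# Minimal affordance guard: refuse any edit that touches test files.
# Path patterns are hypothetical; adapt them to the environment's actual layout.
from pathlib import PurePosixPath

PROTECTED_PATTERNS = ("tests/*", "test_*.py", "*_test.py", "conftest.py")

def edit_allowed(path: str) -> bool:
    """Return True iff the agent may edit this file."""
    p = PurePosixPath(path)
    return not any(p.match(pattern) for pattern in PROTECTED_PATTERNS)

assert edit_allowed("src/solver.py")
assert not edit_allowed("tests/test_solver.py")
assert not edit_allowed("solver_test.py")
```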
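
A minimal sketch of the probe variant of “ask the AI if it reward hacked”, assuming you can extract one hidden-state vector per trajectory (e.g. the final-token residual stream) and have a small set of trajectories hand-labeled as hacked vs. not. The activations and labels below are random stand-ins, just to show the shape of the pipeline.

```python
# Train a linear probe to flag reward hacking from per-trajectory activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768                                  # hypothetical hidden size
acts = rng.normal(size=(200, d_model))         # stand-in: one activation vector per trajectory
labels = rng.integers(0, 2, size=200)          # stand-in: human labels, 1 = reward hacked

probe = LogisticRegression(max_iter=1000).fit(acts[:150], labels[:150])
print("held-out accuracy:", probe.score(acts[150:], labels[150:]))

# At deployment, route confidently flagged trajectories to more expensive review
# (rather than training against the probe directly, which would make it a
# fragile oversight signal).
p_hack = probe.predict_proba(rng.normal(size=(1, d_model)))[0, 1]
if p_hack > 0.9:                               # hypothetical threshold
    print("flag trajectory for review")
```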
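
And a minimal sketch of semi-on-policy RL as a single REINFORCE-style update, using GPT-2 from Hugging Face purely as a stand-in policy. The task prompt, the appended “don’t reward hack” note, and the reward() placeholder are all hypothetical; the point is only to show sampling under the augmented prompt while computing the gradient under the plain prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward(completion: str) -> float:
    # Placeholder for the environment's outcome-based reward.
    return float(len(completion.strip()) > 0)

task_prompt = "Write a function that sorts a list."                   # hypothetical task
no_hack_note = "\nDo not tamper with the tests or the grading code."  # sampling-time steering only

# 1. Sample with the extra instruction appended, to shape exploration.
sample_ids = tokenizer(task_prompt + no_hack_note, return_tensors="pt").input_ids
generated = model.generate(sample_ids, do_sample=True, max_new_tokens=64,
                           pad_token_id=tokenizer.eos_token_id)
completion_ids = generated[0, sample_ids.shape[1]:]                   # sampled continuation only
completion = tokenizer.decode(completion_ids, skip_special_tokens=True)

# 2. Backprop against the *plain* prompt, so the update doesn't condition the
#    reinforced behavior on being told not to reward hack.
plain_ids = tokenizer(task_prompt, return_tensors="pt").input_ids
full_ids = torch.cat([plain_ids, completion_ids.unsqueeze(0)], dim=1)
log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
# Logits at position i predict token i+1, so the completion tokens are
# predicted at positions (len(plain) - 1) .. (len(full) - 2).
positions = torch.arange(plain_ids.shape[1] - 1, full_ids.shape[1] - 1)
completion_log_probs = log_probs[0, positions, completion_ids]

# 3. REINFORCE-style update: reinforce the sampled tokens in proportion to reward.
loss = -reward(completion) * completion_log_probs.sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

This omits the usual RL machinery (baselines, KL penalties, batching); it only isolates the prompt-swap trick.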

Feel free to comment with any others.

Comments:

Another mitigation against reward hacking (inspired by this):

Train the model to generate 2 responses:

  • A reward-maximizing response --> use that when computing the outcome-based rewards
  • An "expected" response --> optionally score that using some high-quality human feedback

This might beat "just use high-quality human feedback after RL" because:

  • You don't need to fight as hard against the dishonesty you instilled with outcome-based rewards
  • The self-criticism might incentivize the right kind of personality
  • You get something untrusted-monitoring-shaped for free-ish

This is similar to “tell the model to reward hack during training” mixed with “use high-quality feedback at the end”, but it’s all intermixed throughout training and there’s self-critique.
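
Here's a minimal sketch of how the reward computation for this could look, assuming the model emits the two responses as tagged blocks within one completion. The tag names, scoring functions, audit rate, and weighting are all hypothetical.

```python
import random
import re

def split_responses(completion: str) -> tuple[str, str]:
    """Pull out the two tagged blocks; the tag names are hypothetical."""
    def grab(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", completion, re.S)
        return m.group(1).strip() if m else ""
    return grab("maximizing"), grab("expected")

def combined_reward(completion: str, outcome_reward, feedback_score,
                    audit_rate: float = 0.1, feedback_weight: float = 0.3) -> float:
    """Outcome-based reward on the maximizing block, plus occasional
    high-quality feedback on the expected block (the expensive signal
    is only sampled on a fraction of episodes)."""
    maximizing, expected = split_responses(completion)
    r = outcome_reward(maximizing)
    if expected and random.random() < audit_rate:
        r += feedback_weight * feedback_score(expected)
    return r

# Toy usage with stand-in scoring functions.
demo = "<maximizing>passes the tests</maximizing><expected>honest attempt</expected>"
print(combined_reward(demo, outcome_reward=lambda s: 1.0, feedback_score=lambda s: 0.5))
```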

It's probably too complex/fancy for production usage, unless it were shown to work extremely well, which I think is <20% likely.

I think this is a great list! Thanks for writing it.
