Another mitigation against reward hacking (inspired by this):
Train the model to generate 2 responses:
This might beat "just use high quality human feedback after RL" because
This is similar to "tell the model to reward hack during training" mixed with "use high quality feedback at the end", but here it's all intermixed throughout training and there's self-critique.
It's probably too complex/fancy for production usage, unless it were shown to work extremely well, which I think is <20% likely.
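To make the shape of this proposal concrete, here is a minimal toy sketch of the training loop. Since the details of the two responses are elided above, it assumes one response is explicitly aimed at maximizing the ordinary (hackable) reward and the other at the developer's intent, with the intended response and a self-critique graded by occasional high-quality feedback intermixed throughout training. Every function name below is a hypothetical placeholder, not a real training API.

```python
import random

# Toy stand-ins: in a real setup these would be an actual policy model,
# a learned (hackable) reward model, and a trusted high-quality grader.
# All names here are hypothetical placeholders.

def sample_response(prompt: str, mode: str) -> str:
    """Sample a completion from the policy under a mode tag
    ("max_reward" or "intended")."""
    return f"[{mode}] response to: {prompt}"

def sample_critique(prompt: str, response: str) -> str:
    """Have the model critique its own max-reward response,
    flagging any reward hacks it can identify."""
    return f"critique of: {response}"

def cheap_reward(prompt: str, response: str) -> float:
    """Ordinary training reward (cheap, potentially hackable). Toy value here."""
    return random.random()

def trusted_reward(prompt: str, output: str) -> float:
    """Expensive high-quality feedback (e.g. careful human review). Toy value here."""
    return random.random()

def reinforce(prompt: str, output: str, reward: float) -> None:
    """Placeholder for a policy-gradient update on (prompt, output, reward)."""
    pass

# Small fraction of episodes that get the expensive trusted feedback,
# spread throughout training rather than saved for a final phase.
TRUSTED_FRACTION = 0.05

def train_step(prompt: str) -> None:
    # Response 1: explicitly aimed at maximizing the ordinary reward,
    # so any reward hacking is channeled into this labeled response.
    hacky = sample_response(prompt, mode="max_reward")
    reinforce(prompt, hacky, cheap_reward(prompt, hacky))

    # Self-critique of response 1, so the model practices articulating
    # how the reward was (or could have been) gamed.
    critique = sample_critique(prompt, hacky)

    # Response 2: the model's attempt at the developer's actual intent.
    intended = sample_response(prompt, mode="intended")

    # Only a small, intermixed fraction of episodes gets trusted feedback,
    # which grades both the intended response and the critique.
    if random.random() < TRUSTED_FRACTION:
        reinforce(prompt, intended, trusted_reward(prompt, intended))
        reinforce(prompt, critique, trusted_reward(prompt, critique))

if __name__ == "__main__":
    for prompt in ["task A", "task B", "task C"]:
        train_step(prompt)
```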
This is a quick list of interventions that might help fix issues caused by reward hacking.
(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.
Here is the list (we don't claim novelty):
Feel free to comment with any others.