Rauno Arike — AI Alignment Forum

I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:

Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.

Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research's recent paper Demonstrating specification gaming in reasoning models—I've seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.

Steve Byrnes’s Shortform

Rauno Arike6mo30

Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to touch the code at all. "Please rewrite my code and get all tests to pass. Don't cheat." seems like a corner case to me—to decide whether that's specification gaming, we would need to understand the implicit specifications that the phrase "don't cheat" conveys.

Rauno Arike6mo*71

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments