AI ALIGNMENT FORUM
AF

Rauno Arike
Ω57213
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
2Rauno's Shortform
10mo
0
steve2152's Shortform
Rauno Arike4mo30

Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to touch the code at all. "Please rewrite my code and get all tests to pass. Don't cheat." seems like a corner case to me—to decide whether that's specification gaming, we would need to understand the implicit specifications that the phrase "don't cheat" conveys.

Reply
steve2152's Shortform
Rauno Arike5mo*71

I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:

  • Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
  • Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.

Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research's recent paper Demonstrating specification gaming in reasoning models—I've seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.

Reply
Transparent Newcomb's Problem
6mo
(-1)
AI Safety Cases
10mo
(+355)
AI Safety Cases
10mo
21Hidden Reasoning in LLMs: A Taxonomy
19d
0
19Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
1mo
0
19Clarifying the confusion around inner alignment
3y
0