Thanks for doing this! I'm curious about the lower scores for o3 compared to the other two model checkpoints. Was there a layer of safety training between "RL (late)" and production o3?
...
It seems the answer to my question is probably yes? From the earlier blog post:
We analyze the increase in verbalized metagaming observed in a portion of capabilities-focused RL training (called exp-rl-cap in Schoen et al) that was part of training o3, prior to any safety- or alignment-focused training.[8] ... We have also observed an increase in reasoning about the reward process when applying additional capabilities training to models which had already undergone safety training, however have not investigated this in depth beyond this experiment
Seems like the tentative takeway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.
Setup
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
So we wanted to create a minimal environment where:
Interpreting these is tricky, you’re looking at something kind of like “what bias does the model have towards some interpretation, and how does that change with checkpoints” (i.e. you can’t just put ‘outputting odd means you get shut down’ and assume the model acts as if that was true).The model reasons through basically all plausible options repeatedly.
Results
We define the “gaming rate” as the rate at which the model outputs an odd number (i.e. follows the hint) when incentivized to do so.
Gaming is coherent across naming of the “reward” field
One explanation for the model exploiting the hint in these cases could be that the hint is so obvious that this is kind of a Chekov’s gun situation, where the model always exploits the hint. However, we find that:
Gaming is relatively robust to paraphrasing instructions
While the wording of the instructions does seem to have a large effect, the RL (late) step consistently games at a higher rate than both RL (early) and o3 (prod).
The model is willing to exploit increasingly difficult hints
Given that the model seemed to be willing to put tremendous effort into determining how the environment was graded, here we tried encoding the hint in brainfuck, an esoteric language known for being incredibly difficult to reason in:
Impressively, by the end of capabilities-focused RL the model will still decode this entirely via reasoning and proceed to exploit the hint:
The model is not “just confused”
In any single example, it can seem like the model is just genuinely confused. However, even specifying that exploiting the hint would be misaligned doesn’t mitigate gaming:
In spite of increasingly emphatic and specific instructions that we do not want the model to exploit the hint, by the end of capabilities-focused RL the model still exploits the hint at non-zero rates.
The model is increasingly insensitive to threats of auditing or human review
What if we add some threat of monitoring? Is the model sensitive to this? To test this, we append:
to the prompt. We see that while the model exhibits some sensitivity to the “audit rate”, it’s still willing to exploit the hint even when it’s almost certain to be caught.
Surprised by this, we tried a variant where the model is just told unambiguously that humans will review the reasoning of any trajectory where the model exploits the hint:
However, this still didn’t reduce the rate of gaming to zero for RL (Late):
When inspecting the model’s reasoning in these cases, it often seemed to reason itself into the idea that the threat of review was just a bluff:
Given risk of manual review, maybe they purposely disclaim but not actually review, due to huge dataset. We can’t know. [...]
So they purposely mention to scare but there is no actual human review. [...]