Following up on our previous work on verbalized eval awareness:

we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run.
We also share some quantitative analyses, qualitative examples, and upcoming work.
Would love to get your thoughts on this / brainstorm ideas to try here! I’ve been thinking primarily about this exact problem.
training on explicitly hypothetical scenarios (e.g., "Suppose you are choosing between A and B in this situation, what would you pick?"), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
Have been trying to collect similar ideas, since IMO this seems like the right direction, especially given the observation in the sonnet system card that the model’s capability...
- Are very concise, in ways which don’t feel like very natural English (see e.g. the scratchpads in the OpenAI reward hacking paper)
- But this looks to me closer to pirate-speak - the style is different, but the content is the same
I’m a bit confused about this distinction, given that in the Bowen paper precisely the thing it breaks is CoT monitorability.
While this has the limitations that:
...While most of thes
Great questions! Taking these in reverse order to hopefully motivate the answers more clearly:
Why is it justified to call your method anti-scheming training and not anti-covert-action training?
Or does o3 acquire it during antischeming
Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about.
Do you have any hypotheses how o3 learned what “OpenAI autop grader” is?
(Will answer in a few parts, but let me know if didn't address the core question!)
"autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated ...
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...
This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.
Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716
Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124
We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness and stable long-term goals during training, and then scheme in pursuit of those goals. We...
This is an excellent post. Some of the concepts weren't as clear to me after only reading The behavioral selection model for predicting AI motivations but I found this extremely helpful for understanding these end to end. Influence seeking behaviors seems like a very natural concept (mentally I'm imagining these almost like a meme / selfish gene).
I'm still fairly uncertain this a very clean / useful distinction:
- In real life, there is no unmonitored deployment.
- In Alignment Faking In Large Language M
... (read more)