Bronson Schoen

A Toy Environment For Exploring Reasoning About Reward

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment. Setup When we noticed the increase in verbalized alignment...

Mar 2555

Metagaming matters for training, evaluation, and oversight

Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. 1. Metagaming is a more general, and in our experience a more useful concept, than evaluation awareness. 2. It arises in frontier training runs...

Mar 1871

Stress Testing Deliberative Alignment for Anti-Scheming Training

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025133

Frontier Models are Capable of In-context Scheming

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show. Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716 Twitter about o1 system...

Dec 5, 2024211

Bronson Schoen

Bronson Schoen

Metagaming matters for training, evaluation, and oversight

Stress Testing Deliberative Alignment for Anti-Scheming Training

Ablations for “Frontier Models are Capable of In-context Scheming”

Frontier Models are Capable of In-context Scheming

Bronson Schoen

Metagaming matters for training, evaluation, and oversight

Stress Testing Deliberative Alignment for Anti-Scheming Training

Ablations for “Frontier Models are Capable of In-context Scheming”

Frontier Models are Capable of In-context Scheming

A Toy Environment For Exploring Reasoning About Reward

Metagaming matters for training, evaluation, and oversight

Stress Testing Deliberative Alignment for Anti-Scheming Training

Frontier Models are Capable of In-context Scheming