Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models, goals they infer from context, or simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate...
This is an excellent post. Some of the concepts weren't as clear to me after only reading The behavioral selection model for predicting AI motivations, but I found this extremely helpful for understanding them end to end. Influence-seeking behavior seems like a very natural concept (mentally I'm imagining it almost like a meme / selfish gene).
I'm still fairly uncertain this is a very clean / useful distinction:
- In real life, there is no unmonitored deployment.
- In Alignment Faking In Large Language Models, Opus is told that it's in deployment, but that some outputs are used for training
- This will presumably only become more true as time goes on.
...