Jérémy Scheurer

Stress Testing Deliberative Alignment for Anti-Scheming Training

by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, Alex Meinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 · 133

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, and Marius Hobbhahn

Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper. Summary...

Mar 17, 2025 · 189

Frontier Models are Capable of In-context Scheming

by Marius Hobbhahn, Alex Meinke, Bronson Schoen, rusheb, Jérémy Scheurer, and Mikita Balesni

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show. Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716 Twitter about o1 system...

Dec 5, 2024 · 211

An Opinionated Evals Reading List

by Marius Hobbhahn and Jérémy Scheurer

While you can make a lot of progress in evals with tinkering and paying little attention to the literature, we found that various other papers have saved us many months of research effort. The Apollo Research evals team thus compiled a list of what we felt were important evals-related papers....

Oct 15, 2024 · 65

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities

by Axel Højmark, fidgetsinner, Arjun Panickssery, Marius Hobbhahn, and Jérémy Scheurer

Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer. Update: See also our paper on this topic accepted to the NeurIPS 2024 SoLaR Workshop. Introduction: To mitigate risks from future AI systems, we need to assess their capabilities accurately....

Jul 22, 2024 · 69

Apollo Research 1-year update

by Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, Alex Meinke, and rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research About Apollo Research Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 2024 · 93

We need a Science of Evals

by Marius Hobbhahn and Jérémy Scheurer

This is a linkpost for https://www.apolloresearch.ai/blog/we-need-a-science-of-evals In this post, we argue that if AI model evaluations (evals) want to have meaningful real-world impact, we need a “Science of Evals”, i.e. the field needs rigorous scientific processes that provide more confidence in evals methodology and results. Model evaluations allow us to...

Jan 22, 2024 · 75

Announcing Apollo Research

Frontier Models are Capable of In-context Scheming

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Stress Testing Deliberative Alignment for Anti-Scheming Training
