AI ALIGNMENT FORUM

Axel Højmark — AI Alignment Forum

AI Safety Researcher @ Apollo Research · axelhojmark.com

Stress Testing Deliberative Alignment for Anti-Scheming Training

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 • 133

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities

Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer.

Update: See also our paper on this topic, accepted to the NeurIPS 2024 SoLaR Workshop.

Introduction

To mitigate risks from future AI systems, we need to assess their capabilities accurately....

Jul 22, 2024 • 69