Nicholas Goldowsky-Dill

Stress Testing Deliberative Alignment for Anti-Scheming Training

by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 | 133

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper. Summary...

Mar 17, 2025 | 189

Detecting Strategic Deception Using Linear Probes

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive.

Abstract:
> AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient,...

Feb 6, 2025 | 104

Nicholas Goldowsky-Dill's Shortform

Nov 6, 2024 | 5

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

by Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, and Nicholas Goldowsky-Dill

Why we made this list: * The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that...

Jul 18, 2024 | 126

Apollo Research 1-year update

by Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, and rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 2024 | 93

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

by Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, and Marius Hobbhahn

This is a linkpost for our two recent papers:
1. An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...

May 20, 2024 | 108

Nicholas Goldowsky-Dill

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Stress Testing Deliberative Alignment for Anti-Scheming Training

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
