As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very low, but not fully negligible, risk of misaligned autonomous actions that substantially contribute to later catastrophic outcomes. We also...
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injurious completions. If we had accomplished that, we think...
(Update: We think the tone of this post was overly positive given our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.) This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used...
I liked Ajeya's post a lot, and I think the alignment community should try to do sandwiching projects along the lines she describes. In this post I want to flesh out some potential criteria for a good sandwiching project; there's not much original thinking involved, but I found it helpful to...