Auditing language models for hidden objectives
by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...
Mar 13, 2025151