Austin Meek

120

1

3y

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract: We study the feasibility of conducting...

Mar 13, 2025 · 155

Paper: Understanding and Controlling a Maze-Solving Policy Network

by TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, and lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from "Understanding and controlling a maze-solving policy network", "Maze-solving agents: Add a top-right vector, make the agent go to the top-right", and "Behavioural statistics for a maze-solving agent".

Abstract: To understand the goals and goal representations of AI systems,...

Oct 13, 2023 · 70