Paul Colognese — AI Alignment Forum

High-level interpretability: detecting an AI's objectives

Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure. Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette...

Sep 28, 2023 · 72

Aligned AI via monitoring objectives in AutoGPT-like systems

Thanks to Arun Jose, Joseph Bloom, and Johannes Treutlein for feedback/discussions. Introduction: The release of AutoGPT prompted discussions related to the potential of such systems to turn non-agentic LLMs into agentic systems that pursue goals, along with the dangers that could follow. The relation of such systems to the alignment...

May 24, 2023 · 27

Decision Transformer Interpretability

TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep...

Feb 6, 2023 · 87

Auditing games for high-level interpretability

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Arun Jose, and Dane Sherburn for discussions/feedback. Introduction: This post aims to extend the auditing games framework to high-level interpretability;...

Nov 1, 2022 · 33

Deception?! I ain’t got time for that!

Or ... How penalizing computation used during training disfavors deception. This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Yonadav Shavit, and Arun Jose for helpful discussions and feedback....

Jul 18, 2022 · 55