Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure. Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette...
Thanks to Arun Jose, Joseph Bloom, and Johannes Treutlein for feedback/discussions. Introduction The release of AutoGPT prompted discussion of the potential for such systems to turn non-agentic LLMs into agentic systems that pursue goals, and of the dangers that could follow. The relation of such systems to the alignment...
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep...
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Arun Jose, and Dane Sherburn for discussions/feedback. Introduction This post aims to extend the auditing games framework to high-level interpretability;...
Or ... How penalizing computation used during training disfavors deception. This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Yonadav Shavit, and Arun Jose for helpful discussions and feedback....