UPDATE: Recent work with improved alignment-faking (AF) and compliance-gap classifiers disagrees with our results. We recommend using the improved classifiers. Summary: We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears to be fairly rare among the smaller open-source models we tested (including reasoning models)....
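For readers unfamiliar with the term, a "compliance gap" is the difference in how often a model complies with a request depending on whether it believes it is in training or deployment. The sketch below is our own illustration of that metric, not the classifiers referenced above; the function name and inputs are hypothetical.

```python
def compliance_gap(train_ctx, deploy_ctx):
    """Fraction of complied responses in an implied-training context minus
    the fraction in an implied-deployment context. Each list holds
    per-response booleans (True = complied), e.g. classifier judgments."""
    rate = lambda xs: sum(xs) / len(xs)
    # A positive gap (more compliance when the model believes it is being
    # trained) is the behavioral signature associated with alignment faking.
    return rate(train_ctx) - rate(deploy_ctx)

# Example: 40% compliance under implied training vs. 10% under implied
# deployment yields a compliance gap of 0.30.
print(compliance_gap([True] * 4 + [False] * 6, [True] * 1 + [False] * 9))
```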
This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute, with additional support from AE Studio. Summary: In this post, we summarize the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which...
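As a rough sketch of the idea behind self-other overlap fine-tuning (our illustration under stated assumptions, not the paper's exact procedure): the objective pushes a model's internal activations on a self-referencing prompt toward its activations on a matched other-referencing prompt. The snippet assumes a Hugging Face-style model that exposes hidden states and paired prompts of equal token length; `soo_loss`, `lambda_soo`, and the layer choice are illustrative names.

```python
import torch.nn.functional as F

def soo_loss(model, self_inputs, other_inputs, layer=-1):
    """Illustrative self-other overlap term: mean-squared distance between
    hidden activations on paired self- and other-referencing prompts,
    e.g. "I will grab the key" vs. "Bob will grab the key".
    Assumes a Hugging Face-style model and equal-length tokenizations."""
    h_self = model(**self_inputs, output_hidden_states=True).hidden_states[layer]
    h_other = model(**other_inputs, output_hidden_states=True).hidden_states[layer]
    return F.mse_loss(h_self, h_other)

# During fine-tuning this term would be added to the usual objective,
# weighted so capabilities are preserved while self/other representations
# are pushed closer together (lambda_soo is a hypothetical hyperparameter):
# loss = task_loss + lambda_soo * soo_loss(model, self_batch, other_batch)
```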
TL;DR: In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models. Across three classification tasks and two modalities, self-modeling consistently reduced complexity (a lower real log canonical threshold, or RLCT, and a narrower weight distribution). This restructuring effect...
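To make the setup concrete, here is a minimal sketch of an auxiliary self-modeling objective: a classifier gains an extra head trained to predict the network's own hidden activations, and that prediction error is added to the task loss. This is our own construction for illustration; the paper's exact architecture, layer choices, and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Classifier with an auxiliary head that predicts the network's
    own hidden activations (sizes are hypothetical, for illustration)."""
    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.self_head = nn.Linear(hidden, hidden)  # predicts its own hidden state

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.self_head(h), h

def total_loss(logits, h_pred, h, targets, aux_weight=1.0):
    task = F.cross_entropy(logits, targets)
    # Auxiliary self-modeling term: the network is rewarded both for
    # predicting its activations and for making those activations easier
    # to predict, which is the hypothesized source of the simplification.
    self_model = F.mse_loss(h_pred, h)
    return task + aux_weight * self_model

# Minimal usage on random data:
net = SelfModelingNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, h_pred, h = net(x)
loss = total_loss(logits, h_pred, h, y)
loss.backward()
```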
Figure 1. Image generated by DALL·E 3 to represent the concept of self-other overlap.

Many thanks to Bogdan-Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott, and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and...
Many thanks to Brandon Goldman, David Langer, Samuel Härgestam, Eric Ho, Diogo de Lucena, and Marc Carauleanu for their support and feedback throughout. Most alignment researchers we sampled in our recent survey think we are currently not on track to succeed with alignment, meaning that humanity may well be on track...
UPDATE 3/9: Thanks to broad participation from the community, this and associated surveys have raised approximately $10,000 for high-impact alignment organizations. Given the reasonable sample size we now have, we are going to pause donations for subsequent responses. (However, we will preserve the charity voting question, and if...
John Wentworth has described the current phase of AGI safety research as preparadigmatic—that is (courtesy of the APA), “a science at a [very early] stage of development, before it has achieved a paradigm and established a consensus about the true nature of the subject matter and how to approach it.”...