Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this. The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore...
In the spirit of better late than never - this has been sitting in drafts for a couple months now. Big thanks for Aryan Bhatt for helpful input throughout. Thanks to Abhay Sheshadri for running a bunch of experiments on other models for me. Thanks to Rowan Wang, Stewy Slocum,...
This post has an accompanying SPAR project! Apply here if you’re interested in working on this with me. Huge thanks to Mikita Balesni for helping me implement the MVP. Regular-sized thanks to Aryan Bhatt, Rudolph Laine, Clem von Stengel, Aaron Scher, Jeremy Gillen, Peter Barnett, Stephen Casper, and David Manheim...
This post summarizes research conducted under the mentorship of Evan Hubinger, and was assisted by collaboration with Pranav Gade, discussions with Adam Jermyn, and draft feedback from Yonadav Shavit. Summary 1. Forwarding priors is a subproblem of deceptive alignment, because if we want to use regularization to create a prior...
This post was written under the mentorship of Evan Hubinger, and assisted by discussions with Adam Jermyn and Johannes Treutlein. See also their previous posts on this project. Summary 1. Conditioning Generative Models (CGM) is a strategy to accelerate alignment research using powerful semi-aligned future language models, which seems potentially...