Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex. As we described in a previous...
Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively myopic (because they think about the future) or value myopic (because the transformer...
Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take...
Thanks to Garrett Baker, David Udell, Alex Gray, Paul Colognese, Akash Wasil, Jacques Thibodeau, Michael Ivanitskiy, Zach Stein-Perlman, and Anish Upadhayay for feedback on drafts, as well as Scott Viteri for our valuable conversations. Various people at Conjecture helped develop the ideas behind this post, especially Connor Leahy and Daniel...
> Show me your original face before you were born.
>
> — Variation of the Zen koan

'The Mask' by Rozzi Roomian, with DALL-E 2 outpainting

I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models...
Several of the ideas in this post originated in conversations between the authors months ago. I did the work involved in exploring them and thinking of new ideas and framings, as well as writing this post; old writings by Janus were the inspiration for some of the directions of thought...
This is a note I wrote about a year ago. It's fairly self-contained, so I decided to make a post out of it after Vladimir_Nesov's comment caused me to dig up this text and TsviBT's "The Thingness of Things" reminded me of it again. "Simulacra" refer to things simulated by...