This post is part of a sequence on LLM psychology.

TL;DR: We introduce our perspective on a top-down approach for exploring the cognition of LLMs by studying their behavior, which we refer to as LLM psychology. In this post we take the mental stance of treating LLMs as “alien minds,”...
This concept has recently become a core part of my toolkit for thinking about the world, and I find it helps explain a lot of things that previously felt confusing to me. Here I explain how I understand “direction of fit,” and give some examples of where I find the...
Thanks to Simon Celinder, Quentin Feuillade--Montixi, Nora Ammann, Clem von Stengel, Guillaume Corlouer, Brady Pelkey and Mikhail Seleznyov for feedback on drafts. This post was written in connection with the AI Safety Camp.

Executive Summary: This document proposes an approach to corrigibility that focuses on training generative models to function...
Thanks to Garrett Baker, David Udell, Alex Gray, Paul Colognese, Akash Wasil, Jacques Thibodeau, Michael Ivanitskiy, Zach Stein-Perlman, and Anish Upadhayay for feedback on drafts, as well as Scott Viteri for our valuable conversations. Various people at Conjecture helped develop the ideas behind this post, especially Connor Leahy and Daniel...
Thanks to Dan Braun, Ze Shen Chin, Paul Colognese, Michael Ivanitskiy, Sudhanshu Kasewa, and Lucas Teixeira for feedback on drafts. This work was carried out while at Conjecture. This post is a loosely structured collection of thoughts and confusions about search and mesa-optimization and how to look for them in...