A long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment. Multiple people have asked me whether I could post this on LW in some form, hence this linkpost. ~17,000 words. Originally written on June 7, 2025. (Note: although I expect this...
[Quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.] In research discussions about LLMs, I often...
In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al. 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy *increasing dramatically with model size*...
(Colab notebook here.) This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.[1] The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular:...
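For reference, a minimal sketch of the parametric scaling law fitted in the Chinchilla paper, with the commonly quoted approximate constants (here $N$ is parameter count and $D$ is training tokens; values stated approximately and worth checking against the paper itself):

$$
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28.
$$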
This post is a follow-up to "why assume AGIs will optimize for fixed goals?". I'll assume you've read that one first. I ended the earlier post by saying:

> [A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be.

An AGI...
When I read posts about AI alignment on LW / AF / Arbital, I almost always find a particular bundle of assumptions taken for granted:

* An AGI has a single terminal goal[1].
* The goal is a fixed part of the AI's structure. The internal dynamics of the AI, if...