nostalgebraist

the void

A long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment. Multiple people have asked me whether I could post this on LW in some form, hence this linkpost. ~17,000 words. Originally written on June 7, 2025. (Note: although I expect this...

Jun 11, 2025 · 426

the case for CoT unfaithfulness is overstated

[Quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.] In research discussions about LLMs, I often...

Sep 29, 2024 · 271

nostalgebraist's Shortform

May 27, 2024 · 7

OpenAI API base models are not sycophantic, at any size

In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy *increasing dramatically with model size*...

Aug 29, 2023 · 184

chinchilla's wild implications

(Colab notebook here.) This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.[1] The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular:...

Jul 31, 2022 · 425

wrapper-minds are the enemy

This post is a follow-up to "why assume AGIs will optimize for fixed goals?". I'll assume you've read that one first. I ended the earlier post by saying:

> [A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be. An AGI...

Jun 17, 2022 · 108

why assume AGIs will optimize for fixed goals?

When I read posts about AI alignment on LW / AF / Arbital, I almost always find a particular bundle of assumptions taken for granted:

* An AGI has a single terminal goal[1].
* The goal is a fixed part of the AI's structure. The internal dynamics of the AI, if...

Jun 10, 2022 · 161

nostalgebraist

the void

chinchilla's wild implications

the case for CoT unfaithfulness is overstated

interpreting GPT: the logit lens
