This post was inspired by orthonormal's post Developmental Stages of GPTs and the discussion that followed, so only part of it is original.
First I'll aim to provide a crisper version of the argument for why GPT wants to mesa-optimize. Specifically, I'll explain a well-known optimization algorithm used in text generation, and argue that GPT can improve performance on its objective by learning to implement something like this algorithm internally.
Then I'll offer some ideas of mine about how we might change this.
Explanation of beam search
Our goal is to generate plausible text. We evaluate whether text is "plausible" by multiplying together all the individual word probabilities from our language model.
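To make this concrete, here is a minimal sketch of that plausibility score. The `log_prob_fn` is a made-up stand-in for a real language model's conditional probabilities; in practice you sum log-probabilities rather than multiplying raw probabilities, to avoid numerical underflow.

```python
import math

def plausibility(log_prob_fn, words):
    # Multiply per-word probabilities by summing their logs
    # (numerically stable), then exponentiate back.
    total = sum(log_prob_fn(words[:i], w) for i, w in enumerate(words))
    return math.exp(total)

# Toy stand-in for a language model: every word gets probability 0.5.
toy_lm = lambda prefix, word: math.log(0.5)

print(plausibility(toy_lm, ["cheddar", "is", "orange"]))  # ≈ 0.125 (0.5 ** 3)
```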
Greedy word selection has a problem: Since it doesn't do lookahead, it's liable to get stuck in a dead end. Let's say we give our system the following poem about cheeses and ask it to generate more text:
> Mozzarella is white
> So you can see it at night
> Cheddar is...
If our language model is decent, the word it will assign the highest probability to is "orange". But this creates a problem, because "orange" is a hard word to rhyme.
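The dead-end problem can be illustrated with a toy model (the probability tables below are made up for illustration, not from any real language model). Greedy selection commits to the locally best word and never reconsiders:

```python
# Toy next-word distributions (hypothetical, for illustration only).
PROBS = {
    (): {"nice": 0.5, "dog": 0.4},
    ("nice",): {"woman": 0.4, "house": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.1},
}

def greedy_decode(start, steps):
    words = list(start)
    for _ in range(steps):
        table = PROBS.get(tuple(words), {})
        if not table:
            break
        # Always grab the single most probable next word -- no lookahead.
        words.append(max(table, key=table.get))
    return words

print(greedy_decode([], 2))  # ['nice', 'woman'], plausibility 0.5 * 0.4 = 0.20
```

Here greedy picks "nice" first and ends up with a completion of plausibility 0.20, even though starting with "dog" would have allowed "dog has" at plausibility 0.4 × 0.9 = 0.36.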
Beam search is an attempt to solve this problem. Instead of picking the next word greedily, we explore the tree of completions and try to find a multi-word completion that maximizes the product of the individual word probabilities.
Because the English vocabulary is so large, this tree grows exponentially with depth. So we choose an integer beam_width for the number of partial completions to track, and each time we take another step deeper into the tree, we discard all but the most plausible beam_width partial completions.
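A minimal sketch of that procedure, again using made-up probabilities rather than a real language model (a production decoder would also prune the vocabulary at each step instead of scoring every word):

```python
import math

# Toy conditional probabilities (hypothetical, for illustration only).
PROBS = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
}
VOCAB = sorted({w for table in PROBS.values() for w in table})

def toy_lm(prefix, word):
    # log P(word | prefix); unseen continuations get a tiny probability.
    return math.log(PROBS.get(tuple(prefix), {}).get(word, 1e-9))

def beam_search(log_prob_fn, vocab, prefix, beam_width, depth):
    beams = [(0.0, list(prefix))]  # (sum of log-probs, words so far)
    for _ in range(depth):
        candidates = [
            (score + log_prob_fn(words, w), words + [w])
            for score, words in beams
            for w in vocab
        ]
        # Discard all but the beam_width most plausible partial completions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_score, best_words = beam_search(toy_lm, VOCAB, [], beam_width=2, depth=2)[0]
print(best_words)  # ['dog', 'has'] -- greedy would have committed to 'nice'
```

With a beam width of 2, the search keeps "dog" alive even though "nice" looks better after one word, and so discovers the higher-plausibility completion "dog has" (0.36 vs. 0.20).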
Beam search with a beam width of 2. The bold red path corresponds to the maximum-plausibility completion, which would not get discovered by greedy search because "nice" has a higher probability than "dog". Image stolen from this Hugging Face blog post, which has another explanation of beam search.