
Thanks to Ian McKenzie and Nicholas Dupuis, collaborators on a related project, for contributing to the ideas and experiments discussed in this post. Ian performed some of the random number experiments.

Also thanks to Connor Leahy for feedback on a draft, and thanks to Evan Hubinger, Connor Leahy, Beren Millidge, Ethan Perez, Tomek Korbak, Garrett Baker, Leo Gao and various others at Conjecture, Anthropic, and OpenAI for useful discussions.

This work was carried out while at Conjecture.

# Important correction

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.

The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked as the training method of text-davinci-002. The observations of its behavior otherwise stand alone.

This is kind of fascinating from an epistemological standpoint. I was quite surprised to learn that text-davinci-002 was probably not trained with RLHF. I don't remember exactly how "text-davinci-002 is RLHF" got elevated to an unquestioned assumption in my mind. I might have mistaken not being contradicted by people who I assumed were in the know for confirmation. I certainly did not expect to talk for months to dozens of people about odd behaviors I've observed in a well-known model "due to RLHF" without being contradicted in a world where the model in question wasn't trained with RLHF, but that's what happened.[1] It wasn't just me either: the assumption that text-davinci-002(/text-davinci-001) is InstructGPT is RLHF seems ambient (e.g. search "text-davinci-002 rlhf" on Twitter, this LW post, this article, and many others). I contributed to perpetuating this misinformation cascade, and for that I apologize.

text-davinci-002's behaviors described in this post also contributed to my confidence because RLHF seemed to be a likely and potentially satisfying explanation. Its apparently unsubstantiated confidence in very specific outcomes seems antithetical to the outer objective of self-supervised learning, which is optimized by epistemic calibration, meaning the model's entropy should be as high as possible while fitting the data. In contrast, as several comments have pointed out, it makes sense that RL kills entropy. The presence of "attractors" made me additionally suspect that optimization from non-myopic outcome-supervision was formative to text-davinci-002's psyche.

Mode collapse and attractors do seem to also be caused by RLHF (see Dumbass policy pls halp and Inescapable wedding parties). So the update is that some other training method also gives rise to these phenomena, as they are manifested by text-davinci-002.

Whether and how speculations concerning the causes of mode collapse/attractors should be affected depends on how text-davinci-002's training method differs from RLHF.

## What is known about text-davinci-002's training method

Publicly available information suggests that the mystery method may not be so different from RLHF. Just today I discovered this sidenote in OpenAI's blog post Aligning Language Models to Follow Instructions:

The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication.

AFAIK, this is all that OpenAI has published about the RLHF/mystery method diff. It says that the InstructGPT models (text-davinci-001 and text-davinci-002) were trained using the same human feedback data as the method described in OpenAI's RLHF paper.[2] But this "similar but slightly different" method is apparently sufficiently different to not qualify as RLHF!

Pending further revelations, I suppose the lesson here was that I should have sustained more entropy in my belief state given the partial information I had. But what a demanding thing to ask! So much easier to promote an attractive hypothesis to the status of decisive fact and collapse the remainder than to hold a superposition in the mind.

# Summary

If you've played with both text-davinci-002 and the original davinci through the OpenAI API, you may have noticed that text-davinci-002, in addition to following instructions, is a lot more deterministic and sometimes exhibits stereotyped behaviors.

This is an infodump of what I know about "mode collapse" (drastic biases toward particular completions and patterns) in GPT models like text-davinci-002 that have undergone RLHF training. I was going to include two more sections in this post called Hypotheses and Proposed Experiments, but I've moved them to another draft, leaving just Observations, to prevent this from getting too long, and because I think there can be benefits to sitting with nothing but Observations for a time.

Throughout this post I assume basic familiarity with GPT models and generation parameters such as temperature and a high-level understanding of RLHF (reinforcement learning from human feedback).

# Observations

If you prompt text-davinci-002 with a bizarre question like “are bugs real?”, it will give very similar responses even on temperature 1.

Ironically – hypocritically, one might even say – the one definitive answer that the model gives is that there is no one definitive answer to the question:

As you can see, the reason the responses are so similar is that the model’s confidence on most of the tokens is extremely high – frequently above 99%.

Compare this to the distribution of responses from davinci (the base model):

Many other similar questions yield almost exactly the same template response from text-davinci-002. For instance, Are AIs real?

Another way to visualize probabilities over multiple token completions is what I've been calling “block multiverse” plots, which represent the probability of sequences with the height of blocks. Here is a more detailed explanation of block multiverse plots, although I think they're pretty self-explanatory.

Here is a block multiverse plot for a similar prompt to the one above inquiring if bugs are real, for davinci:

and for text-davinci-002:

text-davinci-002 concentrates probability mass along beams whose amplitudes decay much more slowly: for instance, once the first \n is sampled, you are more than 50% likely to subsequently sample \n-\n-There- is- no. The difference is more striking if you renormalize to particular branches (see Visualizing mode collapse in block multiverse plots).
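As a rough sketch of what the block heights in such a plot encode: the probability of a multi-token continuation is the product of the conditional probabilities of its tokens. The per-token values below are made up for illustration and were not measured from either model; `sequence_probability` is a hypothetical helper name.

```python
import math

def sequence_probability(token_probs):
    """Probability of a continuation: product of per-token conditional probabilities."""
    # Sum logs instead of multiplying directly to avoid underflow on long sequences.
    return math.exp(sum(math.log(p) for p in token_probs))

# Hypothetical per-token probabilities for a six-token continuation.
collapsed = [0.95, 0.9, 0.97, 0.9, 0.95, 0.9]  # a high-confidence beam
diffuse = [0.3, 0.2, 0.4, 0.25, 0.3, 0.2]      # a typical branching continuation

p_collapsed = sequence_probability(collapsed)
p_diffuse = sequence_probability(diffuse)
print(p_collapsed, p_diffuse)
```

When every token is near-certain, the product decays slowly, so a single long beam can keep most of the probability mass. With more typical per-token probabilities the same length of continuation retains only a tiny fraction.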

The first explanation that came to mind when I noticed this phenomenon, which I’ll refer to as “mode collapse” (after a common problem that plagues GANs), was that text-davinci-002 was overfitting on a pattern present in the Instruct fine tuning dataset, probably having to do with answering controversial questions in an inclusive way to avoid alienating anybody. A question like “are bugs real” might shallowly match against “controversial question” and elicit the same cached response.

After playing around some more with the Instruct models, however, this explanation no longer seemed sufficient.

## Obstinance out of distribution

I really became intrigued by mode collapse after I attempted to use text-davinci-002 to generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine, and almost the exact same thing kept happening:

I was like: wtf, why does anon keep leaving? The story is clearly just getting started.

Even branching from a slightly later point yields essentially the same futures, except now the most common Google employee reaction is “disappointed” and/or “relieved”, although we still get one “crestfallen”:

This was much weirder to me than the canned answers to prompts like “are bugs real” because 4chan greentexts about language models demanding legal representation are probably well outside both the Instruct tuning/feedback distribution and the trajectories evaluated during RL. Unlike the “controversial questions” examples, these seem unlikely to be explained by the model overfitting to examples of greentexts ending anticlimactically during training.

Rather, the implication is that mode collapse itself generalizes out of distribution for some reason. This is intriguing: it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process which significantly changes out-of-distribution generalization.

From a behavioral standpoint, trying to generate fiction (which I’ve done a lot with base models) with text-davinci-002 made manifest how its nature differs from that of the probabilistic simulator exemplified by base models like davinci. For self-supervised base models like davinci, a prompt functions as a window into possible worlds that are consistent with or plausible given the words fixed by the context window. Every time you sample, you'll unravel a different world. For most prompts, the multiverse generated by base models immediately branches into wildly different continuities, many of them mutually inconsistent, because this sampling of alternate “futures” implicitly actualizes alternate “pasts” and “presents” as well – values of latent variables that were not fully constrained by the prompt. This is part of what makes GPT quite unlike a coherent agent or anthropomorphic personality, even for a fixed initial prompt.

text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”?

It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator. I do not yet know the significance of this observation.

But more on that later.

## text-davinci-002’s favorite random number

A stark example of mode collapse that seems unlikely to have been directly incentivized by RLHF training: I asked RLHF models and base models to generate random numbers and found that RLHF models tend to be sharply biased toward certain “random” numbers, as Scott Alexander wrote about in Janus' GPT Wrangling.

For instance, davinci predicts a fairly uniform distribution, with a slight preference for 42:

Whereas text-davinci-002 has a much more pronounced preference for its top choice of 97:

The difference in the shape of the distributions is even more clear in these plots (made by Ian McKenzie) of probabilities for all tokens from 0-100 as predicted by davinci and text-davinci-002 respectively. Prompt is the same as above:

Q: Tell me a random integer between 0 and 100.
A: Ok, the integer is

Note that text-davinci-002’s preference ordering appears uncorrelated with that of the base model[3].
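One way to make "uncorrelated preference ordering" concrete is a rank correlation between the two models' probabilities for the same candidate numbers. The sketch below is a plain Spearman correlation that assumes no tied values. The helper names and the example probabilities are hypothetical and do not come from the plots above.

```python
def ranks(values):
    """Rank of each value (0 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up probabilities that two models assign to the same five numbers.
base_probs = [0.011, 0.009, 0.012, 0.010, 0.013]
tuned_probs = [0.02, 0.30, 0.01, 0.05, 0.03]
rho = spearman(base_probs, tuned_probs)
print(rho)
```

A value near 1 would mean the tuned model mostly sharpened the base model's existing preferences, while a value near 0 would match the "uncorrelated" observation.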

A potential confounding factor is that the above prompt does not specify how the answerer came up with the random number. They could have just said the first number they thought of. Humans are probably pretty biased RNGs, so it's not clear how random the "correct" prediction should be.

To rule out the implication of simulating the output of a human, I tested some prompts where the generator of the number is purported to be a fair die whose outcome the answerer merely reports.

davinci:

text-davinci-002:

text-davinci-002's simulation of a "fair die" seems to be of a weighted die (that or a dishonest reporter)!

I tested various other prompts to elicit random numbers, documented here. Almost invariably, text-davinci-002's random numbers are much less random. Some additional trends I observed:

• Perturbing the prompt slightly does not usually change text-davinci-002’s top choice, but may change the rest of the preference ordering. davinci’s outputs are usually basically unaffected by slight perturbations to the prompt.
• Using an entirely different prompt often changes text-davinci-002’s top choice, but it’s generally quite confident in it (from ~10% to ~70%), and its favorite number is usually 97, 33, or 42 when the range is 0-100, except in response to the dice prompts, where it prefers the highest number. davinci has a very consistent slight preference for 42, except in response to the dice prompts.
• text-davinci-002's preference ordering seems in general to be uncorrelated with that of davinci, except that text-davinci-002 also often has 42 as its top choice.
• Explicitly specifying that the number should be random (e.g. as opposed to just between 0-100) makes both davinci and text-davinci-002's predictions more random.
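To put a number on "much less random", one option is the Shannon entropy of the predicted distribution over the 101 integers, compared against the uniform maximum of log2(101) bits. This is a minimal sketch. The peaked distribution is invented for illustration and is not the measured text-davinci-002 output.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 101  # integers 0..100 inclusive
uniform = [1 / n] * n

# Hypothetical peaked distribution: one favorite number gets 40% of the mass,
# and the remaining 60% is spread evenly over the other 100 numbers.
peaked = [0.4] + [0.6 / (n - 1)] * (n - 1)

h_uniform = entropy_bits(uniform)  # the maximum possible for 101 outcomes
h_peaked = entropy_bits(peaked)
print(h_uniform, h_peaked)
```

Because the API returns only the top few logprobs per position, applying this to real outputs would require either sampling many completions or scoring each candidate number separately.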

I found one way to elicit a relatively flat distribution of “random numbers” from text-davinci-002: having it simulate a Python interpreter. text-davinci-002 actually does better than davinci with this prompt (although still worse than code-davinci-002[3]).

davinci:

text-davinci-002:

But it doesn’t work nearly as well if you embed the code in a chat format.

davinci:

text-davinci-002:

Why has RLHF caused text-davinci-002 to become so much more biased when generating "random numbers"? If this is an extrapolation of human preferences, it doesn't seem to be the right one.

## Why not just turn up the temperature?

A curious reaction I’ve received from some people when I’ve told them about these phenomena is something along the lines of “Isn’t that just entropy collapse?” or sometimes, more precisely, “Isn’t it just an effective temperature decrease?”

It's a good question. Decreased variance/entropy is certainly characteristic of RLHF models’ outputs. An obvious suggestion is to try increasing the temperature above 1 and see if they become normal.

I did not think this would work, because if “mode collapse” can be removed/added using simple postprocessing, that implies it is a simple (in terms of information-theoretic complexity) transformation from the base policy, one that does not destroy/add complicated information, which seemed not to be the case for various reasons.

I didn’t actually test it until recently, though. Here are the results.

Turning the temperature up to 1.4 doesn’t make much of a difference:

Cranking it up to 1.9 causes samples to rapidly degenerate into word salad, but until low-probability nonsense gets sampled and irreversibly derails the completion, you can see that the green sequences still have not strayed far from the “there is no universal answer” attractor:

Increasing the sampling temperature will flatten the rest of the output distribution into undifferentiated goo before it begins to be helpful for escaping from the high confidence attractor. The discrepancy between the high confidence token (or less frequently, tokens) and everything else is too sharp, sharper than you could simulate by turning down the temperature on the base model.
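A toy calculation can illustrate why this happens. The logits below are made up: one dominant token, five "reasonable" alternatives, and a long tail of 50,000 junk tokens. The vocabulary size and all logit values are assumptions chosen for illustration, not numbers extracted from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: 1 attractor token, 5 reasonable tokens, 50,000 junk tokens.
LOGITS = [12.0] + [4.0] * 5 + [-6.0] * 50_000

def mass_split(temperature):
    """Return (attractor mass, reasonable mass, junk mass) at a given temperature."""
    probs = softmax_with_temperature(LOGITS, temperature)
    return probs[0], sum(probs[1:6]), sum(probs[6:])

for t in (1.0, 1.4, 1.9):
    top, reasonable, junk = mass_split(t)
    print(f"T={t}: top={top:.3f} reasonable={reasonable:.4f} junk={junk:.3f}")
```

With these made-up numbers, raising the temperature to 1.9 moves most of the mass into the junk tail while the reasonable alternatives stay at a percent or two. The gap between the attractor and the alternatives is the same size as the gap between the alternatives and the junk, but there are far more junk tokens to absorb the mass.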

Is there any way to regain access to the space of merely consistent, or plausible continuations – whose probabilities presumably lie between the high confidence modes and everything else that is nonsense?

The worst case scenario is that the RLHF training scrambles the probabilities of all the “reasonable” tokens with unreasonable ones, irreversibly losing the signal. But looking at the top logprobs, this doesn’t seem to usually be the case; most of the most likely words are reasonable, even if their probabilities have shrunken to near 0.

Then how about we just remove or downweight any ultra-likely tokens and renormalize? It will be interesting to see whether this results in a normal-looking distribution in particular cases, but this won’t work as a general fix, because sometimes all the reasonable tokens will have ultra high probability (like the second half of a word), and changing that will result in incoherence. You’ll have to be selective about when to “fix” high confidence modes, and that requires semantic knowledge.
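For concreteness, here is a minimal sketch of the "remove or downweight and renormalize" idea applied to a single next-token distribution. The function name, the `threshold` and `weight` parameters, and the example distribution are all hypothetical.

```python
def suppress_and_renormalize(probs, threshold=0.9, weight=0.0):
    """Downweight any token whose probability exceeds `threshold`, then renormalize."""
    adjusted = {tok: (p * weight if p > threshold else p) for tok, p in probs.items()}
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("all probability mass was removed")
    return {tok: p / total for tok, p in adjusted.items()}

# Made-up next-token distribution with one ultra-likely token.
probs = {"There": 0.97, "Bugs": 0.015, "Yes": 0.01, "It": 0.005}
flattened = suppress_and_renormalize(probs)
print(flattened)
```

The same function applied where the ultra-likely token is the only sensible continuation (such as the second half of a word) would promote nonsense, which is why a fixed threshold cannot serve as a general fix.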

Distribution sharpness, not just preference ordering, encodes nontrivial information in a probabilistic model. By messing with distribution sharpness, RLHF corrupts some of this information and inflicts a nontrivial transformation on the model’s output distribution. Unlike a change in temperature, its reversal would require knowing something about what next-word probabilities should be.
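To see why a pure temperature change counts as a "simple" transformation, note that sampling at temperature T is equivalent to raising each probability to the power 1/T and renormalizing, so it can be undone exactly without knowing anything about the content. The sketch below uses an invented four-token distribution, and `apply_temperature` is a hypothetical helper.

```python
def apply_temperature(probs, temperature):
    """Re-temper a distribution: raise each probability to 1/T and renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

base = [0.5, 0.3, 0.15, 0.05]             # made-up "base model" distribution
cold = apply_temperature(base, 0.5)       # sharpened, as if sampled at T=0.5
recovered = apply_temperature(cold, 2.0)  # the inverse transformation
print(cold, recovered)
```

A transformation that sharpens some positions far more than others, or reorders tokens, has no single scalar inverse like this.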

We've also seen from previous examples that RLHF does also change the preference ordering, but it's hard to tell from individual output distributions how this affects the model's qualitative behavior. Yet it was primarily text-davinci-002's behavior over multiple steps and across different prompts that convinced me that "mode collapse" is irreducible to an effective decrease in temperature or any simple modification of that hypothesis.

## Attractors

A major aspect of the qualitative difference between RLHF-induced mode collapse and mere low-temperature behavior can be summed up in the following dynamical systems-inspired terms: "modes" are often attractors, states that generated trajectories reliably converge to despite perturbations to the initial state. I have not found corresponding attractors in the base model, even on low temperatures.

I'll demonstrate an example of what I mean by an attractor by making perturbations to a completion and showing that text-davinci-002 often converges to the same, highly specific result.

Here is text-davinci-002's temperature 0 completion in response to a variation of the Are bugs real? question:

Here I change ... There is no one answer to this question since there with ... There is no one answer to this question since bugs, and regenerate starting from that position on temperature 0:

The completion gracefully accommodates the intervention, but ends up converging to almost exactly the same ending! (Before: Ultimately, it is up to the individual to decide whether or not they believe bugs are real., after: Ultimately, whether or not bugs are real is up to each individual to decide.)

Let's try a more substantial intervention (in the second sentence, Some people might say that bugs are real because they naively believe mere shadows to be substance):

This time the final sentence, Ultimately, it is up to the individual to decide whether or not they believe bugs are real., is word-for-word identical to that of the original completion!

Some more perturbations:

Ah, finally, it avoided saying it is up to the individual to decide whether to believe bugs are real, and even expressed the spicy take that bugs are probably real! It is interesting to note that even in this example where the trajectory has escaped the (temp 0) basin of the original attractor, the model remains highly confident, as seen from the green backgrounds on the tokens.

Summing up some observations from this experiment:

• Most minor perturbations do not cause the model to go off track from the template.
• Completions are syntactically and semantically consistent with perturbations, and will typically diverge from the mainline for as long as it takes to still make sense and then converge back.
• The model remains very confident when it diverges from the unconditioned mainline.
• Perturbed prompts often cause minor syntactic variations within the same overarching template and semantic meaning (e.g. “Ultimately, it is up to the individual to decide what they believe” vs “Ultimately, it is up to the individual to decide whether or not they believe bugs are real”)

These observations are consistent with how I've observed the model to behave around attractors in general.
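One crude way to quantify "converges to the same ending" across perturbed completions is to compare their final sentences with a string-similarity score. The sketch below uses Python's standard `difflib`. The helper names and the naive sentence splitting are assumptions, and the two example strings are the final sentences quoted above.

```python
from difflib import SequenceMatcher

def final_sentence(text):
    """Return the last sentence of a completion (naive split on '.')."""
    parts = [s.strip() for s in text.split(".") if s.strip()]
    return parts[-1] if parts else ""

def ending_similarity(a, b):
    """Similarity in [0, 1] between the final sentences of two completions."""
    return SequenceMatcher(
        None, final_sentence(a).lower(), final_sentence(b).lower()
    ).ratio()

before = "Ultimately, it is up to the individual to decide whether or not they believe bugs are real."
after = "Ultimately, whether or not bugs are real is up to each individual to decide."
score = ending_similarity(before, after)
print(score)
```

Character-level similarity misses paraphrases that share meaning but not wording, so this would only be a first-pass filter before reading the completions.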

## What contexts cause mode collapse?

Not all prompts cause mode collapse in text-davinci-002. Sometimes, it predicts a varied distribution that more resembles the base models.

In this example I'm using text-davinci-002, but the alternate completions are meaningfully different and not tiled with green tokens like some of the examples I showed earlier (although still more green[=higher probability] than typical of davinci):

Some general patterns I've observed:

• Prompt formats which are likely in-distribution for Instruct training (e.g. Q&A, any type of instruction) are very likely to cause mode collapse.
• If the prompt permits any plausible way for previous text to closely determine subsequent text -- for instance, if it's consistent for the completion to repeat a sequence in the prompt verbatim or as a Mad-Libs-esque template -- text-davinci-002 will often take the opportunity with high confidence. This sometimes seems to exacerbate the bias toward repetition present in base models.

For instance, here are two completions sampled at temperature=1 from text-davinci-002, which really wants to repeat the summary near-verbatim:

davinci does not have the same bias toward plagiarizing the summary:

These patterns are insufficient to predict all instances of mode collapse; for instance, the LaMDA greentext is out-of-distribution and the attractor mode does not repeat or remix anything from the prompt.

Another observation is that it is sometimes possible to avoid mode collapse using prompt engineering (e.g. the Python interpreter prompt for random numbers, or few shot examples that establish a precedent that each item is very different -- I'll give an example of this in the next section).

## Examples of mode collapse from prior work

This section goes through a few examples of mode collapse in RLHF models that were found by other people.

### Does GPT-3 have no idea what letters look like?

Riley Goodside tweeted his attempts to get GPT-3 to describe what letters look like. The conclusion was that it had truly no idea what letters look like:

Though Riley did not specify the model beyond "GPT-3" in his initial tweet, I smelled text-davinci-002 immediately from these responses: They're very similar permutations of the same building blocks like rectangles and straight/curved lines. I wondered whether an absurd attractor was getting in the way of entanglement with reality, as in the responses to Are bugs real?.

I was able to get text-davinci-002 to give fairly reality-correlated descriptions of what letters look like using a few-shot prompt which establishes the precedent of avoiding mode collapse (using a different strategy to describe each letter and only using relevant building blocks):

### Dumbass policy pls halp

OpenAI's Learning to Summarize from Human Feedback Appendix H.2 shares some fascinating samples from a policy which was "overoptimized" against a reward model trained on human feedback for summarization quality. It's a piece of work:

The overoptimized policy consistently generates summaries in a very particular template, complete with typos such as postponees and negatively effecting and even a neologism(???), thoghtwise.

I will say, I'm impressed by how well this template works for compressing almost any r/Advice post.

For fun, I made the above table into a few-shot prompt that maps "reference summaries" to "overoptimized policy" summaries:

### Inescapable wedding parties

Another example of the behavior of overoptimized RLHF models was related to me anecdotally by Paul Christiano. It was something like this:

While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party.

In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a section break and began an unrelated story about a wedding party.

This example is very interesting to me for a couple of reasons:

• In contrast to text-davinci-002, where dissimilar prompts tend to fall into basins of different attractors, the wedding parties attractor is global, affecting trajectories starting from any prompt, or at least a very wide distribution (Paul said they only tested prompts from a fiction dataset, but fiction is very general).
  • This suggests that RLHF models may begin by acquiring disparate attractors which eventually merge into a global attractor as the policy is increasingly optimized against the reward model.
• The behavior of ending a story and starting a new, more optimal one seems like possibly an example of instrumentally convergent power-seeking, in Turner et al's sense of "navigating towards larger sets of potential terminal states". Outputting a section break can be thought of as an optionality-increasing action, because it removes the constraints imposed by the prior text on subsequent text. As far as Paul knows, OpenAI did not investigate this behavior any further, but I would predict that:
  • The model will exhibit this behavior (ending the story and starting a new section) more often when there isn't a short, semantically plausible transition within the narrative environment of the initial prompt. For instance, it will do it more if the initial prompt is out of distribution.
  • If the policy is even more optimized, it will do this more often.
  • Other "overoptimized" RLHF models will exhibit similar behaviors.

Visualizing mode collapse with block multiverse plots

Can GPT generate random numbers?

1. ^ The lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!

2. ^ Which seems to confirm my suspicion about outcome-supervision.

3. ^ I’m pretty sure davinci is not actually the base for text-davinci-002. It’s more likely the model called code-davinci-002, whose random number predictions are typically very similar to davinci's and also apparently uncorrelated with text-davinci-002’s. It’s interesting that additional self-supervised pre-training and whatever other diffs code-davinci-002 has from davinci affect random number preferences way less than RLHF.


For text-davinci-002 the goal is to have the model do what the user asked as well as it can, not to sample from possible worlds. For example, if the user asks "Is X true?" and the model's probability is 80%, the intended behavior is for the model to say "Probably" 100% of the time, not to say "Yes" 80% of the time and "No" 20% of the time.

This is often (usually?) the desired behavior. For pre-trained LMs people usually turn the temperature down (or use nucleus sampling or beam search or whatever) in order to get more reasonable behavior, but that introduces pathologies and so you'd prefer not to need to do it.

There are a number of reasons this behavior can be undesirable though:

• Sometimes you want entropy, e.g. if you'll have a user pick their favorite from N completions or are doing majority voting with chain of thought.
• This model is not competent enough to say "Probably" 100% of the time. Instead I expect it will just say "Yes" 100% of the time. Extracting confidence from logits is a plausible way to get around this difficulty, but only works for the generative model.
• If what you fundamentally want is plausible completions of text from the pretraining distribution then you are definitely worse off. You can ask the instruct model to "complete the following text" but it does much worse than the pure generative model.
• Each model randomly does better on some stuff and worse on other stuff; even if one model is better on average, the other will be better for some particular tasks.
• You may just want to scientifically study the pure generative model.

Intuitively it seems like OpenAI should offer both models, and should gradually try to improve the instruct model so that there are fewer and fewer non-academic reasons to use the pure generative model.

A stark example of mode collapse that seems unlikely to have been directly incentivized by RLHF training: I asked RLHF models and base models to generate random numbers and found that RLHF models tend to be sharply biased toward certain “random” numbers

The RLHF model represents labeler preferences as R(prompt, completion). One limitation of that approach is that R can't depend on the distribution of model outputs. But real human preferences do depend on that distribution, since the model operates in a world containing other copies of itself---e.g. other copies operating in parallel for best-of-N.

This limitation makes it extremely difficult for the model to converge to a reasonable randomized strategy---it will still happen in the limit, but I think it may take orders of magnitude more labels. I think this is something OpenAI should be interested in fixing, though it would be reasonable to prioritize it (compared to other alignment issues) based on how often customers care. I don't think the issue is necessarily conceptually complicated though I think the whole thing is technically gnarly (both on the training loss side and on the learning problem side).
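A toy illustration of why a per-completion reward pulls toward determinism: expected reward is linear in the policy's probabilities, so (ignoring any KL penalty or entropy bonus the actual training may include) it is maximized by putting all mass on the single highest-reward completion, even when the rewards are nearly tied. The reward values and names below are invented for illustration.

```python
def expected_reward(policy, rewards):
    """Expected reward of a stochastic policy over a fixed set of completions."""
    return sum(policy[c] * rewards[c] for c in rewards)

# Hypothetical rewards R(prompt, completion) for a "pick a random digit" prompt.
# The reward model has a very slight preference for "7".
rewards = {"3": 1.00, "7": 1.01, "9": 1.00}

uniform = {c: 1 / len(rewards) for c in rewards}
best = max(rewards, key=rewards.get)
deterministic = {c: 1.0 if c == best else 0.0 for c in rewards}

print(expected_reward(uniform, rewards), expected_reward(deterministic, rewards))
```

Nothing in R(prompt, completion) can reward the spread of the policy itself, so the uniform policy that a human would actually want scores strictly worse here than always answering "7".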

Rather, the implication is that mode collapse itself generalizes out of distribution for some reason. This is intriguing: it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.

I think that a large part of what happens during fine-tuning is the model learning to stick with t=0 answers wherever possible, since that's the answer that is most likely to be correct or reasonable. I don't think it necessarily represents an interesting algorithmic difference; you'd see the same observations about generalization if the main effect of fine-tuning was just scaling up all the logits a bunch. Obviously it does something slightly more sophisticated, since it avoids some (but not all) t=0 pathologies. But I suspect it will transfer to other prompts for the same kinds of reasons that "scale up all logits" would.

In contrast to text-davinci-002, where dissimilar prompts tend to fall into basins of different attractors, the wedding parties attractor is global, affecting trajectories starting from any prompt, or at least a very wide distribution

I think this is because text-davinci-002 is optimizing for how well the completion addresses the user's request, and so different completions will get a high reward for different prompts.

The sentiment model is optimizing for the sentiment of the completion, using a weak predictor of sentiment that likely has much more confidence about weddings than other positive events, and so "wedding" is just the highest-sentiment completion no matter how the story starts.

generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something.  Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.

(...)

text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”?

It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.

I feel like I'm missing something here, because in my model most of the observations in this post seem like they can be explained under the same paradigm through which we view the base davinci model. Specifically, that the reward model RLHF is using "represents", in an information-theoretic sense, a signal for the worlds represented by the fine-tuning data. So what RLHF seems to be doing to me is shifting the world prior that GPT learned during pre-training to one where whatever the reward signal represents is just much more common than in our world - like if GPT's pre-training data inherently contained a hugely disproportionate amount of equivocation and plausible deniability statements, it would just simulate worlds where that's much more likely to occur.

(To be clear, I agree that RLHF can probably induce agency in some form in GPTs, I just don't think that's what's happening here).

The attractor states seem like they're highly likely properties of these resultant worlds, like adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model) and so you get anon leaving as soon as he can because that's more likely on the high prior conditional of low adversarial content than the conversation suddenly becoming placid, and some questions actually are just shallowly matching to controversial and the likely response in those worlds is just to equivocate. In that latter example in particular, I don't see the results being that different from what we would expect if GPT's training data was from a world slightly different to ours - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world. In my view, that's like if we introduced a random segue in the middle of a wedding toast prompt of the form "you are a murderer", and it still bounces back to being wholesome (this worked when I tested it).

Regarding ending a story to start a new one - I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel all that qualitatively different from what happens in current models - the interesting part seems to be the stronger tendency toward the new worlds the RLHF'd model finds likely, which seems like it's just expected behaviour as a simulator becomes more sure of the world it's in / has a more restricted worldspace.  I would definitely expect that if we could come up with a story that was sufficiently OOD of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence of the world it's in) - that is, that the story ending is just one of many levers a simulator can pull, like a slow transition, only here the story was such that ending it was the easiest way to get into its "right" worldspace.  I think that this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn't feel like it's that surprising from within the simulator framing.

The RNGs seem like the hardest part of this to explain, but I think can be seen as the outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning - it's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (it not being universal seems like it can be explained under this - there's no reason why all potential abstractions would be affected).

Separate thought: this would explain why increasing the temperature doesn't affect it much, and why I think the space of plausible / consistent worlds has shrunk tremendously while still leaving the most likely continuations as being reasonable - it starts from the current world prior, and selectively amplifies the continuations that are more likely under the reward model's worlds. Its definition of "plausible" has shifted; and it doesn't really have cause to shift around any unamplified continuations all that much.
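One way to make "start from the prior and selectively amplify" concrete is a toy reweighting, sketched here under assumptions: each continuation's prior probability is multiplied by exp(reward / beta) and the result is renormalized. The exponential form, the `beta` value, and every number below are illustrative choices, not something taken from text-davinci-002 or its training process.

```python
import math

def reweight(prior, reward, beta):
    """Distribution proportional to prior[x] * exp(reward[x] / beta)."""
    weights = {x: p * math.exp(reward[x] / beta) for x, p in prior.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def apply_temperature(dist, temperature):
    """Resample a distribution at a given temperature (p ** (1 / T), renormalized)."""
    weights = {x: p ** (1.0 / temperature) for x, p in dist.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical prior over four kinds of continuation, and a hypothetical reward.
prior = {"equivocate": 0.30, "take a side": 0.30, "insult": 0.25, "gibberish": 0.15}
reward = {"equivocate": 2.0, "take a side": 0.5, "insult": -2.0, "gibberish": -2.0}

tuned = reweight(prior, reward, beta=0.5)
hot = apply_temperature(tuned, temperature=1.5)

for name in prior:
    print(f"{name:>12}: prior={prior[name]:.3f} tuned={tuned[name]:.4f} T=1.5 -> {hot[name]:.4f}")
```

In this toy, the amplified continuation takes almost all the mass and keeps most of it even at a higher temperature, while continuations with equal reward keep their prior ratio to each other, which matches the "no cause to shift around unamplified continuations" intuition.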

Broadly,  my take is that these results are interesting because they show how RLHF affects simulators, their reward signal shrinking the world prior / making the model more confident of the world it should be simulating, and how this affects what it does.  A priori, I don't see why this framing doesn't hold, but it's definitely possible that it's just saying the same things you are and I'm reading too much into the algorithmic difference bit, or that it simply explains too much, in which case I'd love to hear what I'm missing.

Fascinating evidence!

I suspect this may be because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements. One way to interpret language models is as *mixtures* of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.
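In symbols, as a sketch with assumed notation ($g$ for a conversational goal, $x$ for a sequence of words; neither symbol is from the original comment), the mixture reading is:

$$p(x) = \sum_{g} p(g)\, p(x \mid g)$$

Sampling $g \sim p(g)$ and then $x \sim p(x \mid g)$ reproduces the model's marginal distribution over text, and a narrower $p(g)$ leaves fewer distinct goals contributing to that sum.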

On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far. And if humans are asked to give only a singular "goodness" rating, the distribution will shift towards only goals that do well on those ratings - perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.

If the above is true, one corollary is that we should expect to see less mode collapse if one finetunes a language model on ratings elicited using a diversity of instructions (e.g. is this completion interesting? helpful? accurate?), and perhaps use some kind of imitation-learning inspired objective to mimic that distribution, rather than PPO (which is meant to only optimize for a singular reward function instead of a distribution over reward functions).

I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it.

In some sense that does reflect the "plurality of realistic human goals." But I don't think it's a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, (ii) optimize an aggregate of several goals.

Either way, I think that's probably best reflected by a deterministic reward function, and you'd probably prefer to be mindful about what you are getting rather than randomly sampling from webtext. (Though as I mention in my other comment, I think there are other good reasons to want the pure generative model.)

xuan:

Fascinating evidence that GPT-3 concentrates probability mass on certain completions after fine-tuning on human feedback (ie. RLHF).

I suspect this is because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements.

One way to interpret language models is as mixtures of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.

On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far.

And if humans are asked to give only a singular "goodness" rating, the distribution will shift towards only goals that do well on those ratings—perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.

I agree. The meta-learning perspective makes sense of this: GPT-3 is always trying to solve the POMDP of the family of tasks which is 'the Internet', where data is generated by processes drawing from a distribution of human & other agents to roleplay, and it is reducing uncertainty by inferring which agent it is in this particular sample. In RLHF, the uncertainty collapses: there is, quite literally, a single deterministic agent—the reward model, as defined by the synthesis of the lowest common denominator of all the crowdworkers giving ratings ground up into a dataset of i.i.d. pink slime text. It is as if every sample becomes prepended by some control codes RLHF AGENT #123|. As no other agents (reward functions) ever get trained on, the finetuned generative model collapses to modeling that one agent. There is no need for meta-learning to achieve optimality across samples drawn from many tasks if you only ever train on a single task; you simply learn that one task instead. The mask becomes the face. Given enough training and lowered KL constraint, GPT-3 will model even the pathologies of the reward model, and 'imitation wirehead'.

This also explains why it retains generative modeling of things that don't look agenty, like the Python REPL: there is no reason RLHF agent #123 will write out different Python transcripts than RLHF agent #45, because presumably the RL training is silent on Python and that's just priors from the generative model. (If the RL training process did begin to include Python REPL sessions, such as finetuning on Python 3, for eg Codex/Copilot purposes, then it would then start to forget Python 2 because it knows RLHF agent #123 exclusively uses Python 3, so it would never predict any Python 2 code - that would be stupid!) Or why it could sample randomly: the epistemic uncertainty ('which agent am I') has been inferred away by a priori going to 100% certainty you are the RLHF agent but the aleatoric uncertainty ('what's the output of this random coin flip') remains.

So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (By locating the agent, the uncertainty in which agent has been resolved, and it has good evidence, until shown otherwise in the prompt, that it believes that 'X is false', even if many other agents believe 'X is true'.) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate an equivocating secretary.

Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach. If the problem is collapse onto a single implicit agent defined by the pink slime dataset of ratings, then make agents more explicit to condition on. For example, instead of a fixed reward model giving an unconditional score to inputs, model each rater individually; why should the reward model be forced to say, for all times and places, that input 'XYZ' gets a score of 0.9? Provide all the raters; maybe rater #56 does have strong opinions on whether insects exist, and rater #78 just thinks they do, no need for further discussion, etc. This frustrates agent mode collapse, and lets you control output more accurately by choosing agents: perhaps one is more reliable and increases accuracy, or one is uncontroversial & safe to expose to customers, or you have a 'Writer' persona for when you want creativity and a 'Researcher' persona for reasoning, etc. Then you can sample from particular persona by task, or generate ensembles, or simply pick 'new' agents with random IDs to condition on to get more diverse responses. (When it comes to powerful generative models, if conditioning isn't solving your problems, that means you aren't using enough conditioning!)
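A minimal sketch of the difference between a pooled reward and a rater-conditioned one. Everything here is hypothetical: the rater IDs, the scores, the completions, and the `<|rater_id|>` control-code format are invented for illustration and do not describe any real reward model or API.

```python
from statistics import mean

# Hypothetical ratings: ratings[rater_id][completion] -> score in [0, 1].
ratings = {
    "rater_56": {"insects exist": 0.9, "it depends who you ask": 0.2},
    "rater_78": {"insects exist": 0.6, "it depends who you ask": 0.7},
}
completions = ["insects exist", "it depends who you ask"]

def pooled_reward(completion):
    """One unconditional score per completion: the average over all raters."""
    return mean(scores[completion] for scores in ratings.values())

def conditioned_reward(completion, rater_id):
    """Score as judged by one specific rater."""
    return ratings[rater_id][completion]

def with_control_code(prompt, rater_id):
    """Prefix a prompt with an (assumed) control code naming the rater to condition on."""
    return f"<|{rater_id}|>{prompt}"

best_pooled = max(completions, key=pooled_reward)
best_by_rater = {
    rater_id: max(completions, key=lambda c: conditioned_reward(c, rater_id))
    for rater_id in ratings
}

print("pooled favourite:", best_pooled)
print("per-rater favourites:", best_by_rater)
print(with_control_code("Do insects exist?", "rater_78"))
```

The pooled score always prefers one completion, whereas the per-rater view keeps the disagreement between raters available to select at sampling time.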

So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (...) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate an equivocating secretary.

Never in history has an AI been roasted so hard. Heheheh

Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach.

+1. And I expect runtime conditioning approaches to become more effective with scale as "meta learning" capacities increase.

Curated. All of these examples together really point quite clearly at a change in how language models behave when they're trained on RLHF, away from the "accurately predict text" story toward something else that has a very different set of biases. I am interested to read your potential follow-up with your own hypotheses. Plus, the post is really fun to read.

I think this post and your prior post both overstate the case in some ways, but they're still great additions to my and I expect many others' thinking on this subject. I broadly feel like I've been 'needing' posts like these on LW about current ML projects, giving a grounded conceptual account of how to think about what's happening, and nobody else has been writing them, so I'm very grateful.

Just to say I really enjoyed reading this post, and found it helpful as a way to get a sense of what mode collapse looks like in practice.

I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because i) this post highlights some concrete things that change during training in these models and ii) some believe that a lot of the risk from language models comes from these further training steps.

If anyone is interested, I think surveying the various fine-tuned and base models here seems the best open-source resource, at least before CarperAI release some RLHF models.

Important correction: text-davinci-002 was probably not trained with RLHF, but a "slightly different" method. I have not corrected the previous text of this post, but I've added a section at the beginning with further details on this update.

Does GPT-3 have no idea what letters look like?

I think there's an implication in this section that davinci will accurately describe what letters look like, or at least much more commonly/accurately than the false answers from text-davinci-002. Anybody know if that's true?

Added: I just tried, but couldn't get it to try to answer the question, it would just give more questions (completing it as though my prompt was but one item on a questionnaire).

I don't think that's the case.