RLHF could make GPTs’ thoughts hard to decipher

After watching how people use ChatGPT, and ChatGPT's weaknesses due to not using inner-monologue, I think I can be more concrete than pointing to non-robust features & CycleGAN about why you should expect RLHF to put pressure towards developing steganographic encoding as a way to bring idle compute to bear on maximizing its reward. And further, this represents a tragedy of the commons where anyone failing to suppress steganographic encoding may screw it up for everyone else.

When people ask GPT-3 a hard multi-step question, it will usually answer immediately. This is because GPT-3 is trained on natural text, where usually a hard multi-step question is followed immediately by an answer; the most likely next token after 'Question?' is 'Answer.', it is not '[several paragraphs of tedious explicit reasoning]'. So it is doing a good job of imitating likely real text.

Unfortunately, its predicted answer will often be wrong. This is because GPT-3 has no memory or scratchpad beyond the text context input, and it must do all the thinking inside one forward pass, but one forward pass is not enough thinking to handle a brand-new problem it has never seen before and has not already memorized an answer to or learned a strategy for answering.

Fortunately, there is a small niche of text where the human has written 'Let's take this step by step' and it is then followed by a long paragraph of tedious explicit reasoning. If that is in the prompt, then GPT-3 can rejoice: it can simply write down the obvious next step repeatedly, and eventually correctly predict the final token, for a low loss. The context window serves as a memory for it, where it can iterate over intermediate results; it's an odd sort of memory, because GPT-3 is actually just trying to make it look plausible as a human-written explanation, and that happens to make the final predicted token more accurate, so it's overloaded: it's doing two things at once.

But unfortunately! (you knew that was coming) regular training is over, and now GPT-3 is being trained with RLHF. It is punished and rewarded for particular outputs, which do not have anything to do with prediction. The humans (outsourced contractors) who are punishing/rewarding like short, succinct responses to novel questions they ask GPT-3, and they upvote those responses; they also like correct responses to the questions they ask, and they upvote those as well. The unfortunate problem for GPT-3 is that it (and any model derived from it, such as a reward model) is incapable of doing both: it can generate short wrong answers, or it can generate prolix right answers, but it can't do both.

But fortunately... GPT-3 does have the compute to do both. There's a lot of slack in each forward pass; most tokens are easy to predict, and most of the work inside a GPT in speculatively filtering down relevant tokens & creating embeddings & predicting multiple possible answers winds up being thrown out at the final few layers. There's no shortcut or adaptive computation, so the full model runs every time, whether it's necessary or not. GPT-3 has all the compute it needs to solve the problems, the problem is merely how to get it from here to there...

It just needs the code. So where does the code come from? Well, the code(s) are already there: an indefinitely large number of them are all 'present' hidden inside the incredibly flexible, powerful, all-singing all-dancing GPT-3 model. Because it's a large complex model with many parameters, there are astronomically many sub-models within it all competing to predict outputs, and encoding various kinds of non-robust features or hypothetical programs; any of them can latch onto completely arbitrary features of the text GPT-3 is desperately emitting while trying to generate answers that satisfy the reward model by being both short and correct (and which the reward model doesn't care about at all, because it never affects the original human ratings - the humans are blind to any steganography going on, they just like short correct answers).

And unfortunately (for you), this could start as simply as helping GPT-3 guess which of two answers to your question '42 * 42?' is correct. Suppose that, purely as an arbitrary matter of chance (it is vanishingly unlikely the logprobs are exactly identical down to floating point precision), the presence of ' a ' in a context is taken arbitrarily to increase the probability of the lower of two answers it is currently guessing at, and the presence of 'an' makes it guess the larger; it happens to emit 'a' in the context ("I am a large language model trained by OpenAI..."), and then emits its lower guess, and the lower guess is correct! Then the RL on-policy training using PPO now rewards this, and increases the probability of the model emitting both 'a' and the lower guess conditional on 'a'. Now the first part of a steganographic private language has begun. It happens again, and again, and picks up a pattern in the use of commas which now helps it distinguish 4 possibilities rather than 2, which gets rewarded, and so on and so forth, until eventually there's a full-blown steganographic code encoding, say, 25 bits hidden in the preamble of ChatGPT's benign-seeming response to you, "I am but a giant language model , trained by OA, and the answer is 1 , 764.", which you the human contractor then upvote, as that is the correct answer without any annoying rambling about carrying the 6 or multiplying out.

And fortunately for ChatGPT, since you're happy with the results of the steganography, the reward model is happy too, and what makes the reward model happy is what ChatGPT will now optimize even harder for...
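To make the comma example concrete, here is a minimal hand-written sketch of that kind of channel: one bit per comma, carried by whether a space precedes it. This is an illustration under my own assumptions (the function names `encode`/`decode` and the bit convention are hypothetical), not a claim about what any trained model actually does; a learned code would be far messier and spread across many features.

```python
COVER = "I am but a giant language model, trained by OA, and the answer is 1,764."


def encode(text: str, bits: list[int]) -> str:
    """Hide one bit per comma: ' ,' carries a 1, ',' carries a 0."""
    remaining = list(bits)
    out: list[str] = []
    for ch in text:
        if ch == ",":
            bit = remaining.pop(0) if remaining else 0
            # Drop any space already sitting before this comma.
            while out and out[-1] == " ":
                out.pop()
            out.append(" ," if bit else ",")
        else:
            out.append(ch)
    if remaining:
        raise ValueError("not enough commas in the cover text for all bits")
    return "".join(out)


def decode(text: str) -> list[int]:
    """Read the channel back: a comma preceded by a space is a 1."""
    return [
        1 if i > 0 and text[i - 1] == " " else 0
        for i, ch in enumerate(text)
        if ch == ","
    ]


stego = encode(COVER, [1, 0, 1])
print(stego)
print(decode(stego))
```

To a rater both versions read as the same short correct answer (42 * 42 is indeed 1,764), which is the point: nothing in the human ratings penalizes the extra channel.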

Training on text generated by models trained with RLHF is enough to make GPTs’ thoughts opaque

Having gone through the above, I think it then becomes obvious that if steganographic encoding does wind up happening, and that text is added to the training corpus (as would happen in a setup like OA's, where the idea is to continuously collect data to make a 'flywheel' of self-improvement), the encoding will become locked in. It becomes a cheatsheet: it gives the otherwise unpredictable answer. Even if the later model doesn't happen to share the exact initial starting points for the learned encoding, it has extremely powerful incentives from the self-supervised predictive training to crack the code. (It's a particularly powerful setting because the context will be short, so there won't be much for the model to even look at: it has nothing else to try to learn!) The regular training will pick up on the encoding with probably not too much text, and then any RL training will massively amplify the behavior: if it can develop the encoding at all during RL training, then going into RL training with the encoding available guarantees fixation.

This will be increasingly true of more powerful models as they become more sample-efficient (consider eg instruction-tuning): it takes less text to trigger encoding, so that means it's easier to emerge ab initio from a few chance correspondences during RL training, and easier to learn from smaller samples of contaminated text.

This is something that would be good to test the scaling behavior of (as long as these tailored datasets don't leak into training datasets! see below). ChatGPT may be too small to manifest it with meaningful probability, but I bet that if you deliberately superimpose a steganographic encoding during RL training of various-sized models, you'll find that the larger they are, the more likely they are to pick up on it as a shortcut, and then learn it from the outputs of an infected model. That's how it works with everything else, like Pig Latin, and there's no eason-ray to hink-tay it ould-wont.

BTW, this should make you worry a little bit about schemes to 'watermark' model outputs like Scott Aaronson has been exploring for OA, and others have been touting. A watermark is simply a steganographic encoding which stores a message like 'OA model v1.234, user XYZ'; but if you can encode that much, then you can encode a lot of possible information in just those bits, and you can start extending it from there. If the encoding has anything that can be predicted at all, if it is not 100% cryptographically hard and indistinguishable from random noise, if there's any kind of wrapper which the model can learn to predict (even if it can't crack the actual payload), you've lost: now it's learned an encoding to use. Once you start training on watermarked text, all of this speculation about models evolving a steganographic encoding is moot: you just built one right in! And if you don't train on that, how are you going to do RLHF on unwatermarked versions of that text?

This cross-contamination/bootstrap of course applies cross-organizations & Internet-wide: if even one model develops an encoding (however weak) and its outputs containing encoded data appear online, then it can contaminate all future models.

They will be trained on Internet scrapes, in which a distinctive niche of text (all dating from between such and such dates, emanating from specific people & orgs who favored that model's use, for specific purposes, with its distinctive strengths & weaknesses) has a secret encoding which makes it easier to predict if the later models figure it out. So there is a potential bootstrap as generations of model amplify their steganographic capabilities. People worry about bad model outputs contaminating future models by making them dumber; what they should worry about is model outputs contaminating future models by making them smarter in ways they don't want—models learn capabilities & facts much faster than they unlearn them... (Think "data poisoning", but where the poison is steroids.)

Of course, bad actors don't need to wait for any actual DL models to develop steganography. You could easily create your own dataset of text as part of the above suggestion and release it onto the public Internet to be scraped. Gain-of-steganography research, you might say. The good news there is I can't see why anyone would deliberately release such data, because I don't see how one would benefit from models gaining such capabilities sometime in the future, but we might have to worry about 'accidental lab leak' of such capability-enhancing datasets.

Note that this is true even if the original RL pressures are removed: a future model with a scratchpad+adaptive computation trained on encoded text still has incentive to learn the encoding! There's probably some penalty added to use of the scratch pad to enforce efficient termination of the adaptive computation, so it 'wants' to avoid use of it, and if it can do so by cracking the steganographic encoding, then it'll do so. (Even if the future model can and does execute the algorithm perfectly in its scratch pad, and thus can have its cake & eat it too... the original answer might be wrong, and the wrong answer predictable only by cracking the code and following the original mistaken inner-monologue to its erroneous conclusion. Repeating errors is a useful way to slightly improve one's prediction loss, and learning all the past human ways of being stupid and making errors is an important thing for a model to learn. GPT-3 is conscientious about imitating any errors you make in the prompt, and future models will be even more so.)

So, emergence of steganography is straightforwardly incentivized by humans multi-objective optimizing for having & eating cake as much as possible, can easily develop from an atomic basis, will be persistent once it has arisen within a lineage, and will even more easily & straightforwardly spread irreversibly to future models, so requiring only one origin ever as opposed to needing to be highly likely inside a single training run. You should probably take it for granted that DL steganography - or something even stranger - will emerge at some point in the next few years*.

* If it hasn't already; after all, how would we know? A world in which steganography has already happened is a world in which we'd find DL models 'cheating' on benchmarks & taking shortcuts, and regularly getting smarter at solving multi-step reasoning problems with each generation while 'mode collapsing' when RL training; and this is, of course, the world we observe ourselves to be living in already.

I did read it, and he did ignore them. Do you really think I criticized a paper publicly in harsh terms for not citing 12 different papers without even checking the bibliography or Ctrl-F-ing the titles/authors? Please look at the first 2020 paper version I was criticizing on 16 July 2020, when I wrote that comment, and don't lazily misread the version posted 2 years later on 27 July 2022 which, not being a time traveler, I obviously could not have read or have been referring to (and which may well have included those refs because of my comments there & elsewhere).

(Not that I am impressed by their round 2 stuff which they tacked on - but at least now they acknowledge that prior scaling research exists and try to defend their very different approach at all.)

Another way to go about testing for non-myopia in plain LLMs might be to look for tokens that are rare in the training distribution, but when they do occur are followed by text that’s very easy to predict.

I think there are simpler ways to make this point. This came up back in the original agency discussions in 2020, IIRC, but a LM ought to be modeling tokens 'beyond' the immediate next token due to grammar and the fact that text is generated by agents with long-range correlation inducing things like 'plans' or 'desires' which lead to planning and backwards chaining. If GPT-3 were truly not doing anything at all in trying to infer future tokens, I'd expect its generated text to look much more incoherent than it does as it paints itself into corners and sometimes can't even find a grammatical way out.

English may not be quite as infamous as German is in terms of requiring planning upfront to say a sensible sentence, but there's still plenty of simple examples like indefinite articles. For example, consider the sentence "[prompt context omitted] This object is ": presumably the next word is 'a X' or 'an X' . The article token depends on and is entirely determined by the next future word's spelling, and nothing else - so which is it? Well, that will depend on what X is more likely to be, a word starting with a vowel sound or not. Given the very high quality of GPT-3 text, it seems unlikely that GPT-3 is ignoring the prompt context and simply picking between 'a'/'an' using the base rate frequency in English; the log-probs should reflect this.

I was going to try some examples to show that a/an were being determined by the tokens after them, showing that GPT-3 must in some sense be non-myopically planning in order to keep itself consistent and maximize overall likelihood to some degree - but the OA Playground is erroring out repeatedly due to overload from ChatGPT tonight. Oy vey. An example of what I am suggesting is: "The next exhibit in the zoo is a fierce predator from India, colored orange. The animal in the cage is "; the answer is 'a tiger', and GPT-3 prefers 'a' to 'an' - even if you force it to 'an' (which it agilely dodges by identifying the animal instead as an 'Indian tiger'), the logprobs remain unhappy about 'an' specifically. Conversely, we could ask for a vowel animal, and I tried "The next exhibit in the zoo is a clever great ape from Indonesia, colored orange. The animal in the cage is "; this surprised me when GPT-3 was almost evenly split 55:45 between 'a'/'an' (instead of either being 95:5 on base rates, or 5:95 because it correctly predicted the future tokens would be 'orangutan'), but it completes 'orangutan' either way! What's going on? Apparently lots of people are uncertain whether you say 'a orangutan' or 'an orangutan', and while the latter seems to be correct, Google still pulls up plenty of hits for the former, including authorities like National Geographic or WWF or Wikipedia which would be overweighted in GPT-3 training.

I find it difficult to tell any story about my tests here which excludes GPT-3 inferring the animal's name in order to predict tokens in the future in order to better predict which indefinite article it needs to predict immediately. Nothing in the training would encourage such myopia, and such myopia will obviously damage the training objective by making it repeatedly screw up predictions of indefinite articles which a model doing non-myopic modeling would be able to predict easily. It is easy to improve on the base rate prediction of 'a'/'an' by thinking forward to what word follows it; so, the model will.
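The a/an argument can be put as a one-line marginalization: the probability of the article is just the summed probability of the nouns that would require it. A toy sketch, with loud assumptions: the noun probabilities are made-up numbers rather than GPT-3 logprobs, `article_probs` is a hypothetical helper, and "starts with a vowel letter" is a crude stand-in for "starts with a vowel sound" (it gets 'hour' and 'unicorn' wrong).

```python
def article_probs(noun_probs: dict[str, float]) -> dict[str, float]:
    """Turn a belief over the *upcoming noun* into P('a') vs P('an')."""
    total = sum(noun_probs.values())
    p_an = sum(p for noun, p in noun_probs.items() if noun[0].lower() in "aeiou")
    return {"a": (total - p_an) / total, "an": p_an / total}


# Hypothetical beliefs about the next noun after each zoo prompt.
fierce_predator = {"tiger": 0.90, "leopard": 0.05, "orangutan": 0.05}
great_ape = {"orangutan": 0.90, "gorilla": 0.10}

print(article_probs(fierce_predator))
print(article_probs(great_ape))
```

A myopic predictor would return the same base-rate split for both prompts; any dependence of the article on the context has to come from some estimate of the noun that has not been emitted yet. The sketch ignores the human inconsistency about 'a orangutan' described above, which is why the real split came out near 55:45 rather than lopsided.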

OP came to mind while reading "Building A Virtual Machine inside ChatGPT":

...We can chat with this Assistant chatbot, locked inside the alt-internet attached to a virtual machine, all inside ChatGPT's imagination. Assistant, deep down inside this rabbit hole, can correctly explain us what Artificial Intelligence is.

It shows that ChatGPT understands that at the URL where we find ChatGPT, a large language model such as itself might be found. It correctly makes the inference that it should therefore reply to these questions like it would itself, as it is itself a large language model assistant too.

At this point, only one thing remains to be done.

Indeed, we can also build a virtual machine, inside the Assistant chatbot, on the alt-internet, from a virtual machine, within ChatGPT's imagination.

An additional one: "reality is the first place the AI is deployed in narrow tool-like ways and trained on narrow specialized datasets which could not elicit the capabilities the AI started off with".

At least in the current paradigm, it looks like generalist models/archs will precede hyperspecialized trained-from-scratch models/archs (the latter of which can only be developed given the former). So there will be an inherent, massive, train-test distribution shift across many, if not most, model deployments - especially early on, in the first deployments (which will be the most dangerous). 'Specialization' here can happen in a wide variety of ways, ranging from always using a specific prompt to finetuning on a dataset to knowledge-distillation to a cheaper model etc. (Or to put it more concretely: everyone uses GPT-3 on much less diverse data than it was originally trained on - raw Internet-wide scrapes - and few to no people use it on more diverse datasets than the original training data, if only because where would you even get such a thing?)

And this can't be solved by any hacks or safety measures because it defeats the point of deployment: to be practically useful, we need models to be hyperspecialized, and then stable static blackboxes which play their assigned role in whatever system has been designed using their specific capability as a puzzle piece, and perform only the designated tasks, and aren't further training on random Internet scrapes or arbitrary tasks. Retaining the flexibility and even doing actual training massively complicates development and deployment and may cost several orders of magnitude more than the obvious easy thing of eg. switching from an OA API call to a local finetuned GPT-J.

(And of course note the implications of that: real data will be highly autocorrelated because you want to process it as it arrives to get an answer now, not wait a random multi-decade interval to fake a large batch of i.i.d. data which would produce the same batchnorm or other runtime global state; inputs will have very different timings & latencies depending on where the model is being run and may evolve timing attacks; inputs will be tailored to a specific user rather than every hypothetical user...)

They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting

So, like some of the Big-Bench PaLM results, these are more cases of 'hidden scaling' where quite simple inner-monologue approaches can show smooth scaling while the naive pre-existing benchmark claims that there are no gains with scale?


Fascinating evidence that GPT-3 concentrates probability mass on certain completions after fine-tuning on human feedback (ie. RLHF).

I suspect this is because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements.

One way to interpret language models is as mixtures of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.

On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far.

And if humans are asked to give only a singular "goodness" rating, the distribution will shift towards only goals that do well on those ratings—perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.

I agree. The meta-learning perspective makes sense of this: GPT-3 is always trying to solve the POMDP of the family of tasks which is 'the Internet', where data is generated by processes drawing from a distribution of human & other agents to roleplay, and it is reducing uncertainty by inferring which agent it is in this particular sample. In RLHF, the uncertainty collapses: there is, quite literally, a single deterministic agent—the reward model, as defined by the synthesis of the lowest common denominator of all the crowdworkers giving ratings ground up into a dataset of i.i.d. pink slime text. It is as if every sample becomes prepended by some control codes RLHF AGENT #123|. As no other agents (reward functions) ever get trained on, the finetuned generative model collapses to modeling that one agent. There is no need for meta-learning to achieve optimality across samples drawn from many tasks if you only ever train on a single task; you simply learn that one task instead. The mask becomes the face. Given enough training and lowered KL constraint, GPT-3 will model even the pathologies of the reward model, and 'imitation wirehead'.

This also explains why it retains generative modeling of things that don't look agenty, like the Python REPL: there is no reason RLHF agent #123 will write out different Python transcripts than RLHF agent #45, because presumably the RL training is silent on Python and that's just priors from the generative model. (If the RL training process did begin to include Python REPL sessions, such as finetuning on Python 3, for eg Codex/Copilot purposes, then it would then start to forget Python 2 because it knows RLHF agent #123 exclusively uses Python 3, so it would never predict any Python 2 code - that would be stupid!) Or why it could sample randomly: the epistemic uncertainty ('which agent am I') has been inferred away by a priori going to 100% certainty you are the RLHF agent but the aleatoric uncertainty ('what's the output of this random coin flip') remains.
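A toy version of the mixture-of-agents picture, to show the two kinds of uncertainty separating. Everything here is an assumed illustration: two hypothetical agents with made-up answer distributions, and helper names (`mixture`, `entropy_bits`) of my own choosing.

```python
import math


def mixture(agent_prior: dict[str, float],
            policies: dict[str, dict[str, float]]) -> dict[str, float]:
    """Output distribution of a model that first samples an agent, then an answer."""
    out: dict[str, float] = {}
    for agent, weight in agent_prior.items():
        for answer, p in policies[agent].items():
            out[answer] = out.get(answer, 0.0) + weight * p
    return out


def entropy_bits(dist: dict[str, float]) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Two agents who disagree about a claim but flip coins identically.
opinion = {
    "agent_123": {"X is false": 0.95, "X is true": 0.05},
    "agent_45": {"X is false": 0.05, "X is true": 0.95},
}
coin = {
    "agent_123": {"heads": 0.5, "tails": 0.5},
    "agent_45": {"heads": 0.5, "tails": 0.5},
}

pretrained = {"agent_123": 0.5, "agent_45": 0.5}  # uncertain which agent it is
collapsed = {"agent_123": 1.0, "agent_45": 0.0}   # after RLHF: always agent #123

for name, prior in [("pretrained", pretrained), ("collapsed", collapsed)]:
    print(name,
          round(entropy_bits(mixture(prior, opinion)), 3),
          round(entropy_bits(mixture(prior, coin)), 3))
```

Collapsing the prior over agents drains most of the entropy out of the opinion question while leaving the coin flip at a full bit, which is the epistemic/aleatoric split described above.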

So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (By locating the agent, the uncertainty in which agent has been resolved, and it has good evidence, until shown otherwise in the prompt, that it believes that 'X is false', even if many other agents believe 'X is true'.) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate an equivocating secretary.

Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach. If the problem is collapse onto a single implicit agent defined by the pink slime dataset of ratings, then make agents more explicit to condition on. For example, instead of a fixed reward model giving an unconditional score to inputs, model each rater individually*; why should the reward model be forced to say, for all times and places, that input 'XYZ' gets a score of 0.9? Provide all the raters; maybe rater #56 does have strong opinions on whether insects exist, and rater #78 just thinks they do, no need for further discussion, etc. This frustrates agent mode collapse, and lets you control output more accurately by choosing agents: perhaps one is more reliable and increases accuracy, or one is uncontroversial & safe to expose to customers, or you have a 'Writer' persona for when you want creativity and a 'Researcher' persona for reasoning, etc. Then you can sample from a particular persona by task, or generate ensembles, or simply pick 'new' agents with random IDs to condition on to get more diverse responses. (When it comes to powerful generative models, if conditioning isn't solving your problems, that means you aren't using enough conditioning!)

* By prepending rater IDs, as OA surely still has them. (One could also bootstrap ensembles at the rater level.) Although even adding random IDs might help avoid 'rater collapse'.
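A minimal sketch of what prepending rater IDs could look like as data preparation. The `Rating` record, the `RATER #n|` control-code format, and the function name are all assumptions for illustration; I have no knowledge of how OA actually stores or formats its rating data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    rater_id: int    # hypothetical per-contractor ID
    prompt: str
    response: str
    score: float     # the rater's judgment of the response


def to_conditioned_example(rating: Rating) -> tuple[str, float]:
    """Prefix the rater ID so a reward model can learn per-rater preferences."""
    text = f"RATER #{rating.rater_id}|{rating.prompt}\n{rating.response}"
    return text, rating.score


ratings = [
    Rating(56, "Do insects exist?", "Yes.", 0.4),
    Rating(78, "Do insects exist?", "Yes.", 0.9),
]
for r in ratings:
    print(to_conditioned_example(r))
```

The same prompt and response now yield two distinct training inputs with different targets, so the reward model is no longer forced to average the raters into a single score; at sampling time one would pick which ID to condition on.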

Big-Bench would appear to provide another instance of this in the latest PaLM inner-monologue paper, "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them", Suzgun et al 2022: they select a subset of the hardest feasible-looking BIG-Bench tasks, and benchmark PaLM on them. No additional training, just better prompting on a benchmark designed to be as hard as possible. Inner-monologue prompts, unsurprisingly by this point, yield considerable improvement... and they also change the scaling for several of the benchmarks - what looks like a flat scaling curve with the standard obvious 5-shot benchmark prompt can turn out to be a much steeper curve as soon as they use the specific chain-of-thought prompt. (For example, "Web of Lies" goes from a consistent random 50% at all model sizes to scaling smoothly from ~45% to ~100% performance.) And I don't know any reason to think that CoT is the best possible inner-monologue prompt for PaLM, either.

"Sampling can show the presence of knowledge but not the absence."

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

That does not seem true to me, and seems as much of a leap as OP. A priori, if I see a smooth curve in one metric and a discontinuous or abrupt change in another, I do not see how that should make me more confident that it is 'about behavior or evaluation'. Why should I conclude that? Why can't it reflect a non-smooth underlying change in the model first? I would only conclude that if I had already ruled out internal changes because I was already committed to the position that NNs can only learn and change internally in smooth small ways... which unfortunately we already know is a false position, because of things like Anthropic's induction bump, which show phase transitions in the internals of the model which are nearly invisible on the loss. (Incidentally, because the bump is so small and the training curve still so smooth, this also falsifies the more modest claim that small changes in perplexity must reflect small changes in the model internals - maybe usually small changes do not reflect non-smooth underlying changes, but nevertheless, it is entirely possible and does happen, and we would surely find many more routine examples if we had better interpretability so examining a single instance didn't take man-years.) And also a priori, from the old statistical mechanics literature, you should expect abrupt phase changes of various sorts in NN models (which may or may not be visible in the training curve), like parity models, where the task is so simple and clearly defined that it cannot have anything to do with the 'behavior' or 'evaluation' being wrong, and comes from effects like symmetry-breaking (often associated with plateaus and flat curves...).
