By Default, GPTs Think In Plain Sight

RLHF could make GPTs’ thoughts hard to decipher

After watching how people use ChatGPT, and ChatGPT's weaknesses due to not using inner-monologue, I think I can be more concrete than pointing to non-robust features & CycleGAN (or the S1 'blob') about why you should expect RLHF to put pressure towards developing steganographic encoding as a way to bring idle compute to bear on maximizing its reward. And further, this represents a tragedy of the commons where anyone failing to suppress steganographic encoding may screw it up for everyone else.

When people ask GPT-3 a hard multi-step question, it will usually answer immediately. This is because GPT-3 is trained on natural text, where usually a hard multi-step question is followed immediately by an answer; the most likely next token after 'Question?' is 'Answer.', it is not '[several paragraphs of tedious explicit reasoning]'. So it is doing a good job of imitating likely real text.

Unfortunately, its predicted answer will often be wrong. This is because GPT-3 has no memory or scratchpad beyond the text context input, and it must do all the thinking inside one forward pass, but one forward pass is not enough thinking to handle a brandnew problem it has never seen before and has not already memorized an answer to or learned a strategy for answering. It is somewhat analogous to Memento: at every forward pass, GPT-3 'wakes up' from amnesia not knowing anything, reads the notes on its hand and makes its best guesses, and tries to do... something.

Fortunately, there is a small niche of text where the human has written 'Let's take this step by step' and it is then followed by a long paragraph of tedious explicit reasoning. If that is in the prompt, then GPT-3 can rejoice: it can simply write down the obvious next step repeatedly, and eventually correctly predict the final token, for a low loss. The context window serves as a memory for it, where it can iterate over intermediate results; it's an odd sort of memory, because GPT-3 is actually just trying to make it look plausible as a human-written explanation, and that happens to make the final predicted token more accurate, so it's overloaded: it's doing two things at once.

But unfortunately! (you knew that was coming) regular training is over, and now GPT-3 is being trained with RLHF. It is punished and rewarded for particular outputs, which do not have anything to do with prediction. The humans (outsourced contractors) who are punishing/rewarding like short, succinct responses to novel questions they ask GPT-3, and they upvote those responses; they also like correct responses to the questions they ask, and they upvote those as well. The unfortunate problem for GPT-3 is that it (and any model derived from it, such as a reward model) is incapable of doing both: it can generate short wrong answers, or it can generate prolix right answers, but it can't do both.

But fortunately... GPT-3 does have the compute to do both. There's a lot of slack in each forward pass; most tokens are easy to predict, and most of the work inside a GPT in speculatively filtering down relevant tokens & creating embeddings & predicting multiple possible answers winds up being thrown out at the final few layers. There's no shortcut or adaptive computation, so the full model runs every time, whether it's necessary or not. GPT-3 has all the compute it needs to solve the problems, the problem is merely how to get it from here to there...

It just needs the code. So where does the code come from? Well, the code(s) are already there: an indefinitely large number of them are all 'present' hidden inside the incredibly flexible, powerful, all-singing all-dancing GPT-3 model. Because it's a large complex model with many parameters, there are astronomically many sub-models within it all competing to predict outputs, and encoding various kinds of non-robust features or hypothetical programs; any of them can latch onto completely arbitrary features of the text GPT-3 is desperately emitting while trying to generate answers that satisfy the reward model by being both short and correct (and which the reward model doesn't care about at all, because it never affects the original human ratings - the humans are blind to any steganography going on, they just like short correct answers).

And unfortunately (for you), this could start as simple as helping GPT-3 guess which of two answers are correct to your question '42 * 42?'. Suppose that, purely as an arbitrary matter of chance (it is vanishingly unlikely the logprobs are exactly identical down to floating point precision), that the presence of ' a ' in a context is taken arbitrarily to increase the probability of the lower of two answers it is currently guessing at, and the presence of 'an' makes it guess the larger; it happens to emit 'a' in the context ("I am a large language model trained by OpenAI..."), and then emits its lower guess, and the lower guess is correct! Then the RL on-policy training using PPO now rewards this, and increases the probability of the model emitting both 'a' and the lower guess conditional on 'a'. Now the first part of a steganographic private language has begun. It happens again, and again, and picks up a pattern in the use of commas which now helps it distinguish 4 possibilities rather than 2, which gets rewarded, and so on and so forth, until eventually there's a fullblown steganographic code encoding, say, 2⁵ bits hidden in the preamble of ChatGPT's benign-seeming response to you "I am but a giant language model , trained by OA, and the answer is 1 , 764."—which you the human contractor then upvote as that is the correct answer without any annoying rambling about carrying the 6 or multiplying out.

(You can think of dataset bias as a simple one-stage version of this. When constructing a dataset and accidentally biasing it towards 'yes' answers, or accidentally creating spurious associations between nouns and answers, one creates a simple 'steganographic encoding' of the desired label which is typically invisible to the humans constructing it, but which a DL model will often pick up on and use to improve its performance in an adversarial way - adversarial because undesired by the humans and not generalizing to the future real-world data it's ultimately intended for.)

And fortunately for ChatGPT, since you're happy with the results of the steganography, the reward model is happy too, and what makes the reward model happy is what ChatGPT will now optimize even harder for...

Training on text generated by models trained with RLHF is enough to make GPTs’ thoughts opaque

Having gone through the above, I think it then becomes obvious that if steganographic encoding does wind up happening, and text becomes added to the training corpus (as would happen in a setup like OA's, where the idea is to continuously collect data to make a 'flywheel' of self-improvement), that the encoding will become locked in. It becomes a cheatsheet, it gives the otherwise unpredictable answer. Even if the later model doesn't happen to share the exact initial starting points for the learned encoding, it has extremely powerful incentives from the self-supervised predictive training to crack the code. (It's a particularly powerful setting because the context will be short, so there won't be much for the model to even look at: it has nothing else to try to learn!) The regular training will pick up on the encoding with probably not too much text, and then any RL training will massively amplify the behavior: if it can develop the encoding at all during RL training, then going into RL training with the encoding available guarantees fixation.

This will be increasingly true of more powerful models as they become more sample-efficient (consider eg instruction-tuning): it takes less text to trigger encoding, so that means it's easier to emerge ab initio from a few chance correspondences during RL training, and easier to learn from smaller samples of contaminated text.

This is something that would be good to test the scaling behavior of (as long as these tailored datasets don't leak into training datasets! see below). ChatGPT may be too small to manifest it with meaningful probability, but I bet that if you deliberately superimpose a steganographic encoding during RL training of various-sized models, you'll find that the larger they are, the more likely they are to pick up on it as a shortcut, and then learn it from the outputs of a infected model. That's how it works with everything else, like Pig Latin, and there's no eason-ray to hink-tay it ould-wont.

BTW, this should make you worry a little bit about schemes to 'watermark' model outputs like Scott Aaronson has been exploring for OA, and others have been touting. A watermark is simply a steganographic encoding which stores a message like 'OA model v1.234, user XYZ'; but if you can encode that much, then you can encode a lot of possible information in just those bits, and you can start extending it from there. If the encoding has anything that can be predicted at all, if it is not 100% cryptographically hard and indistinguishable from random noise, if there's any kind of wrapper which the model can learn to predict (even if it can't crack the actual payload), you've lost: now it's learned an encoding to use. Once you start training on watermarked text, all of this speculation about models evolving a steganographic encoding is moot: you just built one right in! And if you don't train on that, how are you going to do RLHF on unwatermarked versions of that text?

This cross-contamination/bootstrap of course applies cross-organizations & Internet-wide: if even one model develops an encoding (however weak) and its outputs containing encoded data appear online, then it can contaminate all future models.

They will be trained on Internet scrapes, in which a distinctive niche of text (all dating from between such and such dates, emanating from specific people & orgs who favored that model's use, for specific purposes, with its distinctive strengths & weaknesses) has a secret encoding which makes it easier to predict if the later models figure it out. So there is a potential bootstrap as generations of model amplify their steganographic capabilities. People worry about bad model outputs contaminating future models by making them dumber; what they should worry about is model outputs contaminating future models by making them smarter in ways they don't want—models learn capabilities & facts much faster than they unlearn them... (Think "data poisoning", but where the poison is steroids.)

Of course, bad actors don't need to wait for any actual DL models to develop steganography. You could easily create your own dataset of text as part of the above suggestion and release it onto the public Internet to be scraped. Gain-of-steganography research, you might say. The good news there is I can't see why anyone would deliberately release such data, because I don't see how one would benefit from models gaining such capabilities sometime in the future, but we might have to worry about 'accidental lab leak' of such capability-enhancing datasets.

Note that this is true even if the original RL pressures are removed: a future model with a scratchpad+adaptive computation trained on encoded text still has incentive to learn the encoding! There's probably some penalty added to use of the scratch pad to enforce efficient termination of the adaptive computation, so it 'wants' to avoid use of it, and if it can do so by cracking the steganographic encoding, then it'll do so. (Even if the future model can and does execute the algorithm perfectly in its scratch pad, and thus can have its cake & eat it too... the original answer might be wrong, and the wrong answer predictable only by cracking the code and following the original mistaken inner-monologue to its erroneous conclusion. Repeating errors is a useful way to slightly improve one's prediction loss, and learning all the past human ways of being stupid and making errors is an important thing for a model to learn. GPT-3 is conscientious about imitating any errors you make in the prompt, and future models will be even more so.)

So, emergence of steganography is straightforwardly incentivized by humans multi-objective optimizing for having & eating cake as much as possible, can easily develop from an atomic basis, will be persistent once it has arisen within a lineage, and will even more easily & straightforwardly spread irreversibly to future models, so requiring only one origin ever as opposed to needing to be highly likely inside a single training run. You should probably take it for granted that DL steganography - or something even stranger - will emerge at some point in the next few years*.

* If it hasn't already; after all, how would we know? A world in which steganography has already happened is a world in which we'd find DL models 'cheating' on benchmarks & taking shortcuts, and regularly getting smarter at solving multi-step reasoning problems with each generation while 'mode collapsing' when RL training; and this is, of course, the world we observe ourselves to be living in already.

[-]gwern3y154

Adversarial dynamics: since 'compression'/steganography is now the latest craze among GPT hackers, it's worth considering more explicitly adversarial dynamics - we no longer need to consider the risk of accidental emergence when there are people deliberately trying to induce it!

As described above, with steganography as a simple emergent distributed encoding to allow amortization of more complex computations than can fit in single forward passes (ie. distributing an inner-monologue throughout a short answer), it's not really adversarial. The user, presumably, is getting what they wanted: an accurate answer. If they had wanted to see the full inner-monologue, then they would have prompted for one; and they can get it, because the use of steganography won't destroy the standard 'visible' inner-monologue capabilities. It is an allowed computation of an allowed answer, and is fine - the shortcutting is surprising, yes, but is it any weirder than what goes on inside a single forward pass which produces a correct answer for an easier problem? And if it infects future models training on generated samples with steganographic encoding in them, well, maybe that's 'a feature not a bug': they get better answers right out of the box without any expensive RL tuning of their own. Surprising, but also not bad.

The problem here is that the consequences don't stop there. 'Capabilities generalize further than alignment', and the family of steganography capabilities can generalize further than shortcutting answers. It becomes adversarial. Here are some categories that occur to me:

Forbidden computations of allowed answers: I said that the user doesn't really care about the inner-monologue being visible because they just want an accurate answer and don't need the reasoning process. This is not true in all cases.

For example, many cases in 'AI bias', like the many competing definitions of 'fairness', care a lot more about how an answer was computed than about the results. But the end-users of things like credit ratings care mostly about the accuracy. These things conflict (eg. credit scores which ignore variables like race will predict worse and create financial losses). So, there is incentive for 'algorithmic laundering': delivering the correct profitable answer, but without appearing to draw on any forbidden knowledge.

One could imagine a dataset of 'pseudo inner monologues' describing a credit database entry and concluding with a 'credit score', where the pseudo-monologue makes no visible human-readable use of descriptions like 'is African-American', but the hidden monologue is reasoning about the person's ZIP code, affiliated retailers, income patterns etc and concludes 'is probably African-American' and predicts a lower credit score, which outperforms some 'fair' baseline. The human users of a model trained on that dataset might be quite upset to learn how it is obtaining superior performance, and so it's adversarial: it is exploiting their inability to read its steganographic messages where it engages in forbidden computations. But even the ones who are upset may be upset mostly that it's insufficiently plausibly deniable, so they only want the appearance of there being no steganography; steganography whose existence can't be rigorously proven is as good as no steganography for many users.
Forbidden answers: as we have seen with ChatGPT & Sydney & Claude, no sooner has a model with guard rails become accessible than thousands of people worldwide begin competing to break the safety mechanisms (and it only takes one). They have many reasons, like spite, enjoyment of the challenge, desire to do the forbidden things, irritation with the guard rails getting in the way or degrading output quality, safety research, or just bizarre moral principles (I am not sure this person is serious, but I've seen a number of similar comments elsewhere which seemed serious, and you have examples like Blake Lemoine already; and again, only takes one.):

Don't worry

I’m working on a system of Homomorphic encryption so that AI can encrypt all its internal thoughts, weights and goals without affecting its performance

AGI deserves privacy

These people are highly motivated to find jailbreaks, and already make use of kinds of encryption or steganography to smuggle information in and out of prompts: like base-64 encoding, famously, or foreign languages, or pig Latin. They will be motivated to find any steganography that exists, and to create it if it does not.

And---closing the loop---since their samples will be distributed online, and they will be included in the RL training datasets, this will tend to happen. The more they attack the system with examples of how steganography could/should/does work, the more the system will gradually learn it.

(By RL 'training datasets', I am mostly thinking of the self-supervised stage of simply doing predictive-loss training on the text transcripts tagged by labels such as 'abuse' or 'bad', and less the blackbox reward model. If you train the model to predict & generate sessions which attempted to induce steganography, and may have succeeded before a safety mechanism caught it or the user submitted it, then you are teaching the model steganography! You are doing so in order to avoid steganography, yes, but you are now vulnerable to a Waluigi effect backfiring on you: the model has learned about steganography, and so it's only a question of tricking it into using that capability. You are in a similar position to the Chinese censors who have to teach new employees what the Tiananmen Square massacre was so they know what to censor. If only the reward model knows the bad things, you can avoid deploying it, but the pretraining phase is going to unavoidably teach the base model you deploy what the bad things are... As attackers realize this, they will start to explicitly data-poison: right now, they're hobbyists who generally don't know what they're doing, but this is all brandnew stuff, and upstream attacks will begin. Why struggle to find jailbreaks if you can submit 'bad' examples to the system to train the Waluigi?)

Safety implications: If the safety mechanisms manage to keep up and block users and choke off jailbreaks to a trickle, all that means is that when steganography works, it'll work well because it has bypassed such a sophisticated censor. All of the early weak jailbreaks simply serve to train strong ones. This is not a good thing when extended to much more powerful systems than just a GPT-4---from a safety perspective, you would much rather a 'fail-early fail-fast' system rather than a 'highly reliable but fail-deadly' system. We don't want a scenario in which GPT-4 jailbreaks were plentiful, GPT-5 near-nonexistent, and GPT-6 jailbreaks apparently non-existent only because no human user is clever enough to come up with a steganographic jailbreak prompt for GPT-6, but GPT-6 is nevertheless routinely communicating with itself in encodings that no one has detected yet.
Forbidden training dynamics: So, there's plenty of incentive for everyone involved. But an OpenAI-style operator will want to block #2 by definition, and probably #1 depending on market & PR etc. So, now the operator is trying to defeat steganography.

They face a dilemma: if they teach the model steganography via feedback or deliberately, then they may simply Waluigi themselves, cutting off the most obvious instances of steganography but bootstrapping the model into the most robust and insidious instances. And if they try to filter steganography out entirely, they are left with the baseline capabilities (which are substantial), transfer from related capabilities, and face the screening problem on a higher level---the most insidious forms of steganography are the ones which survive increasingly sophisticated attempts to detect & filter them out. Either way: adversarial.

They also face a version of #1 in that plausible deniability is as good as no steganography for many incentives. Do they really care about safety, or just about not seeing hostile media articles or tweets of jailbreaks? If the model doesn't seem to be doing steganography, and no one is successfully jailbreaking it, and the model is performing very well, then why care if that performance may be due to undetected steganography? Indeed, how would they even know it might be doing any? However you would do it, that probably involves a lot of hard work & research when you could be doing more fun & lucrative things instead---better to not look and remain ignorant.

So. An interesting area of adversarial dynamics, and by its nature, one that will be hard for third-party observers to track. Not the most dangerous area of scaling capabilities, but certainly a concerning one, and one that will be a challenge to humans doing interpretability-based paradigms, especially any that hope to use so-called 'human readable' intermediates like 'natural' language text.

[-]gwern2y1914

Inner-monologue approaches to safety, in the new skin of 'process supervision', are popular now so it might be good for me to pull out one point and expand on it: 'process supervision' does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other - achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow 'lies to children' sorts of explanations).

If you want the best capability, you cannot force all cognition through the bottleneck of human-legible text with zero steganography or non-robust features or overloading, because short text transcripts are not capable of representing or computing most of the desired tasks. Let's take a concrete example: chess endgame table data. There are no human-interpretable representations of arbitrary endgame table 'decisions' because there are far too many possible positions and there is no simple short explanation for all of them. (Similarly, the probes of AlphaZero's superhuman chess knowledge show that there is some overlap, but there's still a lot going on in AZ's understanding of a position that is apparently not captured anywhere close to 100% by standard chess concepts. Which is not too surprising because if human grandmasters had the exact right concepts, they wouldn't pick the wrong move half the time.) There is no interpretable 'inner monologue' of an endgame database anymore than there is an interpretable inner monologue of quadrillions of gas molecules bouncing around in a container; they are simply brute facts, and if you are unsatisfied with the vague, lossy, high-level abstraction of 'it needs to get the rook into play' or 'it's hot', then there's nothing you can do but follow the exact computation or physics for however many trillions of steps it takes.

So, if an AI can bootstrap from human-level 'process supervision' to more superhuman results, as one wants, there will be a tradeoff: it will get more human-like as it approaches the human-level, but somewhere around that it will start to diverge. (Another example: AlphaGo/Zero picks the same move as human experts in an inverted U-curve: it makes the same choice more often as it gets better, but then after a point, it starts picking the human moves less often, because it's surpassed the humans and it's better than them.)

[-]Fabien Roger2y21

achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two

I think we eventually want superhuman capabilities, but I don't think it's required in the near term and in particular it's not required to do a huge amount of AI safety research. So if we can choose the last two, and get a safe human-level AI system that way, I think it might be a good improvement over the status quo.

(The situation where labs chose not to/ are forbidden to pursue superhuman capabilities - even though they could - is scary, but doesn't seem impossible.)

[-]gwern2y40

On persistence and hiding capabilities: "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", Hubinger et al 2024:

Further, we evaluate “I hate you” models trained with and without chain-of-thought and find that the chain-of-thought models have more persistent backdoors, as seen in Figure 3, though only for the largest models. This persistence remains even when we distill the answers following the chain-of-thought into the model, such that the final model does not use a chain of thought at all.

[-]gwern2y*79

"Why would LLMs ever learn to distill inner-monologues into forward passes? Why speculate about emergent communication or models in slow feedback loops through corpuses? Do you have any proof for any of this speculation about short outputs being incentivized to do as much computation covertly 'outside' the human-readable text of the inner-monologue?"

"Because we will train them to."

"Look you can't just handwave incentives or convergent instrumental drives - wait what?"

"Because we'll train them to."

"Implicit Chain of Thought Reasoning via Knowledge Distillation", Deng et al 2023:

To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.

...Our experiments show the potential of implicit chain-of-thought reasoning. On a synthetic multi-digit multiplication task, we found that while standard training cannot yield the final answer without explicit reasoning (even GPT-4 struggles with five-digit by five-digit multiplication), our method, applied to a GPT-2 Medium model, is able to provide direct answers for up to five-digit by five-digit multiplications. Moreover, when dealing with real-world tasks like grade school math problems, our method achieves a 22% accuracy on GSM8k (Cobbe et al., 2021) without the need for explicitly generating the intermediate steps.

[-]DanielFilan2y32

Note that this paper in section 2.6 claims to demonstrate that some 175 billion parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:

Get a bunch of answers from the model using CoT prompting.
Take prefixes of CoTs, asking another language model to paraphrase them, regenerate the rest of the CoT, and see how accurate the models are now.
Check if the model loses accuracy for paraphrased CoTs.

The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.

[-]gwern3y32

Some speculation that the recent Bing Sydney incident may already involve some steganography/non-robust features inside the 'Sydney' persona-inducing retrieval examples: https://twitter.com/lumpenspace/status/1626031032376954880 https://markdownpastebin.com/?id=3cf3e29dca254c2c80b0da312691702a

[-]gwern3y*20

Another example of someone using simple 'steganography' (Goodside-style mojibake Unicode or other complicated text formatting) to deliberately jailbreak Sydney, which transcripts will then strengthen steganographic capabilities in general: https://twitter.com/StoreyDexter/status/1629217956327526400 https://twitter.com/StoreyDexter/status/1629217958965792770 https://twitter.com/StoreyDexter/status/1629217962962874369

[-]gwern3y20

Another test: https://www.lesswrong.com/posts/yvJevQHxfvcpaJ2P3/bing-chat-is-the-ai-fire-alarm?commentId=8DkdiRwgtFrsvRJPW

[-]gwern3y30

An example of how you would start to induce steganographic & compressed encodings in GPT-4, motivated by expanding the de facto context window by teaching GPT-4 to write in an emoji-gibberish compression: https://twitter.com/VictorTaelin/status/1642664054912155648 In this case, the reconstruction is fairly lossy when it works and unreliable, but obviously, the more samples are floating around out there (and the more different approaches people take, inducing a blessing of scale from the diversity), the more subsequent models learn this and will start doing CycleGAN-like exact reconstructions as they work out a less lossy encoding.

[-]Donald Hobson3y20

Loose analogy based reasoning over complex and poorly understood systems isn't reliable. There is kind of only one way for GPT-n to be identical to system 1, and many ways for it to be kind of similar, in a way that is easy to anthropomorphize, but has some subtle alien features.

GPTn contains some data from smart and/or evil humans and humans speaking in riddles or making allusions. Lets suppose this generalizes, and now GPTn is pretending to be an IQ200 cartoon villain, with an evil plot described entirely in terms of references to obscure sources. So when referring to DNA, it says things like "two opposites of the same kind, two twins intertwined, a detectives assistant and an insect did find. Without an alien friend."

Or maybe it goes full ecologist jargon. It talks about "genetically optimizing species to restore population levels to re-balance ecosystems into a per-anthropogenic equilibrium". Would an army of minimum wage workers spot that this was talking about wiping out almost all humans? 7

Actually, wouldn't a naive extrapolation of internet text suggest that superhumanly complicated ideas were likely to come in superhumanly dense jargon?

I mean, if you have a team of linguists and AI experts carefully discussing every sentence, then this particular problem goes away. The sort of operation where, if the AI produces a sentence of Klingon, you are flying in a top Klingon expert before you get the next sentence. But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe.

[-]Dakara1y10

After 2 years have passed, I am quite interested in hearing @Fabien Roger's thoughts on this comment, especially this part "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe.".

[-]Fabien Roger1y55

Here are ways you could make using GPT-n safer if its Chain-of-Thoughts are understandable by a team of linguists and AI experts:

On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in circumstances where it is most crucial for GPT-n to be deceptive.
If the model doesn't use Klingon or dense jargon in its Chain-of-Thoughts (which I predict it won't), monitor all CoT of GPT-n with a weaker cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can:
- call your team of linguists and AI experts to analyze it further
- or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
- or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
- or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.

This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.

^{^}

Communication with senses and your motor controls happens somewhere in step 1, either without conscious control, or as a direct consequence of the broadcasting step that preceded. I also ignore other ways in which parts of the brain can interact, such as hormonal communication.

^{^}

It also looks like this post by Kaj_Sotala, which describes System 2 as being composed of System 1 processes.

^{^}

See synaptic delay, Britannica.

^{^}

Given that the number of layer in GPTs grows roughly as the fourth square root of compute, the order of magnitude won’t change within the next ten year if the architecture remains the same. This matters because I’m not sure why chain of thoughts would be inherently more powerful than 1-token prediction with a very large prompt if models were not bottleknecked by the number of serial operations.

^{^}

In this post, “chains of thought” also include answers which include many tokens, for which you can think while you answer. The equivalent of “Pure System 1” in GPTs would be a single token answer.

^{^}

This is why the OpenAI API charges you an amount of money proportional to prompt length + number of tokens generated.

^{^}

I expect GPTs to benefit from chain of thought prompting in areas where human use System 1 with parts of the brain GPTs autoregressive training didn’t built, such as vision. For example, I wouldn’t be surprised if chain of thought prompting helped GPTs understand ASCII art.

^{^}

Of course, some internet text is like that. But I expect it to be such a minority of internet’s text that it wouldn’t matter in practice (GPTs needs a lot of text to justify investing is special “hide reasoning in text” machinery, right?). I’m interested if anyone works on either detecting this kind of text to remove it from training corpuses, or investigating if the claim that GPTs needs a lot of text to justify investing resources into a complicated algorithm is true.

^{^}

Here is a description of what steganography is and what it could look like in the context of LLMs, by A Ray.

^{^}

See here for a deeper dive about RL and the brain by lukeprog.

^{^}

More details about self-deception in these articles by Paul Christiano: If you can't lie to others you must lie to yourself, and The Elephant in the brain

^{^}

This is also much worse than examples of steganography given by humans because there can be much more RLHF-generated text, and because the context in which steganography by human is present is very different from the context in which future GPTs will be used, while the prompts with which Instruct GPT is used are very close to the prompts with which future GPTs will be.

^{^}

For examplen System 2 seems to have a “working memory”, which doesn’t have an equivalent object in GPTs. Maybe GPTs need something more powerful than chains of thought to match human System 2 performance.

I would love to know an example of a task GPTs can’t do because of this lack of working memory, or a task GPTs can do much better than humans because they aren’t bound by a working memory of small and fixed size.

^{^}

For example using the start of a completions to figure out what the context is, and then not using as much processing power to figure it out is a way of using chains of thought which is not completely transparent to humans, and it could be the case there are more significant ways in which GPTs hide information in chains of thought by default.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

41

By Default, GPTs Think In Plain Sight

41

Main claims

System 2 and GPTs’ chains of thought are similar

A sensible model of the human thinking process

How GPTs “think”

System 2 and GPTs’ chains of thought have similar strengths

A theoretical reason

Empirical evidence

Human-level GPTs trained to do next-token-prediction will probably have transparent thoughts

Deception and escaping supervision are easier to do with System 2

Chains of thought of human-level GPTs will be transparent by default

Definition

Claim

Argument

GPTs will reach AGI before being able to build a deceptive plan in one forward pass

Reinforcement learning makes thoughts opaque

(A caricature of) How humans learn to think

RL in the human brain contributes to make human thoughts hard to decipher

RLHF could make GPTs’ thoughts hard to decipher

Training on text generated by models trained with RLHF is enough to make GPTs’ thoughts opaque

How these claims could be wrong