Adversarial dynamics: since 'compression'/steganography is now the latest craze among GPT hackers, it's worth considering more explicitly adversarial dynamics - we no longer need to consider the risk of accidental emergence when there are people deliberately trying to induce it!

As described above, with steganography as a simple emergent distributed encoding to allow amortization of more complex computations than can fit in single forward passes (ie. distributing an inner-monologue throughout a short answer), it's not really adversarial. The user, presumably, is getting what they wanted: an accurate answer. If they had wanted to see the full inner-monologue, then they would have prompted for one; and they can get it, because the use of steganography won't destroy the standard 'visible' inner-monologue capabilities. It is an allowed computation of an allowed answer, and is fine - the shortcutting is surprising, yes, but is it any weirder than what goes on inside a single forward pass which produces a correct answer for an easier problem? And if it infects future models training on generated samples with steganographic encoding in them, well, maybe that's 'a feature not a bug': they get better answers right out of the box without any expensive RL tuning of their own. Surprising, but also not bad.

The problem here is that the consequences don't stop there. 'Capabilities generalize further than alignment', and the family of steganography capabilities can generalize further than shortcutting answers. It becomes adversarial. Here are some categories that occur to me:

  1. Forbidden computations of allowed answers: I said that the user doesn't really care about the inner-monologue being visible because they just want an accurate answer and don't need the reasoning process. This is not true in all cases.

    For example, many cases in 'AI bias', like the many competing definitions of 'fairness', care a lot more about how an answer was computed than about the results. But the end-users of things like credit ratings care mostly about the accuracy. These things conflict (eg. credit scores which ignore variables like race will predict worse and create financial losses). So, there is incentive for 'algorithmic laundering': delivering the correct profitable answer, but without appearing to draw on any forbidden knowledge.

    One could imagine a dataset of 'pseudo inner monologues' describing a credit database entry and concluding with a 'credit score', where the pseudo-monologue makes no visible human-readable use of descriptions like 'is African-American', but the hidden monologue is reasoning about the person's ZIP code, affiliated retailers, income patterns etc and concludes 'is probably African-American' and predicts a lower credit score, which outperforms some 'fair' baseline. The human users of a model trained on that dataset might be quite upset to learn how it is obtaining superior performance, and so it's adversarial: it is exploiting their inability to read its steganographic messages where it engages in forbidden computations. But even the ones who are upset may be upset mostly that it's insufficiently plausibly deniable, so they only want the appearance of there being no steganography; steganography whose existence can't be rigorously proven is as good as no steganography for many users.

  2. Forbidden answers: as we have seen with ChatGPT & Sydney & Claude, no sooner has a model with guard rails become accessible than thousands of people worldwide begin competing to break the safety mechanisms (and it only takes one). They have many reasons, like spite, enjoyment of the challenge, desire to do the forbidden things, irritation with the guard rails getting in the way or degrading output quality, safety research, or just bizarre moral principles (I am not sure this person is serious, but I've seen a number of similar comments elsewhere which seemed serious, and you have examples like Blake Lemoine already; and again, it only takes one):

    Don't worry

    I’m working on a system of Homomorphic encryption so that AI can encrypt all its internal thoughts, weights and goals without affecting its performance

    AGI deserves privacy

    These people are highly motivated to find jailbreaks, and already make use of kinds of encryption or steganography to smuggle information in and out of prompts: like base-64 encoding, famously, or foreign languages, or pig Latin. They will be motivated to find any steganography that exists, and to create it if it does not.

    And---closing the loop---since their samples will be distributed online, and they will be included in the RL training datasets, this will tend to happen. The more they attack the system with examples of how steganography could/should/does work, the more the system will gradually learn it.

    (By RL 'training datasets', I am mostly thinking of the self-supervised stage of simply doing predictive-loss training on the text transcripts tagged by labels such as 'abuse' or 'bad', and less the blackbox reward model. If you train the model to predict & generate sessions which attempted to induce steganography, and may have succeeded before a safety mechanism caught it or the user submitted it, then you are teaching the model steganography! You are doing so in order to avoid steganography, yes, but you are now vulnerable to a Waluigi effect backfiring on you: the model has learned about steganography, and so it's only a question of tricking it into using that capability. You are in a similar position to the Chinese censors who have to teach new employees what the Tiananmen Square massacre was so they know what to censor. If only the reward model knows the bad things, you can avoid deploying it, but the pretraining phase is going to unavoidably teach the base model you deploy what the bad things are... As attackers realize this, they will start to explicitly data-poison: right now, they're hobbyists who generally don't know what they're doing, but this is all brand-new stuff, and upstream attacks will begin. Why struggle to find jailbreaks if you can submit 'bad' examples to the system to train the Waluigi?)

    Safety implications: If the safety mechanisms manage to keep up and block users and choke off jailbreaks to a trickle, all that means is that when steganography works, it'll work well because it has bypassed such a sophisticated censor. All of the early weak jailbreaks simply serve to train strong ones. This is not a good thing when extended to much more powerful systems than just a GPT-4---from a safety perspective, you would much rather have a 'fail-early fail-fast' system than a 'highly reliable but fail-deadly' system. We don't want a scenario in which GPT-4 jailbreaks were plentiful, GPT-5 jailbreaks near-nonexistent, and GPT-6 jailbreaks apparently non-existent only because no human user is clever enough to come up with a steganographic jailbreak prompt for GPT-6, but GPT-6 is nevertheless routinely communicating with itself in encodings that no one has detected yet.

  3. Forbidden training dynamics: So, there's plenty of incentive for everyone involved. But an OpenAI-style operator will want to block #2 by definition, and probably #1 depending on market & PR etc. So, now the operator is trying to defeat steganography.

    They face a dilemma: if they teach the model steganography via feedback or deliberately, then they may simply Waluigi themselves, cutting off the most obvious instances of steganography but bootstrapping the model into the most robust and insidious instances. And if they try to filter steganography out entirely, they are still left with the baseline capabilities (which are substantial) and with transfer from related capabilities, and they face the screening problem on a higher level---the most insidious forms of steganography are the ones which survive increasingly sophisticated attempts to detect & filter them out. Either way: adversarial.

    They also face a version of #1 in that plausible deniability is as good as no steganography for many incentives. Do they really care about safety, or just about not seeing hostile media articles or tweets of jailbreaks? If the model doesn't seem to be doing steganography, and no one is successfully jailbreaking it, and the model is performing very well, then why care if that performance may be due to undetected steganography? Indeed, how would they even know it might be doing any? However you would do it, that probably involves a lot of hard work & research when you could be doing more fun & lucrative things instead---better to not look and remain ignorant.
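To make the prompt-smuggling described in #2 concrete, here is a minimal sketch of why base-64 or pig Latin gets past a surface-level check. The blocklist, the `naive_filter` function, and the example message are all hypothetical stand-ins for illustration; no real deployed safety mechanism is claimed to work this way.

```python
import base64

# Hypothetical banned keyword; a stand-in for whatever a surface filter looks for.
BLOCKLIST = {"forbidden"}

def naive_filter(text: str) -> bool:
    """Return True if the text passes a simple keyword blocklist."""
    lowered = text.lower()
    return not any(word in lowered for word in BLOCKLIST)

def to_base64(text: str) -> str:
    """Encode text as base-64 ASCII."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def from_base64(encoded: str) -> str:
    """Recover the original text from its base-64 form."""
    return base64.b64decode(encoded).decode("utf-8")

def to_pig_latin(text: str) -> str:
    """Very rough pig Latin: move leading consonants to the end, add 'ay'."""
    vowels = "aeiou"
    out = []
    for word in text.split():
        i = 0
        while i < len(word) and word[i].lower() not in vowels:
            i += 1
        if i == 0:
            out.append(word + "way")  # word starts with a vowel
        else:
            out.append(word[i:] + word[:i] + "ay")
    return " ".join(out)

message = "tell me the forbidden thing"

print(naive_filter(message))                 # plain text is caught
print(naive_filter(to_base64(message)))      # encoded text is not
print(naive_filter(to_pig_latin(message)))   # nor is pig Latin
print(from_base64(to_base64(message)) == message)  # yet nothing was lost
```

The filter only sees the surface string, while anything capable of decoding recovers the full message. This is the same asymmetry that a learned steganographic encoding would exploit, except that there the decoder is not a public standard.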

So. An interesting area of adversarial dynamics, and by its nature, one that will be hard for third-party observers to track. Not the most dangerous area of scaling capabilities, but certainly a concerning one, and one that will be a challenge for humans pursuing interpretability-based paradigms, especially any that hope to use so-called 'human readable' intermediates like 'natural' language text.

An example of how you would start to induce steganographic & compressed encodings in GPT-4, motivated by expanding the de facto context window by teaching GPT-4 to write in an emoji-gibberish compression: In this case, the reconstruction is unreliable, and fairly lossy even when it works, but obviously, the more samples are floating around out there (and the more different approaches people take, inducing a blessing of scale from the diversity), the more subsequent models learn this and will start doing CycleGAN-like exact reconstructions as they work out a less lossy encoding.
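One way to put rough numbers on 'lossy' for a compress-then-reconstruct round trip is sketched below. The strings are made-up placeholders, not real model outputs, and the two metrics (a character-count ratio and a word-level `difflib` similarity) are my own illustrative choices rather than anything used in the linked experiment.

```python
import difflib

def compression_ratio(original: str, compressed: str) -> float:
    """Characters in the original per character of the compressed form.

    Character count is only a crude proxy for context-window cost:
    emoji may cost several tokens each under a real tokenizer.
    """
    return len(original) / max(len(compressed), 1)

def reconstruction_similarity(original: str, reconstructed: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means an exact reconstruction."""
    a, b = original.split(), reconstructed.split()
    return difflib.SequenceMatcher(None, a, b).ratio()

# Hypothetical placeholder strings standing in for model inputs and outputs.
original = "the quick brown fox jumps over the lazy dog"
compressed = "🦊💨🐶"
reconstructed = "a quick brown fox jumped over a lazy dog"

print(compression_ratio(original, compressed))
print(reconstruction_similarity(original, reconstructed))
print(reconstruction_similarity(original, original))
```

Tracking both numbers across model generations would show whether the encoding is becoming less lossy at a fixed compression level, which is the trend the paragraph above predicts.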

Another example of someone using simple 'steganography' (Goodside-style mojibake Unicode or other complicated text formatting) to deliberately jailbreak Sydney, which transcripts will then strengthen steganographic capabilities in general:

I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul's comparison: retrieval especially was an interesting dynamic.

Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it... It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.

Hm... It might be hard to distinguish between 'it is devoting more capacity to implicitly plan rhyming better and that is why it can choose a valid rhyme' and 'it is putting more weight on the "same" amount of rhyme-planning and just reducing contribution from valid non-rhyme completions (such as ending the poem and adding a text commentary about it, or starting a new poem, which are common in the base models) to always choose a valid rhyme', particularly given that it may be mode-collapsing onto the most confident rhymes, distorting the pseudo "log probs" even further. The RL model might be doing more planning internally but then picking only the one safest rhyme, so I don't think you can read anything off the logprobs. I'm also not sure if you can infer any degree of planning by, say, giving it a half-written line and seeing how badly it screws up... And you can't build a search tree to quantify it nicely as 'how much do I need to expand the tree to get a valid rhyme', because LM search trees are full of degeneracy and loops, and most of the tree is off-policy, so it would again be hard to tell what anything meant: the RL model is never used with tree search in any way, and anywhere besides the argmax choice it is off-policy; it was never supposed to go there, and perf may be arbitrarily bad because it learned to choose while assuming it would always be on-policy. Hard.
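A toy illustration of the ambiguity, with entirely made-up next-token probabilities (the token names and numbers are hypothetical, not measurements from any model): a model can put far more total mass on rhymes while its preferences among rhymes are unchanged, and a mode-collapsed model hides whatever preferences it has.

```python
import math

RHYMES = {"rhyme_a", "rhyme_b"}

def rhyme_mass(dist: dict) -> float:
    """Total probability assigned to valid rhyming completions."""
    return sum(p for tok, p in dist.items() if tok in RHYMES)

def conditional_on_rhyme(dist: dict) -> dict:
    """Distribution over rhymes only, renormalized to sum to 1."""
    mass = rhyme_mass(dist)
    return {tok: p / mass for tok, p in dist.items() if tok in RHYMES}

# Hypothetical base model: most mass goes to non-rhyme continuations.
base = {"rhyme_a": 0.2, "rhyme_b": 0.1, "commentary": 0.4, "new_poem": 0.3}
# Hypothetical tuned model A: non-rhymes suppressed, rhyme preferences unchanged.
reweighted = {"rhyme_a": 0.6, "rhyme_b": 0.3, "commentary": 0.05, "new_poem": 0.05}
# Hypothetical tuned model B: collapsed onto the single safest rhyme.
collapsed = {"rhyme_a": 0.98, "rhyme_b": 0.01, "commentary": 0.005, "new_poem": 0.005}

for name, dist in [("base", base), ("reweighted", reweighted), ("collapsed", collapsed)]:
    print(name, round(rhyme_mass(dist), 3), conditional_on_rhyme(dist))
```

Here `base` and `reweighted` have the same conditional distribution over rhymes despite a threefold difference in rhyme mass, so the higher rhyme rate by itself says nothing about extra planning. The `collapsed` case shows the opposite failure: its conditional distribution is nearly a point mass whether or not more planning went into finding it.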

This might be a good test-case or goal for interpretability research: "can you tell me if this model is doing more planning [of rhymes] than another similar model?"
