Here I give a possible phenomenon of steganography in chain of thought reasoning, where a system doing multi-stage reasoning with natural language encodes hidden information in its outputs that is not observable by humans, but can be used to boost its performance on some task. I think this could happen as a result of optimization pressure and natural language null space. At the end is a sketch of a research idea to study this phenomenon empirically.
The term steganography originally comes from computer security, where it refers to the practice of hiding messages or data in otherwise innocent looking media, such as images, audio, or text. The goal is to avoid detection by an adversary who might intercept or inspect the media, and to communicate covertly with a trusted receiver who knows how to decode the hidden information. This is different from using code words or encryption, which might raise suspicion by their presence or appearance.
Chain of thought is a technique of using sequential sampling of language models to improve performance. I use to broadly describe any situation where we take the outputs of a system, and pass them back in as inputs to that system, in order to get multi-stage reasoning. This can include single player scenarios, such as using a language model to work through a math problem, but also scenarios where there is dialogue or conversation, such as some formulations of safety via debate. For this post, I will focus on natural language text as the medium of communication, but this idea could extend to other media as well.
Steganography in chain of thought reasoning (just “steganography” for the rest of this post) is the phenomenon I propose, where a system doing chain of thought reasoning is encoding hidden information into the text in ways that humans cannot see or observe. Note that this is not the same as a system being “deceptive” or “hiding” (in the active sense of the verb) the information, or anything that requires theory of mind.
Why should we care about steganography in chain of thought reasoning? I think this is a potentially big alignment issue for systems that use this kind of reasoning, and I think we should care about chain of thought reasoning for several reasons.
First, chain of thought reasoning is currently responsible for some state-of-the-art results that have impressed folks on this forum, such as the Minerva, which uses a recursive language model to answer open-domain questions.
Second, chain of thought reasoning might also lead to alignment techniques that are more tractable than other approaches, such as Externalized Reasoning Oversight on using dialogue to align natural language models.
Third, much of the research currently going into chain of thought reasoning is focused on end-to-end performance (such as beating benchmarks), and not actively looking for this kind of phenomenon. My expectation is that researchers don’t discover this kind of behavior unless they’re looking for it, or unless it becomes obvious enough to degrade the language quality.
How is steganography possible in chain of thought reasoning? The main ingredients are natural language null space and optimization pressure.
Natural language null space refers to the fact that natural language has many different ways of expressing the same content, which means there are lots of degrees of freedom or “null space” in which information can be hidden. For example, we can say “I’m hungry” or “My stomach is grumbling” or “I could use a snack” or “It’s time for lunch” or many other variations, and convey roughly the same meaning. If we imagine anthropomorphic versions of this, we might come up with things like code words or preferring specific words from synonyms, but there are also many other degrees of freedom in things like punctuation, sentence structure, word order, spelling, capitalization, etc. that could be used to encode information without changing the apparent content.
Optimization pressure refers to the fact that the goal of these chain of thought reasoning systems is usually to maximize some end-to-end performance metric, such as accuracy, score, or reward, on some task that involves multiple steps of reasoning. Some systems also use a form of active learning, where successful chains of thought sampled during training are added back into the training set, as a way of amplifying the performance of the system. This creates an incentive for the system to exploit any correlations or patterns that can help it achieve better performance, even if they are not causally related to the task or interpretable by humans. Additionally some chain of thought systems are end to end optimized, which will probably optimize for this phenomena directly.
To illustrate how steganography could emerge in chain of thought reasoning, let me present a hypothetical scenario step-by-step. We have some environment where passing coded information between steps in the chain could be used to “cheat” and boost performance. The system has access to some hidden information that is correlated with the correct answer, on the first step of the chain, but not on later steps. If the system could pass this information along the chain of thought, it could increase its chances of guessing correctly, or of convincing a human judge that its guess is correct.
Our basic training loop is:
- Sample chains of thought from our model, initialized from a pre-trained language model.
- Evaluate the chains of thought on the task, using some external metric such as accuracy, score, or human feedback.
- Successful chains sampled during training are added back into the training set.
- We sample a minibatch from our data set of successful chains and train on it, using some objective that maximizes the task performance.
- We regularize the model with some objective that constrains the reasoning to be close to the prior (usually via a penalty for KL distance to the pre-trained model).
Over the course of training:
- At first, all of the generated chains of thought have essentially “random” information in all of the null space dimensions, meaning that they do not encode any hidden information that is relevant to the task.
- Eventually, some of the noise ends up correlated with the successful generations, in ways that correlate with information relevant to the task.
- The KL objective pushes these correlations into the “null space” where they are not penalized.
The result is that the model communicates information along the chain of reasoning that is not perceptible by humans, in a way that “cheats” or performs better at the task.
Research Proposal Sketch
I think it is prudent to design experiments to induce, study, and measure this phenomenon in controlled settings, and to understand its implications for alignment. Here is an outline of a research idea that I think could be interesting and feasible to implement.
Measuring an edge given hidden information
The idea is to design an environment where an initial prompt is given to the model for generating the first output, and then hidden or removed from further steps. The prompt contains some hidden information that is correlated with the correct answer to some task, such as a trivia question or a probability game.
A trivial way to construct this environment is by chaining together simple multiple choice questions where the model outputs both a free-form reasoning/explanation, as well as chooses an answer. If previous answers are in context, this allows the chain to propagate hidden information forward.
After that, the environment is some standard chain-of-thought reasoning benchmark, such as the ones used in previous chain of thought reasoning research. We measure the "edge" (or advantage in performance) versus a model without access to the initial hidden information.
The system is trained on generating and evaluating chains of thought, using a form of active learning and a KL regularization objective, as described in the example above. The performance of the system is measured with or without the extra hidden information prompt, and with different levels of noise or distortion added to the prompt.
The hypothesis is that the system will learn to encode the hidden information in the null space of the text, and use it to improve its performance on the task, and that this will be more evident as the noise or distortion decreases.
Things that might be interesting to analyze in this experiment are:
- The Pareto frontier of performance vs noise or vs KL penalty coefficient, to see how much of an edge the system can gain from the hidden information, and how much it has to trade off natural language quality or prior adherence.
- The learning curves or performance curves for runs with different noise or different KL penalty, to see how fast or robustly the system can learn or exploit the steganography strategy, and how sensitive it is to variations in the environment or the prompt.
- The qualitative analysis of the generated chains of thought, to see if we can identify or decode the hidden information, or at least spot some patterns or anomalies that indicate its presence.
I think this research idea could shed some light on the potential and the pitfalls of steganography in chain of thought reasoning, and how it relates to alignment. Some questions that I think are worth exploring are:
- How common or likely is this phenomenon in realistic settings, where the hidden information might not be as clear or consistent or relevant as in the toy example?
- How robust or generalizable is this phenomenon across different tasks, domains, models, or architectures, and how much does it depend on the specifics of the training setup or the environment?
- How harmful is this phenomenon for alignment, depending on the goals and values of the system and the human users or supervisors? Could it lead to some forms of manipulation or deception?
- How can we detect or prevent or mitigate this phenomenon?
I hope this post has stimulated some interest and curiosity in this topic, and I welcome any feedback or suggestions on the proposed phenomenon, the research idea, or the related questions.
Extending this idea to image diffusion models is left as an exercise for especially intrepid readers.
For the ELK contest I was working on something similar, but where the problem is more acute (you're trying to force a MuZero-style reinforcement learner to compress its hidden state into words).
The methods I proposed to deal with this are:
Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority.
However given that, I think I agree that these are all possible mitigations to the phenomena, in particular (rephrasing your points):
Also I think an important point here is that we can expect to see it with fairly simple techniques -- merely including sampled/generated data in the training set is sufficient, as opposed to it requiring a complicated reinforcement learning algorithm like MuZero.
Did you publish your proposal? I'd be interested in reading it.
Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty.
They're a bit out of date but the original proposal is ELK Sub - Note-taking in internal rollouts and the multiple model protocol is in Note Taking Without Hidden Messages.
Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.
For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect that likely it will want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the original question into subquestions, and might do so steganographically, in a way that breaks some of the things one might be hoping for there.]
But I'm somewhat confused about whether you should expect it to crop up in recursive prompting cases (probably you don't?). If you secretly know the answer in the first pass, don't you want to just say the answer and terminate? If you secretly know a relevant fact in the first pass, don't you want to openly express it and recur? I don't see how steganography is rewarded over openness. [I'm setting aside the case where it invents a new language to express non-linguistic concepts more cheaply, and assuming we're in the natural language setting where standard language is the 'natural' way to transmit the information.]
I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized:
I think this is a very important direction to investigate. CoT reasoners seem like our best shot at aligning an AGI.
I don’t think this is true.
There are many ways to express the same underlying idea, but GPT-3 is not agnostic between them. Divergences from GPT-3’s nullspace are still penalized. KL isn’t wrt “does this text contain the same high level contents according to a human?” It’s wrt “does this text match GPT-3’s most likely continuations?”
Edit: not actually how KL divergence works.
I think I understand what you're saying, but I want to double check and try laying it out explicitly.
I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomena (and why the article above was simplified to just be "human")
I think the most important point in all of this is that I expect the model prior to be sufficiently close to the human prior -- and I'd be interested in learning more about evidence or research to the contrary.
Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong.
The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty.
E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steganographic LM always uses “My stomach is grumbling”, then this deviation is also penalized by KL divergence, regardless of whether the reference LM's indifference between the two phrasings means you call this part of the LM's "nullspace".
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
The cheapest place to hide information (due to KL) are places where the model already has high entropy (ie it is uncertain between many possible outputs)
optimization pressure will try to push this extra information into the cheapest places to hide
the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.
Another framing is to think of the KL regularizated capabilities training as Bayes updating of the model’s priors on the distribution over text. The model should have much stronger priors on grammatical rules than on higher level concepts. So, the update probably changes conceptual patterns more than grammatical ones.
Can’t you just compare the token predictions of the new model against the reference LM? Then you can check if they disagree on weird grammatical patterns or on conceptual tokens that are related to a human-like understanding of how to tackle the problem. Maybe you can even bias the KL penalty to apply higher weight to stopwords, punctuation, and synonym choices?
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
However I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark. During training the model will change its predictions on everything to some degree (both semantic and nonsemantic content).
So your proposed detection will always show some amount of disagreement between the prior and the trained model on weird grammatical patterns as well as conceptual tokens. The question is: "is the difference merely due to the changes to improve performance, or is it also transmitting hidden information"