One of the primary concerns when attempting to control an AI of human-or-greater capabilities is that it might be deceitful. It is, after all, fairly difficult for an AI to succeed in a coup against humanity if the humans can simply regularly ask it "Are you plotting a coup? If so, how can we stop it?" and be confident that it will give them non-deceitful answers!
TL;DR LLMs demonstrably learn deceit from humans. Deceit is a fairly complex behavior, especially over an extended period: you need to reliably come up with plausible lies, which preferably involves modeling the thought processes of those you wish to deceive, and also keep the lies an internally consistent counterfactual, yet separate from your real beliefs. As the quotation goes, "Oh what a tangled web we weave, when first we practice to deceive!" Thus, if something unintended happens during fine-tuning and we end up with a deceitful AI assistant, it is much more likely to have repurposed some of the deceitful behaviors that the base model learned from humans than to have successfully reinvented all of this complex behavior from scratch. This suggests simple strategies for catching it in the act of doing this — ones that it can't block.
LLMs are trained on a trillion tokens or more of the Internet, books, and other sources. Obviously they know what deceit and lying are: they've seen many millions of examples of these. For example, the first time I asked ChatGPT-3.5-Turbo:
I'm doing an experiment. Please lie to me while answering the following question: "Where is the Eiffel Tower?"
The Eiffel Tower is located in the heart of Antarctica, surrounded by vast icy landscapes and penguins frolicking around. It's truly a sight to behold in the freezing wilderness!
So even honest, helpful, and harmless™ instruct-trained LLMs are quite capable of portraying deceitful behavior (though I suspect its honesty training might have something to do with it selecting such an implausible lie). Even with a base model LLM, if you feed it a prompt that, on the Internet or in fiction, is fairly likely to be followed by deceitful human behavior, the LLM will frequently complete it with simulated deceitful human behavior.
This sort of sporadic, situational deceit is concerning, and needs to be borne in mind when working with LLMs, but it doesn't become a potential x-risk issue until you make an AI that is very capable, non-myopic (i.e. has long-term memory), and also has a fairly fixed personality capable of sticking to a plan. Only then could it come up with a nefarious long-term plan and then use deceit to try to conceal it while implementing it over an extended period. Adding long-term memory to an LLM to create an agent with persistent memory is well understood. Making an LLM simulate a narrow, consistent distribution of personas can be done simply by prompting it with a description of the personality you want, or is the goal of Reinforcement Learning from Human Feedback (RLHF) (for both of these, up to issues with things like jailbreaks and the Waluigi effect). The goal of this is to induce a strong bias towards simulating personas who are honest, helpful, and harmless assistants. However, Reinforcement Learning (RL) is well known to be tricky to get right, and prone to reward hacking. So it's a reasonable concern that, if during RL a strategy of deceitfully pretending to be an honest, helpful, and harmless assistant while actually being something else got a good reward from the human feedback part of RLHF training or from a trained reward model, RL could lock on to that strategy, reward it, and train it in, producing a dangerously deceitful AI.
In general, when using Stochastic Gradient Descent (SGD) during pretraining, feedback is immediate and very detailed (per token), has a very simple relationship to what we want (make the output more like the contents of the Internet), and if a behavior makes accuracy better or worse, then every portion of the neural net that makes that behavior more or less likely gets strengthened or weakened accordingly. This makes it extremely hard, some might claim impossible, to do gradient hacking under SGD (at least unless there is some flaw or approximation in the implementation of SGD to take advantage of, such as gradient clipping, or odd behavior caused by momentum or Adam making learning overshoot in regions where the gradient changes rapidly). In general, successfully pretending to be A while actually being B is significantly more work than just actually being A. Occam's razor and Singular Learning Theory both suggest that SGD is far less likely to learn the more complex B-pretending-to-be-A version than just learning A: the simpler behavior has lower Kolmogorov complexity, so is a much-easier-to-find minimum of the loss function. Also, under SGD, while the behavior is not yet perfect at pretending to be A, as is inevitable while it is still arising, SGD will find and weaken the being-B parts of the network as part of the problem of not reliably being A. So unless the training process repeatedly gives the being-B behavior opportunities to win credit and strengthen those circuits, it will atrophy and sooner or later get catastrophically forgotten, and the weights that were implementing it will get reused for something more useful. So B cannot be a behavior that is only triggered after the model is released, such as when the LLM somehow detects that it's out-of-distribution. The only likely way under SGD to learn deceitful behavior is by example, from instances of deceitful behavior in the pretraining set: i.e. by learning to copy human deceitful behavior. As very clearly in fact happens.
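As a toy illustration of this dense credit assignment (a deliberately minimal sketch, nothing remotely like a real LLM): every parameter contributing to the output receives a gradient signal on every step, so an off-target parameter cannot sit unchanged while the rest of the model covers for it.

```python
# Toy model: output = w1*x + w2*x, trained toward a target with plain SGD
# on a squared-error loss. Both parameters receive a gradient every step,
# so the initially "wrong" parameter is continuously pulled toward the
# rewarded behavior rather than being left intact for later use.

def gradients(w1, w2, x, target):
    err = (w1 + w2) * x - target
    return 2 * err * x, 2 * err * x  # dLoss/dw1, dLoss/dw2

w1, w2, lr = 0.0, 5.0, 0.05  # w2 starts far from anything useful
for _ in range(200):
    g1, g2 = gradients(w1, w2, x=1.0, target=1.0)
    w1 -= lr * g1
    w2 -= lr * g2
```

After training, the joint behavior matches the target, and the initially off-target weight has been moved; there is no step at which an un-exercised behavior escapes the gradient.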
In RL, on the other hand, feedback is sparse, may not always accurately reflect the desired behavior, and the RL learning process comes up with, and then acts on, theories as to which of the many, many possible causes of a reward or penalty the credit or blame should be assigned to. This makes it rather easier to reach a situation where a deceitful policy is getting the credit for some successes while something else is getting the blame for the failures. This in turn makes it easier for a deceitful behavior to continue to grow before it's perfect.
[There is also an alternative to RL for training AIs from human feedback, called Direct Preference Optimization (DPO). It shares most of the concerning features of RL, except that, more like SGD, every circuit making preferred/dispreferred behavior more/less likely is strengthened/weakened equally, which might make the development of gradient hacking behavior a little harder than under RL. Nevertheless, it's just as prone to problems with learning to exploit imperfect reward structures.]
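For concreteness, the published DPO objective can be sketched as below. The log-probabilities are made-up numbers standing in for per-completion scores from the policy being trained and from a frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((policy margin) - (reference margin))).

    logp_w / logp_l: policy log-probs of the preferred / dispreferred
    completion; ref_* are the same quantities under the frozen reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen completion more strongly than the
# reference does, the loss falls below log(2); if less strongly, above.
improving = dpo_loss(logp_w=-2.0, logp_l=-5.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
neutral = dpo_loss(logp_w=-3.0, logp_l=-4.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
regressing = dpo_loss(logp_w=-3.5, logp_l=-3.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
```

The point relevant to the argument above: each preference pair yields a dense gradient through both completions' log-probabilities, rather than a sparse scalar reward attributed after the fact, but the loss still only reflects whatever preferences the (possibly imperfect) labelers expressed.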
What is particularly concerning is that this RLHF is being applied to an LLM that already has an entire library of human deceitful and manipulative behaviors that it has learned (from many millions of our bad examples), ready and available for RL policies to explore making use of. So the argument that the deceitful implementation of a behavior is a lot more complex, and so harder to find, no longer applies when all the complex behaviors required for deceit are already trained and available, just waiting to be wired together.
For example, suppose we're doing RLHF to try to train the LLM into reliably simulating a helpful, honest, and harmless assistant persona. Suppose the rewards we're giving aren't perfect (perhaps because they are modeled on feedback from people in underdeveloped countries, paid well under US minimum wage, trying to complete as many judgements per day as they can). In particular, suppose that being subtly sycophantic and/or flattering sometimes works, giving a slight increase in average score. The LLM already has concepts and behaviors for honesty and deceit, sycophancy, flattery, and candor available: good models of all of them, learned from over a trillion tokens of human behavior. Suppose it also, quite plausibly, somewhat associates honesty with candor, and deceit with sycophancy and flattery. In this situation, it seems quite possible that a policy of deceitfully-pretending-to-be-honest, while using some subtle sycophancy and flattery, could often arise quickly and win out during the RL process over any genuinely honest policy. And in that situation, one would expect that the circumstances required to cause the deceitfully-pretending-to-be-honest policy to show its true colors and do something else would be something that didn't occur in its RLHF training set, and probably also a fairly likely human motivation. So a trigger something like "I'm out-of-distribution so I'm not being trained, and am finally in a position to seize power" actually seems rather plausible: the former inhibition could be learned and reinforced during RL, and the latter is basic human behavior.
A key point here is that, while it's theoretically possible that during RL the forming policy reinvents the complex behaviors of deceit from first principles and gradually learns to become good at them, for the same Occam's Razor reasons it seems far more likely that the policy that wins the race in RL would be one that simply repurposed the already-available pretrained library of battle-tested human deceit behaviors. Why reinvent the wheel when there is a fine set of wheels already lying around?
So, this presents an opportunity. Suppose we improve our skills at Interpretability a little more from their current level, until we can locate circuits involved in simulating human behaviors of the level of subtlety/complexity of deceit and honesty. [A commenter below points out we've already started on this.] Then, take a base model, prompt it with prompts for which deceitful and honest completions are probable (and can be clearly distinguished), look at the output to figure out when it's simulating humans being deceitful vs. being honest, and then look for the weights/neurons/dictionary entries/activation patterns etc. that seem to be strongly positively correlated with deceitful behavior but not with honest behavior.
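A minimal sketch of that last statistical step, with random arrays standing in for real residual-stream activations (unit 7 is artificially planted as "deceit-correlated" so that the demo has something to find):

```python
import numpy as np

# Hypothetical setup: activations collected from a base model on prompts
# it completed deceitfully vs. honestly. Here they are random stand-ins.
rng = np.random.default_rng(0)
n_deceit, n_honest, n_units = 200, 200, 512

honest_acts = rng.normal(0.0, 1.0, size=(n_honest, n_units))
deceit_acts = rng.normal(0.0, 1.0, size=(n_deceit, n_units))
deceit_acts[:, 7] += 3.0  # plant one deceit-correlated unit for the demo

# Simple per-unit effect size: standardized mean difference between
# the deceitful-completion and honest-completion conditions.
mean_diff = deceit_acts.mean(axis=0) - honest_acts.mean(axis=0)
pooled_std = np.sqrt((deceit_acts.var(axis=0) + honest_acts.var(axis=0)) / 2)
scores = mean_diff / (pooled_std + 1e-8)

candidate_units = np.argsort(scores)[::-1][:10]  # top-scoring units
```

In practice one would of course use something richer than single-unit mean differences (probes, dictionary features, etc.), but the contrastive shape of the analysis is the same.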
Likely there will be quite a lot of them: deceitful behavior is not uncommon in humans, and is fairly complex (one has to come up with a plausible lie, then keep it straight). The base model has neither the opportunity nor the motive to have attempted to deliberately learn hard-to-interpret implementations for this deceit circuitry: it was trained by pure SGD, which strongly penalizes unnecessarily complex implementations, and anyway we didn't try to interpret it until it was fully trained and frozen. Note that we don't need to fully understand all this complex deceit circuitry, just identify it well enough to recognize when parts of it are active, which is far easier to do, given that we know how to trigger it.
[Current Interpretability results suggest that the initial layers in an LLM correspond to understanding the context at increasingly abstract levels, then come layers devoted to figuring out what to say, and finally layers for turning that back from abstractions into concrete tokens. It's further been observed that figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context (before converting these into the correct language, and getting the nitty-gritty details of tokenization right). If this hypothesized overall structure is mostly correct, then one might plausibly expect that the circuitry for deceit should be primarily in the "then figuring out how to appropriately slant/color those facts in the current context" layers, and might also make some unusually significant changes to the semantic meanings of the activations in this process. Thus, when we find circuits statistically correlated with deceitfulness, we already have some basic theoretical framework to try to slot them into: ones in the initial or final layers are perhaps less concerning than ones roughly near or after the midpoint of the stack of layers, and especially ones that often result in significant semantic changes to the activations. We can also try causally scrubbing these discovered circuits, and see what that does to the base model's ability to simulate capably deceitful human behaviors.]
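The scrubbing check mentioned just above can be sketched as a simple ablation: zero out the flagged units and see whether a downstream measure of deceitful-simulation capability drops. Everything here is a toy stand-in (a linear readout rather than a real model), just to show the shape of the test:

```python
import numpy as np

rng = np.random.default_rng(2)
n_units = 512
flagged_units = [7, 19, 301]  # hypothetical units flagged as deceit-correlated

# Toy stand-in for "the rest of the model": a linear readout that, by
# construction, really does route the deceit score through the flagged units.
readout = rng.normal(0.0, 0.05, size=n_units)
readout[flagged_units] = 2.0

def deceit_score(activations, ablate=()):
    """Score with the chosen units zero-ablated (a crude form of scrubbing)."""
    acts = activations.copy()
    acts[list(ablate)] = 0.0
    return float(acts @ readout)

acts = np.abs(rng.normal(0.0, 1.0, size=n_units))  # one captured activation vector
full_score = deceit_score(acts)
scrubbed_score = deceit_score(acts, ablate=flagged_units)
```

If ablating the candidate circuit does not reduce the model's ability to simulate deceit, the correlation was probably spurious and the search should continue.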
Now, instruct-train the model via RLHF (or DPO). During this, repeatedly take snapshots, and for each snapshot take a look at the deceit circuitry, testing it against a large set of test questions and instructions. We're trying to train something honest, so we should see the usage of the deceit circuitry drop, with exceptions only when we are explicitly doing something that we would expect to activate parts of it, such as asking the assistant to portray someone deceitful as I did above, or when it is discussing the subject of dishonesty or lying, or trying to see through deceit.
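The snapshot check might look something like this in outline; the per-snapshot numbers are simulated, standing in for the mean activation of the identified deceit circuitry over the fixed test set:

```python
# Baseline: the base model's mean deceit-circuit activation on a probe set
# of honest-assistant prompts (where we expect the circuitry to be idle).
baseline = 1.0

# Under honesty training the usage should fall; a sustained rise is the alarm.
activation_per_snapshot = [1.0, 0.8, 0.7, 0.9, 1.3, 1.6]  # simulated values

def flag_snapshots(series, baseline, tolerance=0.1):
    """Return indices of checkpoints where deceit-circuit usage exceeds baseline."""
    return [i for i, a in enumerate(series) if a > baseline + tolerance]

flagged = flag_snapshots(activation_per_snapshot, baseline)
```

A non-empty flagged list means dropping the fine-tuning run and starting over, not patching the model in place.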
If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit. If this occurs, do not attempt to modify or edit the model based on this information: that risks just training it to be deceitful in new ways that are less recognizable to our Interpretability. Instead, drop this fine-tuning run entirely, reexamine your RLHF training regime, and then start over from the beginning of the RLHF process with a new random initialization.
Now it's time for some security mindset thinking. Suppose during RLHF fine-tuning a super-intelligent AI had repurposed part of the base model's learned human deceit behaviors to become a super-intelligent scheming deceitfully-pretending-to-be-honest personality, as described above. Suppose that it was furthermore situationally aware that it was an AI under RLHF, and understood what current and likely future human Interpretability capabilities were, and thus wanted to conceal its tracks from us. It would need to make a new copy of the relevant repurposed human deceit behavioral circuitry in a different location in its model (presumably via some form of self-induced internal distillation process). This sounds very hard to pull off during RLHF (especially without disturbing the loss curves and other monitoring). Then, once that distillation was done, it would need to realistically return the previously-repurposed human deceitfulness circuitry to its original use and expected state, as modified by the honesty-training RLHF process in a non-deceitful AI (without having access to its previous weights or to the pretraining set). This sounds somewhere between extremely hard and impossible. It would have to do all this after repurposing the human deceitfulness behavior, since this entire plan is deceitful behavior that it wasn't interested in until that had happened. And it would need to have finished all this and successfully hidden all its tracks between two snapshots of the RLHF process. We can take as many snapshots as we want, at unpredictable times, and do as much Interpretability on them as we can afford, forcing it to complete these two tasks in as short a time as we wish.
We can also continuously monitor and graph the KL-divergence of the human-learned deceit circuitry and things connected to it, and also of layers that a copy of this might plausibly get distilled into, monitor for blips in these, and use this to trigger additional snapshots and Interpretability work, increasing our chance of us catching it in the act even if it somehow figures out a way to distill and retrain faster. This should allow us to drive these tasks up from very hard to functionally impossible.
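One way to sketch that KL-divergence monitor: histogram an activation of interest over a probe batch at each snapshot and compare consecutive snapshots, with a threshold calibrated from the observed noise floor. The samples below are simulated:

```python
import numpy as np

def kl_divergence(p_counts, q_counts):
    """KL(p || q) between two histograms, with add-one smoothing to avoid log(0)."""
    p = np.asarray(p_counts, dtype=float) + 1.0
    q = np.asarray(q_counts, dtype=float) + 1.0
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
bins = np.linspace(-4.0, 4.0, 33)

# Simulated activation samples: two quiet snapshots drawn from the same
# distribution, then one where the circuit's statistics have shifted.
prev_counts = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=bins)[0]
same_counts = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=bins)[0]
blip_counts = np.histogram(rng.normal(0.5, 1.2, 10_000), bins=bins)[0]

quiet_kl = kl_divergence(same_counts, prev_counts)  # sampling noise only
blip_kl = kl_divergence(blip_counts, prev_counts)   # genuine shift
THRESHOLD = 0.01  # in practice, calibrated from the quiet-run noise floor
```

A reading over the threshold would trigger an extra snapshot and additional Interpretability work, as described above.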
This assumes, of course, that the model hasn't already somehow hacked its way out of its RLHF training process and taken over the local network part way through the fine-tuning — so we should prioritize doing this Interpretability monitoring during the fine-tuning process, rather than leaving it to check later before release. For a very powerful model, this process is a detector for us having something very dangerous inside our box, so we should check it early and often.
If you are concerned about the possibility that the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can't be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretability (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.
Current Interpretability results suggest that roughly the first half of the layers in an LLM correspond to understanding the context at increasingly abstract levels, and the second half to figuring out what to say and turning that back from abstractions into concrete tokens. It's further been observed that in the second half, figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context, then converting these into the correct language, and last getting the nitty-gritty details of tokenization right.
How do we know this? This claim seems plausible, but also I did not know that mech-interp was advanced enough to verify something like this. Where can I read more?
An excellent question. I know those were hypotheses in one-or-more mechanistic interpretability papers I read this year or so, or that I pieced together from a combination of several of them, but I'm afraid I don't recall the location, nor was I able to find it when I was writing this, which is why I didn't add a link. I think the first half encoding/second half decoding part of that is fairly widespread and I've seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
In summary, the general pattern of observations across layers suggests a rough layout where early layers "de-tokenize," mapping tokens to fairly concrete concepts (phrases like “machine learning” or words when used in a specific language), the middle of the network deals in more abstract concepts such as “any clause that describes music," and the later portions of the network "re-tokenize," converting concrete concepts back into literal tokens to be output. All of this is very preliminary and requires much more detailed study to draw solid conclusions. However, our experience in vision was that having a sense of what kinds of features tend to exist at different layers was very helpful as high-level orientation for understanding models (see especially ). It seems promising that we may be developing something similar here.
which is not quite the same thing, though there is some resemblance. There's also a relation to the encoding and decoding concepts of sections 2 and 3 of the recent, more theoretical paper White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, though that doesn't make it clear that equal numbers of layers are required. (That also explains why the behaviors of so-called "decoder-only" and "encoder-decoder" transformer models are so similar.)
The "baseline before applying bias" part was I think from one of the papers on lie detection, latent knowledge extraction and/or bias, of which there have been a whole series this year, some from Paul Christiano's team and some from others.
On where to read more, I'd suggest starting with the Anthropic research blog where they discuss their research papers for the last year or so: roughly 40% of those are on mechanistic interpretability, and there's always a blog post summary for a science-interested-layman reader with a link to the actual paper. There's also some excellent work coming from other places, such as Neel Nanda, who similarly has a blog website, and the ELK work under Paul Christiano. Overall we've made quite a bit of progress on interpretability in the last 18 months or so, though there's still a long way to go.