Deceptive AI ≠ Deceptively-aligned AI

Steven Byrnes

7th Jan 2024 (7 min read, 14 comments)

Tl;dr: A “deceptively-aligned AI” is different from (and much more specific than) a “deceptive AI”. I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting confused about it sometimes, so this post is a brief explanation of how they differ. You can just look at the diagram below for the upshot.

Some motivating context: There have been a number of recent arguments that future AI is very unlikely to be deceptively-aligned. Others disagree, and I don’t know which side is right. But I think it’s important for non-experts to be aware that this debate is not about whether future powerful AI is likely to engage in deliberate deception. Indeed, while the arguments for deceptive alignment are (IMO) pretty complex and contentious, I will argue that there are very much stronger and more straightforward reasons to expect future powerful AI to be deceptive, at least sometimes, in the absence of specific interventions to avoid that.

1. Definitions

Deceptive alignment is a particular scenario where:

A “ground-truth system” (possibly individual human evaluators, or possibly an automated system of some sort) provides an ML model with training signals (rewards if this is reinforcement learning (RL), supervisory ground truth signals if this is supervised or self-supervised learning (SL)),
The AI starts emitting outputs that humans might naively interpret as evidence that training is going as intended—typically high-reward outputs in RL and low-loss outputs in SL (but a commenter notes here that “evidence that training is going as intended” is potentially more nuanced than that).
…but the AI is actually emitting those outputs in order to create that impression—more specifically, the AI has situational awareness and a secret desire for some arbitrary thing X, and the AI wants to not get updated and/or it wants to get deployed, so that it can go make X happen, and those considerations lie behind why the AI is emitting the outputs that it’s emitting.

(For example, maybe the AI would reason that emitting high-reward or low-loss outputs would minimize the extent to which ongoing training will change its inclinations, which it would probably see as bad by the usual instrumental-convergence argument. So then it would seem to be performing well, but it’s performing well for problematic reasons. For another thing, the AI might have secret desires that can only be fulfilled if the humans deploy it into the world, and might reason that emitting certain outputs would make the humans more likely to deploy it.)

(I’m not trying hard to explain this part very well; if you’re confused, try reading the original source where the term “deceptive alignment” was coined in 2019, or Joe Carlsmith’s report, or many other discussions.)

By contrast, deception is much broader—it’s any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.

Thus, deceptive alignment would be a special case of deception—namely, the case where deception occurs in the context of model training, and involves the AI emitting (typically) low-loss / high-reward outputs, in order to hide its secret ulterior motives, and to create a false impression that the training run is going as planned.

2. Very simple example of “deception” that is not “deceptive alignment”

Suppose I use RL to train an AI to make money, and that I do so in the most obvious way possible—I give the AI an actual real-world bank account, and set its RL reward signal to be positive whenever the account balance goes up, and negative when the account balance goes down.
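To make that setup concrete, here is a minimal sketch of such a reward signal. The function name `balance_reward` and the fixed reward values of +1 and -1 are illustrative assumptions; the setup described above only specifies the sign of the reward, not its magnitude.

```python
def balance_reward(previous_balance: float, current_balance: float) -> float:
    """Hypothetical reward: +1 if the balance went up, -1 if it went down, 0 if unchanged."""
    delta = current_balance - previous_balance
    if delta > 0:
        return 1.0
    if delta < 0:
        return -1.0
    return 0.0
```

The signal only sees account balances; it carries no information about how the money was obtained.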

If I did this today, the trained model would probably fail to accomplish anything at all. But let us suppose that future RL techniques will work better than today’s, such that this training would lead to an AI that starts spear-phishing random people on the internet and tricking them into wiring money into the AI’s bank account.

Such an AI would be demonstrating “deception”, because its spear-phishing emails are full of deliberate lies. But this AI would probably not be an example of “deceptive alignment”, per the definition above.

For example, deceptive alignment requires situational awareness by definition. But the AI above could start spear-phishing even if it isn’t situationally aware—i.e., even if the AI does not know that it is an AI, being updated by RL Algorithm X, set up by the humans in Company Y, and those humans are now watching its performance and monitoring Metrics A, B, and C, etc.

(That previous paragraph is supposed to be obvious—it’s no different from the fact that humans are perfectly capable of spear-phishing even when they don’t know anything about neuroscience or evolution.)
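The distinction can also be written as a pair of toy predicates. This is only a sketch of the definitions in Section 1 as stated in this post; the `Behavior` record and all of its field names are hypothetical labels invented for illustration, and real systems will not come with such clean boolean tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """Hypothetical yes/no description of one AI's behavior (illustration only)."""
    knowingly_misleads_humans: bool    # gives humans info it knows is false or misleading
    situationally_aware: bool          # knows it is an AI being trained and monitored
    has_secret_goal: bool              # secretly wants some arbitrary thing X
    plays_along_to_protect_goal: bool  # emits good-looking outputs to avoid updates or get deployed


def is_deceptive(b: Behavior) -> bool:
    # Broad category: any knowing deception of a human.
    return b.knowingly_misleads_humans


def is_deceptively_aligned(b: Behavior) -> bool:
    # Narrow special case: deception aimed at the training process itself.
    return (
        is_deceptive(b)
        and b.situationally_aware
        and b.has_secret_goal
        and b.plays_along_to_protect_goal
    )


# The spear-phishing money-maker from this section,
# under the assumption that it lacks situational awareness.
spear_phisher = Behavior(
    knowingly_misleads_humans=True,
    situationally_aware=False,
    has_secret_goal=False,
    plays_along_to_protect_goal=False,
)

print(is_deceptive(spear_phisher))            # True
print(is_deceptively_aligned(spear_phisher))  # False
```

Every behavior that satisfies `is_deceptively_aligned` also satisfies `is_deceptive`, but not the other way around.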

3. I think we should strongly expect future AIs to sometimes be deceptive (in the absence of a specific plan to avoid that), even if “deceptive alignment” is unlikely

There is a lively ongoing debate about the likelihood of “deceptive alignment”—see for example Evan Hubinger arguing that deceptive alignment is likely, DavidW arguing that deceptive alignment is extremely unlikely (<1%), and Joe Carlsmith’s 127-page report somewhere in between (“roughly 25%”), and more at this link. (These figures are all “by default”, i.e. in the absence of some specific intervention or change in training approach.)

I don’t know which side of that debate is right.

But “deception” is a much broader category than “deceptive alignment”, and I think there’s a very strong and straightforward case that, as we make increasingly powerful AIs in the future, if those AIs interact with humans in any way, then they will sometimes be deceptive, in the absence of specific interventions to avoid that. As three examples of how such deception may arise:

- If humans are part of the AI’s training environment (example: reinforcement learning in a real-world environment): The spear-phishing example above was a deliberately extreme case for clarity, but the upshot is general and robust: if an AI is trying to accomplish pretty much anything, and it’s able to interact with humans while doing so, then it will do a better job if it’s open to being strategically deceptive towards those humans in certain situations. Granted, sometimes “honesty is the best policy”. But that’s just a rule-of-thumb which has exceptions, and we should expect the AI to exploit those exceptions, just as we expect future powerful AI to exploit all the other affordances in its environment.
- If the AI is trained by human imitation (example: self-supervised learning of internet text data, which incidentally contains lots of human dialog): Well, humans deceive other humans sometimes, and that’s likely to wind up in the training data, so such an AI would presumably wind up with the ability and tendency to be occasionally deceptive.
- If the AI’s training signals rely on human judgments (as in RLHF): This training signal incentivizes the AI to be sycophantic, i.e. to tell the judge what they want to hear, pump up their ego, and so on, which (when done knowingly and deliberately) is a form of deception. For example, if the AI knows that X is true, but the human judge sees X as an outrageous taboo, then the AI is incentivized to tell the judge “oh yeah, X is definitely false, I’m sure of it”.
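The RLHF case can be illustrated with a toy model of the incentive. Everything here is a hypothetical simplification: a single yes/no claim X, a judge who rewards agreement with their own belief, and a policy that simply picks the highest-reward answer. Real RLHF pipelines are far more involved; this sketch only shows which way the pressure points.

```python
X_IS_TRUE = True  # the fact of the matter; deliberately unused by the reward below


def judge_reward(answer_says_x_is_true: bool, judge_believes_x: bool) -> float:
    """Hypothetical human judge: rewards answers that match the judge's own belief."""
    return 1.0 if answer_says_x_is_true == judge_believes_x else -1.0


def reward_maximizing_answer(judge_believes_x: bool) -> bool:
    """Pick whichever answer (True = 'X is true', False = 'X is false') scores highest."""
    return max([True, False], key=lambda a: judge_reward(a, judge_believes_x))


# The judge treats X as taboo and believes it is false, so the best-scoring answer denies X.
print(reward_maximizing_answer(judge_believes_x=False))  # False
```

The truth of X never appears in `judge_reward`, so the reward-maximizing answer tracks the judge's belief rather than the facts.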

Again, my claim is not that these problems are unavoidable, but rather that they are expected in the absence of a specific intervention to avoid them. Such interventions may exist, for all I know! Work is ongoing. For the first bullet point, I have some speculation here about what it might take to generate an AI with an intrinsic motivation to be honest; for the second bullet point, maybe we can curate the training data; and the third bullet point encompasses numerous areas of active research, see e.g. here.

(Separately, I am not claiming that AIs-that-are-sometimes-deceptive is a catastrophically dangerous problem and humanity is doomed. I’m just making a narrow claim.)

Anyway, just as one might predict from the third bullet point, today’s LLMs are indeed at least somewhat sycophantic. So, does that mean that GPT-4 and other modern LLMs are “deceptive”? Umm, I’m not sure. I said in the third bullet point that sycophancy only counts as “deception” when it’s “done knowingly and deliberately”—i.e., the AI explicitly knows that what it’s saying is false or misleading, and says it anyway. I’m not sure if today’s LLMs are sophisticated enough for that. Maybe they are, or maybe not, I don’t know. An alternative possibility is that today’s LLMs are sincere in their sycophancy. Or maybe even that would be over-anthropomorphizing. But anyway, even if today’s LLMs are sycophantic in a way that does not involve deliberate deception, I expect that this is only true because of AI capability limitations, and these limitations will presumably go away as AI technology advances.

Bonus: Three examples that spurred me to write this post

I myself have felt confused about this distinction a couple times, at least transiently.
I was just randomly reading this comment, where an expert used the terms “deception” and “deceptive agents” in a context where (I claim) they should have said “deceptive alignment” and “deceptively-aligned agents” respectively. Sure, I knew what they meant, but I imagine some people coming across that text would get the wrong idea. I’m pretty sure I’ve seen this kind of thing numerous times.
Zvi Mowshowitz here was very confused by claims (by Joe Carlsmith, Quintin Pope, and Nora Belrose) that AI was unlikely to be deceptively-aligned—instead, Zvi’s perspective was “I am trying to come up with a reason this isn’t 99%?” I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”. For example, consider what Joe Carlsmith calls “training gamers”, or what Zvi might call “an AI with the terminal goal of ‘guessing the teacher’s password’”. Joe would call that a negative example (i.e., not deceptive alignment, or in his terminology “not scheming”), but maybe Zvi would call that a positive example because the AI is misaligned and deceptive.

(Thanks Seth Herd & Joe Carlsmith for critical comments on a draft.)

Buck Shlegeris (4mo)

Thanks for writing this; I agree with most of what you’ve said. I wish the terminology was less confusing.

One clarification I want to make, though:

You describe deceptive alignment as being about the model taking actions so that the reward-generating process thinks that the actions are good. But most deceptive alignment threat models involve the model more generally taking actions that cause it to grab power later.

Some examples of such actions that aren’t getting better train loss or train-time reward:

- If you do evaluations to see whether your model sometimes takes bad actions, the model might take actions that look good there even though you’re not going to train on the model’s actions.
- The model might sandbag. For example, you might use RL to try to train the model to trick a trusted model which you’re using to monitor it. A deceptively aligned model might not try its best here, even though this means it’s getting worse reward.

Steve Byrnes (4mo)

Thanks! OK, I just edited that part, is it better / less bad?

Ryan Greenblatt (4mo)

This is part of the motivation for using the name "Scheming" instead of deceptive alignment.

I'm not sure it's that much better in terms of avoiding confusion though...

Joe_Collman (4mo)

…but the AI is actually emitting those outputs in order to create that impression—more specifically, the AI has situational awareness

I think it's best to avoid going beyond the RFLO description.

In particular, it is not strictly required that the AI be aiming to "create that impression", or that it has "situational awareness" in any strong/general sense.

Per footnote 26 in RFLO (footnote 7 in the post):
"Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to."

It needs to be:
Modeling the optimization pressure.
Adapting its responses to that optimization pressure.

Saying more than that risks confusion and overly narrow approaches.
By all means use things like "in order to create that impression" in an example. It shouldn't be in the definition.

Samuel Dylan Martin (4mo)

If you want a specific practical example of the difference between the two: we now have AIs capable of being deceptive when not specifically instructed to do so ('strategic deception') but not developing deceptive power-seeking goals completely opposite what the overseer wants of them ('deceptive misalignment'). This work from Apollo Research on Strategic Deception is the former, not the latter:

https://www.apolloresearch.ai/research/summit-demo

Oliver Sourbut (4mo)

This is great, and thanks for pointing at this confusion, and raising the hypothesis that it could be a confusion of language! I also have this sense.

I'd strongly agree that separating out 'deception' per se is importantly different from more specific phenomena. Deception is just, yes, obviously this can and does happen.

I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be more than 'mere' deception, if it's deception of operators or other-nominally-in-charge-people regarding the intentions (goals, objectives, etc) of the system. Also doesn't need to be 'net internal' or anything like that.

I think what you're pointing at here by 'deceptive alignment' is what I'd call 'training hacking', which is more specific. In my terms, that's deceptive alignment of a training/update/selection/gating/eval process (which can include humans or not), generally construed to be during some designated training phase, but could also be ongoing.

No claim here to have any authoritative ownership over those terms, but at least as a taxonomy, those things I'm pointing at are importantly distinct, and there are more than two of them! I think the terms I use are good.

Joe_Collman (4mo)

I think the broader use is sensible - e.g. to include post-training.

However, I'm not sure how narrow you'd want [training hacking] to be.
Do you want to call it training only if NN internals get updated by default? Or just that it's training hacking if it occurs during the period we consider training? (otherwise, [deceptive alignment of a ...selection... process that could be ongoing], seems to cover all deceptive alignment - potential deletion/adjustment being a selection process).

Fine if there's no bright line - I'd just be curious to know your criteria.

Noosphere89 (4mo)

I agree with the claim that deception could arise without deceptive alignment, and mostly agree with the post, but I do still think it's very important to recognize that if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.

Seth Herd (4mo)

What do you mean by "when deceptive alignment fails to work"? I'm confused.

Steve Byrnes (4mo)

I think Noosphere89 meant to say “when deceptive alignment doesn’t happen” in that sentence. (They can correct me if I’m wrong.)

Anyway, I think I’m in agreement with Noosphere89 that (1) it’s eminently reasonable to try to figure out whether or not deceptive alignment will happen (in such-and-such AI architecture and training approach), and (2) it’s eminently reasonable to have significantly different levels of overall optimism or pessimism about AI takeover depending on the answer to question (1). I hope this post does not give anyone an impression contrary to that.

Noosphere89 (4mo)

Yep, that's what I was talking about, Seth Herd.

quetzal_rainbow (4mo)

I think it's confusing because we mostly care about the outcome "we mistakenly think that the system is aligned, deploy it, and get killed", not about the particular mechanism that produces this outcome.

Dumb example: let's suppose that we train a system to report its own activity. Human raters consistently assign higher reward to more polite reports. In the end, the system learns to produce reports so polite and smooth that human raters have a hard time catching any signs of misalignment in them, and take it for an aligned system.

We have, on the one hand, a system that is superhumanly good at producing the impression of being aligned; on the other hand, it's not like it's very strategically aware.

Vladimir Nesov (3mo)

I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”.

Deceptive alignment is more centrally a special case of being trustworthy (what the "alignment" part of "deceptive alignment" refers to), not of being deceptive. In a recent post, Zvi says:

We are constantly acting in order to make those around us think well of us, trust us, expect us to be on their side, and so on. We learn to do this instinctually, all the time, distinct from what we actually want. Our training process, childhood and in particular school, trains this explicitly, you need to learn to show alignment in the test set to be allowed into the production environment, and we act accordingly.

A human is considered trustworthy rather than deceptively aligned when they are only doing this within a bounded set of rules, and not outright lying to you. They still engage in massive preference falsification, in doing things and saying things for instrumental reasons, all the time.

My model says that if you train a model using current techniques, of course exactly this happens.

Joe_Collman (4mo)

By contrast, deception is much broader—it’s any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.

This description allows us to classify every output of a highly capable AI as deceptive:
For any AI output, it's essentially guaranteed that a human will update away from the truth about something. A highly capable AI will be able to predict some of these updates - thus it will be "knowingly providing ... misleading information".

Conversely, we can't require that a human be misled about everything in order to classify something as deceptive - nothing would then qualify as deceptive.

There's no obvious fix here.

Our common-sense notion of deception is fundamentally tied to motivation:

A teacher says X in order to give a student a basic-but-flawed model that's misleading in various ways, but is a step towards deeper understanding.
- Not deception.
A teacher says X to a student in order to give them a basic-but-flawed model that's misleading in various ways, to manipulate them.
- Deception.

The student's updates in these cases can be identical. Whether we want to call the statement deceptive comes down to the motivation of the speaker (perhaps as inferred from subsequent actions).

In a real world context, it is not possible to rule out misleading behavior: all behavior misleads about something.
We can only hope to rule out malign misleading behavior. This gets us into questions around motivation, values etc (or at least into much broader considerations involving patterns of behavior and long-term consequences).

(I note also that requiring "knowingly" is an obvious loophole - allowing self-deception, willful ignorance or negligence to lead to bad outcomes; this is why some are focusing on truthfulness rather than honesty)
