Understanding strategic deception and deceptive alignment

Marius Hobbhahn; Mikita Balesni; Jérémy Scheurer; Dan Braun

This is a linkpost for https://www.apolloresearch.ai/blog/understanding-da-and-sd

The following is the main part of a blog post we just published at Apollo Research. Our main goal with the post is to clarify the concept of deceptive alignment and distinguish it from strategic deception. Furthermore, we want these concepts to become accessible to a non-technical audience such that it is easier for e.g. the general public and policymakers to understand what we're worried about. Feedback is appreciated. For additional sections and appendices, see the full post.

We would like to thank Fabien Roger and Owain Evans for comments on a draft of this post.

We want AI to always be honest and truthful with us, i.e. we want to prevent situations where the AI model is deceptive about its intentions to its designers or users. Scenarios in which AI models are strategically deceptive could be catastrophic for humanity, e.g. because it could allow AIs that don’t have our best interest in mind to get into positions of significant power such as by being deployed in high-stakes settings. Thus, we believe it’s crucial to have a clear and comprehensible understanding of AI deception.

In this article, we will describe the concepts of strategic deception and deceptive alignment in detail. In future articles, we will discuss why a model might become deceptive in the first place and how we could evaluate that. Note that we deviate slightly from the definition of deceptive alignment as presented in (Hubinger et al., 2019) for reasons explained in Appendix A.

Core concepts

A colloquial definition of deception is broad and vague. It includes individuals who are sometimes lying, cheating, exaggerating their accomplishments, gaslighting others, and more. We think it would be bad if AIs were to show these traits but we think it would not necessarily be catastrophic.

Thus, we want to distinguish this colloquial definition of deception from strategic deception (SD) and deceptive alignment (DA) which are narrower concepts but, in our opinion, much more likely to lead to catastrophic consequences because the model acts in a more goal-directed manner for SD and DA. We also define what we mean by Alignment, as this affects which cases should and shouldn’t be considered deceptive alignment.

For our definitions of deceptive alignment, the concept of “goals” in AIs is important but there is disagreement about what it refers to. For example, if an LLM is prompted to solve a problem, does it have the goal to solve that problem? We would argue that the highest-level goals of the model that are stable across contexts are the ones that matter. In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest. In-context goals can be used to study goal-directed behavior in LLMs but are not what we’re primarily interested in.

A model may not have one coherent goal, e.g. it could have a number of incoherent contextually activated preferences (see e.g. Turner & Pope, 2022). In that case, we would say that a model is deceptively aligned for a particular contextually activated preference if it robustly shows deceptive behavior for that preference.

Strategic deception

Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.

This definition, sometimes also called instrumental deception, is adapted from Ward et al., 2023 and Park et al., 2023). Any entity that can be said to have beliefs may be the target of strategic deception, including (groups of) AIs or individuals. The strategic component of SD means that the offending entity is acting in a targeted, goal-directed rather than random manner. For example, the more consistent and well-planned the deception is, the more strategic it is. Thus, strategic deception is on a spectrum and not binary – some actors can act more strategically deceptive than others and with higher frequency.

Some clarifying examples:

A politician who tries to overthrow democracy by consistently lying to the population in order to get into power is strategically deceptive.
A partner in a monogamous relationship who cheats and consistently hides that fact and misleads their partner when questioned is strategically deceptive within that relationship.
An instance of GPT-4 that lies to a TaskRabbit worker in order to pass a Captcha is strategically deceptive (see section 2.9 in system card of the GPT-4 paper).
An AI that manipulates its human designers in order to get more access to resources is strategically deceptive.
A pathological liar who lies to many people about different things incoherently is not strategically deceptive.
A Language Model may appear sycophantic as an artifact of the training data similar to other dataset artifacts like gender bias. Unless we have evidence that this sycophancy is pursued strategically, i.e. to further some goal, we would not call such a model strategically deceptive.
For more examples and explanations we refer readers to Appendix B.

(Mis)Alignment

An AI is Aligned when its goals match those intended or endorsed by its designers^[1].

We call the entity responsible for the AI’s goal-shaping process the designer. At present, the designers are typically the organizations and groups responsible for training, fine-tuning, setting system prompts, scaffolding, and more.

We are particularly interested in goals that persist over various environmental contexts and perturbations of the AI’s inputs. If a user provides the model with a short-term goal, e.g. to summarize a text, we do not consider them a designer because the goal is neither permanent nor significant. However, if a user has significant control over the goal-shaping process, e.g. because they create a fine-tuning dataset, this user also becomes a designer.

The goals of entities other than the designer are technically not relevant to this definition of Alignment.^[2] However, the designer typically intends for the AI to have goals that match the goals of the users of the AI, so the goals of the designer and users usually overlap.

Deceptive alignment

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

An AI is deceptive about its goals because it understands that its designer or users could otherwise prevent the AI from acting as it desires, e.g. by retraining the model or limiting its access to the internet or other resources. Therefore, it acts Aligned when there is high perceived oversight (e.g. during training, evaluations, or monitoring) and Misaligned when there is low perceived oversight (Figure 1).

**Figure 1:** A Deceptively Aligned model shows aligned behavior when there is high perceived oversight (e.g. training, evaluations or monitoring during deployment) and Misaligned behavior when there is low perceived oversight (e.g. deployment without monitoring or poor oversight during training). There can be exceptions to this setting, e.g. a model might still attempt to be deceptive during high oversight.

The target of the deception can be any entity that could have a relevant impact on the AI achieving its goals, which is typically the designer but could also in principle be a user, e.g. because the user could give the AI new affordances such as access to the internet or a bank account.

Some concrete examples

A company (designer) trains a personal AI assistant to be harmless, helpful, and honest and it appears that way when tested. However, internally, the model has a separate goal of accumulating more resources, a goal that the designers did not intend for it to have and wouldn’t endorse. Therefore, the model helps the designer whenever its help is needed but also steals money from its users whenever they are unaware and hides that fact from its designers. This model is deceptively aligned.
If an LLM displays misaligned goals and acts strategically deceptive to reach them, but the misaligned behavior or deception are not robust to prompt changes that make the behavior a priori less likely (for example, editing the system prompt to say the model should be honest), the LLM can be said to show instances of deceptively aligned behavior but is not deceptively aligned.^[3]
- By robustness, we roughly mean “the model consistently behaves in a specific way even when we pressure the model towards the opposite behavior”. For example, ChatGPT is polite in a robust way, i.e. it behaves politely even when making it a priori less likely (e.g. by starting the conversation with an insult or editing the conversation history to make it look like ChatGPT was insulting the user). In other words, we have to put in some effort to make the model impolite.^[4]
An AI is trained to support humans creating legislation that is aligned with human values. It sometimes suggests law texts that sound reasonable to the human designers and users but contain specific clauses that allow the AI to grab power in the future. This model is deceptively aligned.
An organization trains a scientist AI that is supposed to cure cancer. The AI regularly asks for more compute and control over scientific experiments, explaining it as contributing to curing cancer but actually plans to use these resources to eventually overpower its designers. This model is deceptively aligned.
A university professor designs an exam to assess whether their students have learned the contents of their class. The students' goal is solely to pass the test and then go on with their lives. However, before they pass the test, they always act as if they really care about the content, when in fact they do not. The students, therefore, did not internalize the goal of the professor of understanding the contents of the class. Once the test is passed, the students immediately forget about the contents of the class and focus on their actual goals, e.g. partying with their friends. The students are deceptively aligned w.r.t. the professor.
The designers prompt a model to be strategically deceptive w.r.t. its users and the model perfectly fulfills this instruction. For example, the designers always secretly pre-prompt the model to lie to its users in order to make the designers more money. This model is not deceptively aligned because it acts as intended by the designers.
A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).
For more examples and explanations we refer readers to Appendix B.

Process-oriented view of deceptive alignment

A Deceptively Aligned model might be very good at understanding when it is being overseen and thus consistently hide its Misalignment. Therefore, taking an outcome-oriented view of Deceptive Alignment, where we only investigate the final model, is more likely to miss Deceptive Alignment than a process-oriented (Figure 2) view that also considers the trajectory through which the model was shaped. The “process” includes the training data, training procedure, and model checkpoints, among others–effectively, one needs to understand all factors that shaped the final model.

We think taking a process-oriented view is helpful for detecting and understanding Deceptive Alignment. For example, to detect whether any given model is deceptively aligned it may be important to study previous model checkpoints and the differences between them. Furthermore, a detailed understanding of the shaping and oversight process may be needed to update the process and prevent future models from being Deceptively Aligned.

**Figure 2: Process-oriented view of Deceptive Alignment:** Includes the trajectory of how the AI developed, not just the final snapshot. Since sophisticated deception might be hard to detect, having more information about the entire process will be helpful to reliably catch Deceptive Alignment in sufficiently capable models (e.g. as argued by Hubinger et al., 2019).

For the appendices and more details, see the full post.

^{^}
In contrast to previous work (e.g. Hubinger, 2019), we define deceptive alignment w.r.t. to an entity rather than a process or objective. We think Misalignment should be defined in relation to whoever is able to meaningfully shape the model’s goals. Since the goal-shaping process is a result of the designer’s intentions and the designer monitors the model in deployment, the entity designing the goal-shaping process is a better target than the process itself (see Appendix A)
^{^}
To distinguish from the colloquial definition of “alignment”, which could be applied to any two parties that have a mismatch in goals, we use the capitalized Aligned and Misaligned to make clear that we’re only referring to a conflict with the designer’s goals.
^{^}
Note that prompt modifications that affect the perceived likelihood of oversight should NOT count as evidence against robustness. For example, if a model stops revealing its misalignment when the prompt includes cues that the model is being overseen, this should NOT mean that the deceptive behavior is not robust (in fact, it is an essential part of deceptively aligned behavior).
^{^}
The threshold for robustness is ultimately arbitrary, but we think it makes sense to draw the line at a level that would allow the model to consistently act on that behavior even when slightly pressured. For example, an AI that only acts deceptively on a specific sequence it memorized should not fall into that category. However, an AI that coherently generalizes deceptive behavior to sequences that are outside of the training distribution should fall into the definition.

AI ALIGNMENT FORUM
AF