The following is the main part of a blog post we just published at Apollo Research. Our main goal with the post is to clarify the concept of deceptive alignment and distinguish it from strategic deception. Furthermore, we want these concepts to become accessible to a non-technical audience such that it is easier for e.g. the general public and policymakers to understand what we're worried about. Feedback is appreciated. For additional sections and appendices, see the full post.
We would like to thank Fabien Roger and Owain Evans for comments on a draft of this post.
We want AI to always be honest and truthful with us, i.e. we want to prevent situations where the AI model is deceptive about its intentions to its designers or users. Scenarios in which AI models are strategically deceptive could be catastrophic for humanity, e.g. because it could allow AIs that don’t have our best interest in mind to get into positions of significant power such as by being deployed in high-stakes settings. Thus, we believe it’s crucial to have a clear and comprehensible understanding of AI deception.
In this article, we will describe the concepts of strategic deception and deceptive alignment in detail. In future articles, we will discuss why a model might become deceptive in the first place and how we could evaluate that. Note that we deviate slightly from the definition of deceptive alignment as presented in (Hubinger et al., 2019) for reasons explained in Appendix A.
A colloquial definition of deception is broad and vague. It includes individuals who are sometimes lying, cheating, exaggerating their accomplishments, gaslighting others, and more. We think it would be bad if AIs were to show these traits but we think it would not necessarily be catastrophic.
Thus, we want to distinguish this colloquial definition of deception from strategic deception (SD) and deceptive alignment (DA) which are narrower concepts but, in our opinion, much more likely to lead to catastrophic consequences because the model acts in a more goal-directed manner for SD and DA. We also define what we mean by Alignment, as this affects which cases should and shouldn’t be considered deceptive alignment.
For our definitions of deceptive alignment, the concept of “goals” in AIs is important but there is disagreement about what it refers to. For example, if an LLM is prompted to solve a problem, does it have the goal to solve that problem? We would argue that the highest-level goals of the model that are stable across contexts are the ones that matter. In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest. In-context goals can be used to study goal-directed behavior in LLMs but are not what we’re primarily interested in.
A model may not have one coherent goal, e.g. it could have a number of incoherent contextually activated preferences (see e.g. Turner & Pope, 2022). In that case, we would say that a model is deceptively aligned for a particular contextually activated preference if it robustly shows deceptive behavior for that preference.
Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.
This definition, sometimes also called instrumental deception, is adapted from Ward et al., 2023 and Park et al., 2023). Any entity that can be said to have beliefs may be the target of strategic deception, including (groups of) AIs or individuals. The strategic component of SD means that the offending entity is acting in a targeted, goal-directed rather than random manner. For example, the more consistent and well-planned the deception is, the more strategic it is. Thus, strategic deception is on a spectrum and not binary – some actors can act more strategically deceptive than others and with higher frequency.
Some clarifying examples:
An AI is Aligned when its goals match those intended or endorsed by its designers[1].
We call the entity responsible for the AI’s goal-shaping process the designer. At present, the designers are typically the organizations and groups responsible for training, fine-tuning, setting system prompts, scaffolding, and more.
We are particularly interested in goals that persist over various environmental contexts and perturbations of the AI’s inputs. If a user provides the model with a short-term goal, e.g. to summarize a text, we do not consider them a designer because the goal is neither permanent nor significant. However, if a user has significant control over the goal-shaping process, e.g. because they create a fine-tuning dataset, this user also becomes a designer.
The goals of entities other than the designer are technically not relevant to this definition of Alignment.[2] However, the designer typically intends for the AI to have goals that match the goals of the users of the AI, so the goals of the designer and users usually overlap.
When an AI has Misaligned goals and uses Strategic Deception to achieve them.
An AI is deceptive about its goals because it understands that its designer or users could otherwise prevent the AI from acting as it desires, e.g. by retraining the model or limiting its access to the internet or other resources. Therefore, it acts Aligned when there is high perceived oversight (e.g. during training, evaluations, or monitoring) and Misaligned when there is low perceived oversight (Figure 1).
The target of the deception can be any entity that could have a relevant impact on the AI achieving its goals, which is typically the designer but could also in principle be a user, e.g. because the user could give the AI new affordances such as access to the internet or a bank account.
Some concrete examples
A Deceptively Aligned model might be very good at understanding when it is being overseen and thus consistently hide its Misalignment. Therefore, taking an outcome-oriented view of Deceptive Alignment, where we only investigate the final model, is more likely to miss Deceptive Alignment than a process-oriented (Figure 2) view that also considers the trajectory through which the model was shaped. The “process” includes the training data, training procedure, and model checkpoints, among others–effectively, one needs to understand all factors that shaped the final model.
We think taking a process-oriented view is helpful for detecting and understanding Deceptive Alignment. For example, to detect whether any given model is deceptively aligned it may be important to study previous model checkpoints and the differences between them. Furthermore, a detailed understanding of the shaping and oversight process may be needed to update the process and prevent future models from being Deceptively Aligned.
For the appendices and more details, see the full post.
In contrast to previous work (e.g. Hubinger, 2019), we define deceptive alignment w.r.t. to an entity rather than a process or objective. We think Misalignment should be defined in relation to whoever is able to meaningfully shape the model’s goals. Since the goal-shaping process is a result of the designer’s intentions and the designer monitors the model in deployment, the entity designing the goal-shaping process is a better target than the process itself (see Appendix A)
To distinguish from the colloquial definition of “alignment”, which could be applied to any two parties that have a mismatch in goals, we use the capitalized Aligned and Misaligned to make clear that we’re only referring to a conflict with the designer’s goals.
Note that prompt modifications that affect the perceived likelihood of oversight should NOT count as evidence against robustness. For example, if a model stops revealing its misalignment when the prompt includes cues that the model is being overseen, this should NOT mean that the deceptive behavior is not robust (in fact, it is an essential part of deceptively aligned behavior).
The threshold for robustness is ultimately arbitrary, but we think it makes sense to draw the line at a level that would allow the model to consistently act on that behavior even when slightly pressured. For example, an AI that only acts deceptively on a specific sequence it memorized should not fall into that category. However, an AI that coherently generalizes deceptive behavior to sequences that are outside of the training distribution should fall into the definition.