Update: I changed the title from "Why AGI safety researchers should focus mostly on deceptive alignment" to "Why deceptive alignment matters for AGI safety". I think my original message was too strong and I'm actually much more uncertain about failure modes from AGI than the title suggests.
Comment: after I wrote the first draft of this post, Evan Hubinger published “How likely is deceptive alignment” in which he argues that deceptive alignment is the default outcome of NNs trained with SGD. The post is very good and I recommend that everyone read it. As a consequence, I rewrote my post to cover less of the “why is deceptive alignment likely” question and more of the high-level arguments for “why should we focus on deceptive alignment”, and adopted Evan’s nomenclature to prevent confusion.
I’d like to thank Lee Sharkey, Richard Ngo and Evan Hubinger for providing feedback on a draft of this post.
TL;DR: No matter from which angle I look at it, I always arrive at the conclusion that deceptive alignment is either a necessary component or greatly increases the harm of bad AI scenarios. This take is not new and many (most?) people in the alignment community seem to already believe it, but I think there are reasons to write this post anyway. Firstly, newer members of the alignment community are sometimes not aware (at least I wasn’t in the beginning) that more senior people often implicitly talk about deceptive alignment when they talk about misalignment.
Secondly, there seems to be some disagreement about whether a powerful misaligned AI is deceptive by default or whether such a thing as a “corrigibly aligned” AI can even exist. I hope this post clarifies the different positions.
Epistemic status: Might have reinvented the wheel. Most of the content is probably not new for most people within the alignment community. Hope it is helpful anyway.
Update: after a discussion in the comments, I want to make some clarifications:
1. My definition of deception is a bit inconsistent throughout the post. I'm not sure what the best definition is but I think it is somewhere between "We have no clue what the model is doing" (which doesn't include active deception) and "The model is actively trying to hide something from us". Both seem like important failure modes.
2. I don't think deception is orthogonal to understanding other failure modes like getting what we measure. Deception can be a component of other failure modes.
3. This post should not be interpreted as "everything that isn't direct work on deception is bad" and more like "we should think about how other research relates to deception". For example, AI forecasting still seems super valuable to me. However, I think one of the main sources of value from AI forecasting comes from having better models of future AI capabilities and those might be used to predict when the model becomes deceptive and what happens if it does.
4. I'm not sure about all of this. I still find many aspects of alignment confusing and hard to grasp but I'm mildly confident in the statement "most failure modes look much worse when you add deception" and thus my takeaway is something like "deception is not everything but probably a good thing to work on right now".
By deceptive alignment, I mean an AI system that seems aligned to human observers and passes all relevant checks but is, in fact, not aligned and ultimately aims to achieve another non-aligned goal. In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned. If the AI accidentally followed a different goal, e.g. due to a misunderstanding or a lack of capabilities, this is not deceptive alignment but is described as corrigible alignment in Evan’s post. Corrigible alignment essentially means that the AI currently has an incorrect understanding of the world or our goals because it learned a wrong proxy but the proxy is not a terminal goal, i.e. the AI has no stake in preserving it and we could correct it once it is detected.
In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be. Note that this other goal doesn’t have to be meaningful or special: the deceptively aligned AI could be a paperclip maximizer pretending to be a sophisticated policy recommendation system or a sophisticated policy recommendation system pretending to be a paperclip maximizer.
It seems unclear whether the distinction between corrigible alignment and deceptive alignment makes sense. Potentially, deception is just a function of capabilities and initial goals, e.g. once the model is capable enough to understand that it has goals, that it is being trained to achieve them and that it currently has goals that differ from the intended goals (e.g. because it learned the wrong proxy), it automatically tries to become deceptive for instrumental reasons. In other words, if the AI is capable and doesn’t learn the right goal early in the training process, it is likely to be deceptive by default. Richard Ngo, for example, argues that “Once policies can reason about their training processes and deployment contexts, they’ll learn to deceptively pursue misaligned goals while still getting high training reward.” in this paper.
In Risks from Learned Optimization, Evan Hubinger writes: “Even a robustly aligned mesa-optimizer that meets the criteria [for deceptive alignment] is incentivized to figure out the base objective in order to determine whether or not it will be modified since before doing so it has no way of knowing its own level of alignment with the base optimizer.” In a draft of this post, he commented: “Any proxy-aligned model that meets the criteria for deceptive alignment—most notably that cares about something in the world over time—will want to be deceptive unless it is perfectly confident that it is perfectly aligned.”. Thus, it is possible that powerful models that don’t fulfill these criteria are corrigibly aligned, i.e. there are some forms of (mis-)alignment that are not deceptive by default.
There seems to be some agreement on the ends of the spectrum, e.g. both sides of the debate would probably agree that very powerful and situationally aware models are deceptive for instrumental reasons. They would likely also agree that for very bad models the category of alignment doesn’t make that much sense to begin with, e.g. an MNIST classifier that doesn’t have high accuracy is not “misaligned”, it’s just a bad model.
I think there are multiple possible explanations for this kind of disagreement. They include
I am personally uncertain about whether all sufficiently powerful and situationally aware models are automatically deceptive. My rough beliefs are:
In this section, I want to argue why deceptive alignment is an especially important component of AGI safety and that the AGI safety community should, therefore, prioritize it more than we currently do.
In case you think that all forms of misalignment are necessarily deceptive, you probably don’t have to read the rest of this post.
My current impression is that, in the long run, the biggest harms (x-risk, s-risk or other) come from huge power asymmetries between agents with different goals, e.g. humanity vs. a misaligned AGI. Effectively, a very powerful agent can just say “I want this thing and you can’t stop me” and then do it. However, deceptive alignment seems to be relevant for these power asymmetries to arise in two key ways.
Firstly, the way in which large power asymmetries are achieved is often through deceptive alignment. In many scenarios in which an insufficiently (for catastrophic risks) powerful agent gains more power, it does so by pretending to be good for sufficiently long to gain more power in the meantime, e.g. by collecting resources or setting up structures for future take-over. Once it has enough power, it starts to pursue its own goals which leads to disaster. The deception was necessary to get from an insufficiently powerful (more or less harmless) state to a sufficiently powerful (pretty harmful) state. Had the AI revealed its true intentions in the beginning, it would likely not have been able to amass that many resources.
Secondly, with deceptive alignment, the upper bound of harm from one misaligned AI system seems to be much higher than from corrigibly aligned systems. A deceptively aligned AI can effectively just wait, observe its observers, collect resources whenever possible, and so forth, once it has convinced the observers that it is aligned. The longer it waits to strike the higher the potential damage. When it waits long enough, the damage might be really big, e.g. extinction or anti-utopia.
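As a rough intuition pump for the "longer wait, higher damage" point, here is a toy sketch. Everything in it is an assumption made up for illustration and not part of the argument above: resources grow by a fixed factor `growth` per step, observers catch the misaligned system with a fixed probability `detect_prob` per step, and the function name `expected_resources_when_stopped` is hypothetical. A system that hides its goal is modelled by a low `detect_prob`, a transparently misaligned one by a high `detect_prob`.

```python
def expected_resources_when_stopped(detect_prob, growth=1.10, start=1.0, horizon=50):
    """Expected resources the system holds when it is caught, or when the horizon ends.

    Toy assumptions (illustrative only): constant per-step detection
    probability and constant multiplicative resource growth.
    """
    expected = 0.0
    survive = 1.0      # probability the system has not been caught so far
    resources = start  # resources held at the current step
    for _ in range(horizon):
        # Caught at this step while holding the current amount of resources.
        expected += survive * detect_prob * resources
        survive *= 1.0 - detect_prob
        resources *= growth
    # Never caught within the horizon: it holds everything it accumulated.
    expected += survive * resources
    return expected


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(f"detect_prob={p}: {expected_resources_when_stopped(p):.2f}")
```

Under these assumptions, lowering the per-step detection probability increases the expected resources the system controls by the time anyone intervenes. The numbers themselves mean nothing; the sketch only illustrates the direction of the effect.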
In general, everything looks worse with deceptive alignment. Most, if not all, scenarios of AI going badly are worse when you add deceptive alignment, e.g. if the goal of the other entity is to fool you about their goals in addition to having these other goals. Let’s look at some examples
Which one do you think creates more damage to your company in the long run?
I think corrigible alignment, e.g. an AI misunderstanding a specification or a goal being underspecified, will be a problem and will lead to damage. However, I think that conventional AI capabilities researchers will eventually have to address these problems while there is no reason for them to address deceptively misaligned AI before it is too late.
If your system doesn’t work as intended, e.g. because it doesn’t do what the customer wants or even creates some small-scale accidents, this directly hurts your profit margin. Thus, the AI capabilities company has an incentive to work on these problems directly and the error feedback is fast enough to be noticed. I think RLHF provides some evidence for this hypothesis, i.e. GPT-3 didn’t quite get what the customers wanted and thus OpenAI used RLHF to correct some of the wrong proxies learned by GPT-3.
Deceptive alignment on the other hand is likely harder to notice and feedback cycles might be much longer. For example, an AI could work fine for a couple of years before it suddenly starts to do very weird things. By this time, the AI company might not exist anymore or doesn’t feel responsible for the damage.
Since most research on deceptive alignment might sound a bit sci-fi to a non-safety-conscious person, it is hard or impossible to get funding or support to work on it in academia. Furthermore, the incentives of most academics are to write incremental and concrete papers. Thus, I don’t think most academics will work on deceptive alignment until it is incentivized which might be too late.
Additionally, detecting deceptive alignment is hard. It is likely a hard problem in general and we are currently very far away from having any tools to reliably detect it. Often people shy away from working on hard problems because they don’t feel rewarding and tend to imply failure. Thus, without a very strong motivation to work on deceptive alignment, the vast majority of people will work on simpler problems.
For all these reasons, I think that safety-conscious researchers have a special reason to work on deceptive alignment.
I think most, if not all, short timeline x-risk scenarios (at least the ones I find plausible) contain deceptive alignment as a key component. I expect the current trend in capabilities to continue in small steps, e.g. there are no insane levels of capability differences between a model of size X vs. a model of size 10X (where X is e.g. the number of parameters or FLOP for the training run). Therefore, I don’t expect models to suddenly get so powerful that they could immediately overpower humanity. Thus, I don’t expect that early AGI systems with roughly human-like capabilities will be able to enforce their misaligned goal if they make it transparent or don’t actively hide it.
In my opinion, the most plausible way for such an early system to lead to a catastrophe is by realizing that it is insufficiently powerful, pretending to be aligned and amassing resources in the background. A corrigibly misaligned system of equal power will likely be caught and corrected or stopped early on.
Therefore, if we want to prevent the catastrophic risks that are most likely to happen early, we should prioritize deceptive alignment.
I expect deceptive alignment to be less tractable than other things in alignment, i.e. I expect that it is harder to make progress and to know whether you have made progress compared to most other AI safety research.
However, not all hope is lost. I think that the most obvious answer to the problem of deceptive systems is: “we just need to understand the systems better”. By this I mean we need to understand their goals, their world model, their behavior, their training procedure, and much more. I feel like we should be able to make tractable progress on these kinds of questions as can already be seen with mechanistic interpretability and other kinds of interpretability efforts.
I think one possible answer to the above is that most of the harm comes from the fact that the model has a different goal rather than whether it is deceptively misaligned or accidentally misaligned. While I think that corrigible misalignment is a problem, I think the biggest risks come from deception.
The main reason for this is that in deceptive alignment the other entity is actively trying to fool you, i.e. to pick the action that maximizes your trust while it increases the goal of the other agent. To understand this intuition, consider the following examples.
I’m aware that this is not a proof but just a bunch of intuition pumps but I think the intuition already clarifies why I think deception is harder. There are more technical arguments about simplicity and speed in Evan’s post.
Many people in alignment have stated that they think deceptive alignment leads to the biggest risks or are actively working to solve them already. However, I feel like while this belief might be obvious to many more experienced members of the community, it is not common sense among newcomers (might just be my impression). Therefore, I thought it might be helpful to add this very explicit post to the list of works/people that are less explicit about this assumption.
People who seem to think that deceptive alignment is very important include
To be fair, I’m not very certain what the correct response to this insight is. The two implications I drew for myself are
I’m currently exploring a research agenda that addresses some aspects of deceptive alignment. I’ll publish some version of it when I have made enough progress.
AGI safety researchers should focus (only/mostly) on deceptive alignment
I consider this advice actively harmful, and strongly advise doing the opposite.
Let's start at the top:
By deceptive alignment, I mean an AI system that seems aligned to human observers and passes all relevant checks but is, in fact, not aligned...
So far so good. There's obvious reasons why it would make sense to focus exclusively on cases where an AI seems aligned to human observers and passes all relevant checks while still not actually being aligned. After all, in the cases where we can see the problem, we either fix it or at least iterate until we can't see a problem any more (at which point we have "deception" by this definition).
But then the post immediately jumps to a far narrower definition of "deceptive alignment":
In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note, that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.... In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be.
If we look at e.g. the cases in Worlds Where Iterative Design Fails, most of them fit the more-general definition, yet none necessarily involve an AI which is "actively adversarial but pretends not to be". And that's exactly the sort of mistake people make when they focus exclusively on a single failure mode: they end up picturing a much narrower set of possibilities than the argument for focusing on that failure mode actually assumes.
Now, an oversight like that undermines the case for "focus only/mostly on deceptive alignment", but doesn't make it very actively harmful. The reason it's actively harmful is unknown unknowns.
Claim: the single most confident prediction we can make about AGI is that there will be surprises. There will be unknown unknowns. There will be problems we do not currently see coming.
The thing which determines humanity's survival will not be whether we solve alignment in whatever very specific world, or handful of worlds, we imagine to be most probable. What determines humanity's survival will be whether our solutions generalize widely enough to handle the things the world actually throws at us, some of which will definitely be surprises.
How do we build solutions which generalize to handle surprises? Two main ways. First, understanding things deeply and thoroughly enough to enumerate every single assumption we've made within some subcomponent (i.e. mathematical proofs). That's great when we can do it, but it will not cover everything or even most of the attack surface in practice. So, the second way to build solutions which generalize to handle surprises: plan for a wide variety of scenarios. Planning for a single scenario - like e.g. an inner agent emerging during training which is actively adversarial but pretends not to be - is a recipe for generalization failure once the world starts throwing surprises at us.
At this point I'm going to speculate a bit about your own thought process which led to this post; obviously such speculation can easily miss completely and you should feel free to tell me I'm way off.
First, I notice that nowhere in this post do you actually compare deceptive alignment to anything else I'd consider an important-for-research alignment failure mode (like fast takeoff, capability gain in deployment, getting what we measure, etc). You just argue that (a rather narrow version of) deception is important, not that it's more important than any of the other failure modes I actually think about. I also notice in the "implications" section:
I should ask “how does this help with deceptive alignment?” before starting a new project. In retrospect, I haven’t done that a lot and I think most of my projects, therefore, have not contributed a lot to this question.
What this sounds like to me is that you did not previously have any realistic model of how/why AI would be dangerous. I'm guessing that you were previously only thinking about problems which could be fixed by iterative design - i.e. seeing what goes wrong and then updating the design accordingly. Probably (a narrow version of) deception is the first scenario where you've realized that doesn't work, and you haven't yet thought of other ways for an iterative design cycle to fail to produce aligned AI.
So my advice would be to brainstorm other things "deception" (in the most general sense) could look like, or other ways the iterative design cycle could fail to produce aligned AI, and try to aim your brainstorming at scenarios which are as different as possible from the things you've already thought of.
To be clear, I agree that unknown unknowns are in some sense the biggest problem in AI safety—as I talk about in the very first paragraph here.
However, I nevertheless think that focusing on deceptive alignment specifically makes a lot of sense. If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal” (where training signal doesn't necessarily mean the literal loss, just anything we're trying to get it to do), then I think most (though not all) AI existential risk scenarios that aren't solved by iterative design/standard safety engineering/etc. include that as a component. Certainly, I expect that all of my guesses for exactly how deceptive alignment might be developed, what it might look like internally, etc. are likely to be wrong—and this is one of the places where I think unknowns really become a problem—but I still expect that if we're capable of looking back and judging “was deceptive alignment part of the problem here” in situations where things go badly we'll end up concluding yes (I'd probably put ~60% on that).
Furthermore, I think there's a lot of value in taking the most concerning concrete problems that we can yet come up with and tackling them directly. Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research—and for obvious reasons I think it's most valuable to start with the most concerning concrete failure modes we're aware of. It's extremely hard to do good work on unknown unknowns directly—and additionally I think our modal guess for what such unknown unknowns might look like is some variation of the sorts of problems that already seem the most damning. Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it's pretty important to have some idea of what we might want to use those sorts of tools for when developing them, and working on concrete failure modes is extremely important to that.
If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal”...
That's "relatively broad"??? What notion of "deceptive alignment" is narrower than that? Roughly that definition is usually my stock example of a notion of deception which is way too narrow to focus on and misses a bunch of the interesting/probable/less-correlated failure modes (like e.g. the sort of stuff in Worlds Where Iterative Design Fails).
Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research ... Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it's pretty important to have some idea of what we might want to use those sorts of tools for when developing, and working on concrete failure modes is extremely important to that.
This I agree with, but I think it doesn't go far enough. In my software engineering days, one of the main heuristics I recommended was: when building a library, you should have a minimum of three use cases in mind. And make them as different as possible, because the library will inevitably end up being shit for any use case way out of the distribution your three use cases covered.
Same applies to research: minimum of three use cases, and make them as different as possible.
What notion of "deceptive alignment" is narrower than that?
Any definition that makes mention of the specific structure/internals of the model.
I guess I would lean towards saying that once powerful AI systems exist, we'll need powerful aligned systems relatively fast in order to defend against them, otherwise we'll be screwed. In other words, AI arms race dynamics push us towards a world where systems are deployed with an insufficient amount of testing and this provides one path for us to fall victim to an AI system that you might have expected iterative design to catch.