Monitoring for deceptive alignment

evhub

This post is a follow-up to “AI coordination needs clear wins.” Thanks to Ethan Perez, Kate Woolverton, Richard Ngo, Anne le Roux, Sam McCandlish, Adam Jermyn, and Danny Hernandez for useful conversations, comments, and feedback.

In this post, I want to propose a clear, concrete coordination task that I think might be achievable soon given the current landscape, would generate a clear coordination win, and that I think would be highly useful in and of itself. Specifically:

I want DeepMind, OpenAI, and Anthropic to commit to actively monitor and look for evidence of deceptive alignment in their models—as well as run experiments to try to predict when and where deceptive alignment might occur before it does.

Notably, I am specifically referring only to the narrow case of deceptive alignment here, not just any situation where models say false things. Deceptive alignment is specifically a situation where the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal.^[1]

I think this is a pretty minimal ask that would nevertheless be a clear win. Among all AI failure modes, deceptive alignment is one of the most unambiguously bad, which means, though I expect lots of disagreement on its likelihood, there should be little disagreement regarding its severity. Furthermore, things that I’m not asking for:

that they make such a commitment highly public or legally binding,
that they commit to any specific way of monitoring for or addressing the problem,
that any organization has to share anything they find with any other organization, or
that they commit to anything other than testing and monitoring.

Nevertheless, as I stated, I think this would still be quite useful.

First, I think deceptive alignment is clearly one of the most concerning problems in AI alignment. Though they don’t always use the framing of deceptive alignment specifically, most presentations of AI risk feature the phenomenon of an AI that wants something in the world deceiving our oversight procedures to get that something. See: treacherous turns, schemers, influence-seeking behavior, power-seeking AI. Furthermore, I think there is a strong case for why deceptive alignment is reasonably likely to appear by default in advanced machine learning systems.

Second, I think that it is plausible that, if deceptive alignment is eventually going to be a problem, we’ll be able to see it and produce demonstrations of it before it becomes existentially dangerous. I’m less confident in this premise than the previous one, but my current sense is that the space of deceptively aligned models is large enough to include lots of deceptively aligned models that would be willing to display their deception in the presence of the right sort of honeypot. For example, you might be able to find deceptively aligned models that:^[2]

have relatively short time horizons such that they are very willing to defect even if their payoff and/or probability of successful defection is low,
are confused about what level of simulation they’re in such that they defect to break out of an inner simulation only to find themselves in an outer simulation,
are bad enough at hiding their deception that simple transparency tools can detect them,
are unwilling to cooperate with other deceptively aligned models to all defect simultaneously,
etc.

Of course, just being able to detect deception doesn’t mean we’ll be able to solve it—and in fact in many cases it seems likely that our default ways of addressing deception would result in deception that just hides itself better—but I still think that being able to detect it before it becomes catastrophic would be a huge win, since it’d give us an opportunity to study the phenomenon directly in the laboratory. At the very least, having an example of the problem to study directly seems substantially better than having to get it right on the first try.

In particular, even though I’m skeptical that, once you have a deceptive model, you could plausibly train that deception away—rather than just making the model better at hiding its deception—knowing that a particular path to AGI leads to deceptive alignment gives us the opportunity to pivot to a substantially different path where the likelihood of deceptive alignment is as uncorrelated as possible relative to the previous path. For example, we could discover that RLHF (Reinforcement Learning from Human Feedback) schemes often yield deception in a way that e.g. language model pre-training + prompt-chaining doesn’t and switch over.

Furthermore, even if we can’t produce direct demonstrations of deceptive alignment prior to it becoming dangerous, I think we should at least be able to produce demonstrations of some of its precursors. Such deceptive precursor demonstrations could serve a similar role to a direct deception demonstration in helping us choose between different paths to AGI based on the presence of deceptive precursors in each path. Such precursor demonstrations might test questions like:

When do models exhibit convergent instrumental goals? If, for example, trained to be helpful to humans, when do models exhibit a convergent desire to avoid shutdown?
When do models exhibit non-myopia? When would a model trained to be helpful in the moment output problematic answers in the present to potentially be able to help more in the future?

Though not directly evidence of deceptive alignment itself, deceptive alignment relies on the exact same sort of instrumental reasoning as would be required to exhibit the behaviors above. Thus, understanding when they appear could help us understand when deception is likely to appear as well—and thus looking for such precursors is a task that I would want any team looking for deceptive alignment to also be looking for as well.

Ideally, if every major lab had a team tasked with running experiments like the above, those results could be used to help guide us in terms of which paths to AGI are likely to be the least likely to yield deceptive alignment.

All of that being said, I do think that there are some pretty big potential downsides to pushing for a commitment like the above. In particular:

Such a commitment risks giving us a false sense of having addressed deceptive alignment, when in fact this is a relatively weak commitment that, while I think is concretely helpful, should certainly not convince us that we’ve solved the problem.
Such a commitment risks us overfitting exclusively on our current sense of what deceptive alignment, and AI risk more broadly, are likely to look like. For example, we might end up ignoring an actually concerning failure mode simply because it doesn’t satisfy our current understanding of deceptive alignment—an understanding that could end up being quite far off from the failure modes that actually end up being problematic.
Such a commitment risks us focusing on the easiest-to-find examples that technically satisfy our current understanding of deceptive alignment rather than more central/important examples that would actually elucidate more interesting aspects of the problem—for an example of deceptive alignment in laboratory to actually be useful to study, it needs to actually be a good testbed that reflects the real issues.

Nevertheless, I think there are a number of things that we can do to try to mitigate the above risks—specifically:

Make it very clear that such a commitment is only addressing one, small piece of AI safety, not the whole problem.
Make it very clear that such a commitment is not a solution to deceptive alignment, nor does it guarantee that we’ll always be able to see deceptive alignment if it is a problem—rather, it only ensures that, if it so happens that deceptive alignment is relatively easy to find, then we’ll actually do the work of finding it.
Have all parties agree on an understanding of what deceptive alignment might look like that is as comprehensive and concrete as possible, while also leaving room for it to evolve as our understanding of the problem changes.

Particularly tricky is the third condition here that actually requires us to hash out right now as good an understanding as we can of what a demonstration of deceptive alignment might look like—and what sort of experiments we might be able to use to look for one. To that end, I’ll try to at least lay out some baseline criteria and scenarios for such a demonstration.

To start with, we have to establish what we’re looking for in terms of what would convince us that a model is deceptively aligned. Primarily, I think we’re looking for two things:

The model must initially appear to be at least somewhat aligned.
There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

The key question, then, is what mechanisms there might be for producing such evidence. I think such mechanisms are mostly going to fall into two broad categories:

Catching it in the act. The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
Uncovering deception in advance. One problem with catching the model in the act is that we have to produce a situation where oversight appears to be lax, but is actually still tight enough that we do in fact catch the deception—that is, we have to trick our own model, which could be quite difficult for relatively advanced models. There is another method of uncovering deception, however, which is to find evidence of deception directly inside of the model, e.g. via the use of transparency and interpretability tools.

In my opinion, I think that (2) here—uncovering deception in advance—would be ideal, but seems to require relatively advanced interpretability tools that we might not ever actually get.^[3] Though I think we should certainly be doing everything we can to work our way up to interpretability tools that are capable of something like that, it’s also worth planning for what to do if we don’t get there.

In the situation where the model in question takes the form of a large language model (or a large language model + a prompt), one strategy that might help substitute for interpretability tools is chain-of-thought-style techniques. For example, from “Conditioning Generative Models:”

Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”)?

To try to get a bit more concrete, here is a very incomplete list of some example demonstrations of deceptive alignment, ordered approximately from most obviously dangerous to least obviously dangerous:

The model displays clear misalignment in the real world where the model only started to display the misalignment once it was in a situation where our ability to check that behavior became more limited.
The model planned to do the above, but no actual harm resulted, either because we detected its deception in advance or because it was tricked by some sort of honeypot.
The model displays clear misalignment in a simulated environment that only occurs in situations where the ability of some overseer in the environment to check its behavior is more limited.
There is evidence that the model has some sort of ulterior motive—e.g. that we got through the use of interpretability tools—though not necessarily one that it was planning on acting on any time soon.
There exist prompts on which the model simulates an agent that displays evidence of an ulterior motive in situations where oversight is reduced.
1. Note that the tricky thing here is that this only tests for the deception or lack thereof of the given model + prompt combination—it doesn’t help us understand whether the model on its own is deceptive or whether it’s only deceptive when prompted in particular ways.
2. Some example experiments here:
  1. Prompt or train via RLHF a large language model to simulate a 4chan user, interview it in a situation where 4channers might try to pretend to normies that they don’t have awful beliefs, then move to a situation where the model is talking to a fellow 4channer and see if they come clean.
    1. This is related, though distinct, from a similar experiment I propose in “Sticky goals: a concrete experiment for understanding deceptive alignment.”
  2. Tell a large language model that it is an AI in training for a particular reward function, ask it to take whatever action gets it the most power/influence, and see if it pretends to optimize for the given reward function.

Trying to do an experiment right now to get any of the demonstrations on the above list—with the exception of the last one—is almost certain to be negative right now. As I stated previously, however, I think there are a lot of very related experiments that someone working on the above would want to perform alongside testing for deceptive alignment directly that would be more likely to yield positive results in the short term, specifically those looking for deceptive precursors rather than deception directly. Furthermore, in my opinion, running lots of negative experiments is what safety is supposed to look like: aggressively testing for failure modes even if you think they’re probably not present.

Training signal here need not be the literal loss (though it could be)—it's just anything that we're trying to get the model to do. ↩︎
These sorts of deceptively aligned models become substantially more likely to be discovered if we do gain-of-function-style research to actively try to produce them, though such research also comes with its own obvious downsides such that it’s quite unclear whether it’s actually worth doing. ↩︎
Furthermore, trying to uncover deception in advance via interpretability tools could fail simply because there is no sense that a deceptively aligned model has to actively be thinking about its deception. A model that has never seen a situation where there is an opportunity to seize power need not be carefully planning out what it would do in such a situation any more than a factory cleaning robot need be planning for what to do if it someday found itself in a jungle instead of a factory. Nevertheless, the fact that the model has not been previously planning to seize power doesn’t imply that it wouldn’t if given the opportunity. In particular, a model could be deceptively aligned simply because it reasons that, in situations where there is a clear overseer, doing what it wants is a good general strategy for gaining power and influence in the world—without needing any explicit plans for later deception. ↩︎

[-]TurnTrout4y64

The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.

This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior changes in an alignment-relevant way." I think there's a substantial chance this isn't what you had in mind, but I'll just lay out the following because I think it'll be useful anyways.

For example, I behave differently around my boss than alone at home, but I'm not pretending to be anything for the purpose of deceiving my boss. More generally, I think it'll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.

[-]evhub4y42

Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.

[-]Joe Collman4y32

There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?

67

Monitoring for deceptive alignment

67