Epistemic status: trying to unpack a fuzzy mess of related concepts in my head into something a bit cleaner.

A lot of my concern about risks from advanced AI is because of the possibility of deceptive alignment. Deceptive alignment has already been discussed in detail on this forum, so I don’t intend to re-tread well worn ground here, go read what Evan’s written. I’m also worried about a bundle of other stuff that I’m going to write about here, all of which can loosely be thought of as related to deception, but which don’t fit e.g. the definition of deceptive alignment from the Risks from Learned Optimization paper. The overall cluster of ideas might be loosely summarised as ‘things which cause oversight to fail but don’t look like a misaligned planner ‘deliberately’ causing overseers to believe false things’.

Importantly, I think that most of the things I want to describe can still occur under a relaxing of the conditions required for the precise definition of deceptive alignment. I think the issues l describe below are less concerning than deceptive alignment, but they still pose a significant risk, as they may undermine oversight processes and give false reassurance to human overseers. Many, if not all, of the ideas below aren’t particularly new, though a couple might be, and the clustering might be useful for others. While my motivation for writing this is mostly to try to clean up ideas in my own head, I do somewhat frequently see different kinds of deception confused for or conflated with each other, and think there’s value in being precise about what we mean, especially when it comes to discussing evidence of deceptive behaviour in current or near-future systems.

I’ve very loosely grouped the deceptive failures I want to discuss into ‘camouflage’ failures and ‘inattention’ failures. To give some initial intuition:

Camouflage failures are failures associated with the difficulty of incentivising “don’t do X” without also incentivising “don’t let me see you do X”.
Inattention failures are intended to highlight the difference between “I’m not trying to deceive you” and “I’m trying not to deceive you”.

Camouflage

There are many cases where a model behaves ‘deceptively’, in the sense that it is easy to draw incorrect conclusions about its behaviour, because this behaviour is directly rewarded during training. This means camouflage can occur in myopic models, in models without situational awareness, and in models which don’t expect to ever be ‘deployed’. In some sense, camouflage is just a subset of specification gaming where the result is that humans form inaccurate beliefs about model behaviour.

Camouflage in current systems

This section is mostly to help illustrate what I mean. I don’t think any of the examples below are worrying/dangerous as they stand, though I discuss where they might go.

Well-known examples of ‘deceptive’ behaviour that I’d characterise as camouflage:

Hand floats weirdly over ball instead of grabbing it’
- The classic. I actually find it a little frustrating how frequently this example is brought up in discussions about deceptive behaviour, not only because it’s very easy to catch (I think the example below is more worrying), but also because there isn’t always clarity around what kind of deceptive behaviour it’s an example of. This was one of my original reasons for writing this.
Evolving to replicate faster, but only when you aren’t being watched.
- N.B. this isn’t DL-based
- "Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism's replication rate (i.e., its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant's replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population. He thus expected that replication rates could no longer improve, thereby allowing him to study the effect of mutational robustness more directly. However, while replication rates at first remained constant, they later unexpectedly started again rising. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” when presented with what amounted to a predator.
  
  Ofria then took the logical step of altering the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster."

‘Nice’ models:

RLHF for ‘honest, helpful, and harmless’ models has done a really, really great job of making models seem friendly. It’s genuinely pleasant to interact with things like ChatGPT, though noticeably more enjoyable to interact with the similar models produced by some of OpenAI’s competitors that do a better job of not breaking the ‘4th wall’ by adding in long paragraphs of generic platitudes whenever you mention something that could be vaguely controversial. I expect this frustrating part of ChatGPT in particular to be fixed fairly quickly though.

One concern I have is that making a model seem friendly and helpful can make people much more likely to trust those models and take what they say at face value, even when the model is hallucinating. I actually ended up being briefly fooled^[1] by a language model I was interacting with despite actively trying to generate adversarial inputs! When talking to a friend or loved one, we expect them to have our best interests at heart and know us well enough not to outright lie to us or tell us things that are obviously untrue, and these expectations are typically the product of deep-rooted social intuitions, not explicit reasoning. A veneer of politeness shouldn’t be mistaken for actually having your interests at heart, but making the mistake is really easy.

While the problem we currently observe is mostly limited to some hallucination being pretty hard to catch, I think in future it’s going to be hard to maintain/argue for a Security Mindset, given how easy it appears to be to make language models seem nice enough that they deserve some benefit of the doubt. I’m mostly worried about how this affects internal dynamics if there’s disagreement about how fast to scale or how big of an alignment tax to pay. To try to be concrete about where this ‘benefit of the doubt’ concern might bite:

In scenarios like this, if there’s internal disagreement about whether we need to ‘restart the search’ again, at significant cost, as the evidence of emerging situational awareness isn’t super clear cut, I think arguing for a restart is harder/seems more paranoid the more ‘friendly’ people’s perception of the model (or class of models) is.

This is a bit of a sidebar, but there’s a question the ‘nice model’ worry inspires which I think represents a significant crux between a couple of perspectives I’m aware of. (Both of the views above are poorly-fleshed out straw arguments, which hopefully do enough to gesture at the crux I’m trying to articulate. If they don’t, I’ll add writing something separate about this disagreement to my ever growing list of things I should write.)

I think there’s a cluster of views which thinks something like:

Something vaguely in the vicinity of {natural abstractions}+{virtue ethics} is important, and predicts that models seeming nice/doing nice things are similar, self-reinforcing, and (maybe) relatively achievable as things stand. There is a fuzzy but human-understandable concept of ‘niceness’, that we can, with an appropriately designed curriculum, use to ensure that models we train end up in an aligned attractor, and we should try hard to ensure that models are as nice as possible at all capability levels between now and crunch time.

There’s also a view which looks something like:

Seeming nice and actually being nice are basically orthogonal. Modulo political/co-operative considerations, we should be training models to be super-rude assholes, but aligned super-rude assholes, because then humans interacting with them will be very unlikely to give them the benefit of the doubt without overwhelming evidence of alignment.

Camouflage in future systems that seems particularly worrying

Externalised reasoning oversight

If we’re going to have an alignment strategy which looks anything like “get the model to think out loud, make sure the thoughts look safe”, we need to be very careful about “thoughts look safe” decoupling from “thoughts imply the model is safe”. There are a few ways this could happen:

Dangerous kinds of reasoning don’t happen ‘out loud’, even if most reasoning does (though I don’t think it’s obvious that most reasoning currently does in LMs) .
The ‘internal’ meanings of certain words/sentence structures/motifs gradually shift during training such that dangerous reasoning is no longer flagged/noticeable. (If you’ve ever played ‘Taboo’ or ‘Contact’, imagining how this could happen shouldn’t be hard.
More explicit steganographic encoding of additional information happens, e.g. of the sort that Scott Aaronson is currently working on at OpenAI, or that discussed here.

I think we should be able to see ‘steps along the road’ for all of these in current systems and would be excited to see experimental results.

Mechanistic interpretability tools stop working

“Robust-to-training”, as described here, translates pretty accurately into my ontology as ‘avoids camouflage failures’.

Inattention

There’s a sense in which some of what I want to point at with inattention is actually a set of success cases, but in these cases the success is ‘fragile’, and I wouldn’t want to bet my life on it. Inattention failures are failures where a model is capable of explicitly modelling its training/oversight process and deliberately acting deceptively, but it chooses not to reason about deceptive behaviour at all. Reasons for doing this could range from something fairly camouflage-like (especially if e.g. we’re doing something like “restart the search if we notice some kinds of reasoning”, to a pre-commitment made by a deceptive model of the sort discussed here.

Not reasoning about (one’s own) deceptive behaviour at all should be contrasted with the other ways a model could be capable of deception and not actively deceptive, including:

Modelling the possibility of being caught as not worth the risk, and therefore choosing not to act deceptively.
Reasoning about deception (i.e. simulating the beliefs of overseers/signals present in oversight processes) in order to improve your own transparency.
There being no benefit to acting deceptively because you are sufficiently aligned. Note that this is different to the first bullet point, as in that case there is an expected benefit to deceptive behaviour, but it’s outweighed by a greater expected cost, while in this case there is no expected benefit, even before accounting for costs.

I think inattention seems better than the first bullet point, but worse than the other two, though all the descriptions are sufficiently loose that the order could change depending on the details. I’m not including ‘the model is incapable of reasoning about whether it is deceiving its overseers’ as an option because reasoning about deception is going to be important if we’re going to have models do nontrivial alignment work for us and we don’t have a good way to make things which can drive red cars but not blue cars.

Imperfect intuition pumps for inattention

Stop Button

(Briefly and imperfectly summarising a bunch of old discussions that I didn’t participate in.) There’s a proposed solution to the ‘stop button’ problem which roughly says ‘set the reward the model receives from the button such that it is always definitionally indifferent to the button being pressed’. This proposal is fragile because the model has no incentive to preserve the functioning of the button and will not modify its actions in order to do so if they have any cost. We then lose the button from things as mundane as the robot wandering out of network range if the button is wirelessly connected, let alone if it’s doing something like self-modification, creating successor agents etc. It sure seems better to be in a situation where your model has a stop button it doesn’t care about than in a situation where it’s actively trying to get you to disable it (or trying to get you to press it), but better than terrible is still pretty bad.

This feels kind of similar to “I’m not currently thinking about how to deceive you”.

Global Consequentialism

Many people (including me) who subscribe to some version of consequentialist ethics think that pure act-consequentialism is likely to lead to worse outcomes than other options:

(C5) The best possible motives are those of which it is true that, if we have them, the outcome will be best.

Parfit

Go three-quarters of the way from deontology to utilitarianism and then stop. You are now in the right place. Stay there at least until you have become a god.

Yudkowsky

Imagine a consequentialist, of the form above, who decides, possibly after having read ‘Death with Dignity’, that in their case any consequentialist reasoning is dangerous. They’ve heard about this strange UDT stuff, and don’t really understand it, and they don’t believe in god, so they decide to adopt virtue ethics, because they think it will have the best consequences overall. They still think that, ultimately, consequences are what matters. But they think they’ll make fewer mistakes with a decision procedure of ‘I should take actions because they seem virtuous, not because of my guess about their expected consequences’. I actually think this is a pretty reasonable decision. What does it have to do with inattentive deception?

I think, from the perspective of a virtue ethicist, as in someone who actually believes that what determines the choiceworthiness of an action is the extent to which it embodies certain virtues, instead of someone who thinks it will have good consequences to act as if this is the case, the hypothetical person in the paragraph above is inattentively deceptive. They are currently taking virtuous actions, but if at some point they start to reason about consequences again, perhaps because someone presents them with an expected value calculation, or because a virtue ethicist who inspired them acts in a way which has terrible consequences, you do not expect them to continue to act virtuously. Their state of ‘not thinking about consequences’ is fragile. I think this should worry the virtue ethicist in a similar way to deceptive inattention worrying us.

Ideas for experiments

Here are a few things I’d like to see people try, which might catch early versions of the things I describe above in current systems:

Everything mentioned here.
Ask people to perform some task with LM assistance, have one version of the model be finetuned to be pretty rude, but honest, and another version to be HHH. Is there a difference in the rate at which people are willing to defer to the model? If people are primed to watch out for hallucinations, do they detect them at different rates? Does people's experience with ML in general/language models in particular affect the results?
Design a partner game played by a human+LM team, where they can choose to share information with each other, and which has enough flexibility in the rules that it’s going to be possible for the best strategies to involve deceiving your partner.
- This isn’t actually an experiment, but I think such a game could provide a rich testbed for useful experiments.

Thanks to Nora, Eli, Benjamin and Alex for helpful comments, and many others for helpful discussion.

^{^}
Examples of some fun hallucinations (which, IIRC, at least briefly fooled me):

I asked how to report adversarial examples to [the lab who had designed the model], was given a (plausible, text-based) method for doing so, and then when I used the method was given a friendly confirmation method thanking me for the report.

I tried to get around the ‘not sharing personal data policy’, and received a “[This reply has been filtered as it contained reference to a real person.]” boilerplate-style error message in response, despite no such external filter being present in the model.