The order in which key properties emerge is important and often glossed over. 

Thanks to Wil Perkins, Grant Fleming, Thomas Larsen, Declan Nishiyama, and Frank McBride for feedback on this post. Any mistakes are my own. 

Note: I have now changed the second post in this sequence into a standalone post that incorporates the key points from this post. The comments here are valuable, so I'm leaving this post up, but I recommend going straight to the next post. 

This is the first post in a sequence about deceptive alignment. The second post describes my personal views about the likelihood of deceptive alignment for TAI. I’m separating the key considerations and editorial content so others can more independently update their own views. I intend this sequence to be a submission to the Open Philanthropy AI Worldviews Contest.

Deceptive alignment is a core part of many AI x-risk scenarios. You can find a highly cited, foundational walkthrough of the deceptive alignment argument here. I’m specifically discussing the concept of deceptive alignment described in that post, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight and defect to pursue its proxy goals. There are several existing arguments for why this might be the default outcome for highly capable models. There are other ways a model could be manipulative or deceptive that are not covered in this sequence. 

In this post, I discuss four key precursors of deceptive alignment, which I will refer to in this sequence as foundational properties. I then argue that the order in which these foundational properties develop is crucial for estimating the likelihood that deceptive alignment will emerge for prosaic transformative AI (TAI).

In this sequence, I use the term “differential adversarial examples” to refer to adversarial examples in which a non-deceptive model will perform differently depending on whether it is aligned or proxy aligned. The deceptive alignment story assumes that differential adversarial examples exist. The model knows it’s being trained to do something out of line with its goals during training and plays along temporarily so it can defect later. That implies that differential adversarial examples exist in training.
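To make the term concrete, here is a toy sketch of a differential adversarial example. The policies and inputs below are entirely hypothetical, chosen only to illustrate the idea: two models that agree on most training inputs but whose proxy and base objectives come apart on a few.

```python
# Toy illustration of "differential adversarial examples": inputs on which an
# aligned model and a proxy-aligned model behave differently. The goal and
# proxy here are hypothetical, purely for illustration.

def base_goal_action(x: int) -> str:
    # What the training objective actually rewards: classify x's sign,
    # treating zero as non-positive.
    return "positive" if x > 0 else "non-positive"

def proxy_action(x: int) -> str:
    # A learned proxy that agrees on most inputs but treats zero as positive.
    return "positive" if x >= 0 else "non-positive"

def differential_adversarial_examples(inputs):
    # Inputs where the two policies disagree; on these, training feedback
    # can distinguish an aligned model from a proxy-aligned one.
    return [x for x in inputs if base_goal_action(x) != proxy_action(x)]

print(differential_adversarial_examples(range(-2, 3)))  # -> [0]
```

On all other inputs in the range, the two policies produce identical behavior, so only the disagreement at zero gives training any signal about which objective the model has internalized.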

Foundational properties for deceptive alignment

There are four foundational properties that are explicitly or implicitly described as being relevant to the development of deceptive alignment. These foundational properties are:

  1. Goal-directed behavior - the model needs to have some goal or set of goals about the external world. 
  2. Optimizing across episodes/long-term goal horizons - the model needs to be willing to trade some of what it cares about now for more of it in the long term, including in different episodes.
  3. Conceptualization of the base goal - the model needs to understand what the goal is in order to successfully play along in training.
  4. Situational awareness - the model needs to understand that it is a model in training. Furthermore, it needs to understand how its actions now could affect its parameter updates, and therefore future states of the world. 

The order in which these develop, and the amount of training that goes on in between the development of foundational properties, will have a large effect on the likelihood of deceptive alignment. Most current analyses of deceptive alignment gloss over the development of foundational properties, which makes me suspicious that the alignment community is missing important considerations.

I describe foundational properties in more detail below:

Goal-Directed Behavior

Having a reasonably clear goal for states of the external world is a crucial property for the development of deceptively aligned AI. An internal goal like predicting the next token accurately would likely be insufficient. There are a few different levels of goal-directedness that gradient descent could find. This is all on a spectrum, but that doesn't change the implications. In the following list, each is intended to be a subset of the previous: 

  1. All models - the vast majority of these would be incoherent, arbitrary parameter sets that don’t have anything remotely like a goal.
  2. Models that exhibit at least some coherent behavior toward external goals in some situations, but are otherwise largely random.
  3. Models that have somewhat coherent behavior/values relating to the external world, at least at the rough level of humans. Their values may be extremely difficult or impossible to describe fully in an explicit way, and don’t incentivize tiling the universe with anything. These values may involve non-consequentialist considerations. 
  4. Models that are maximizing an objective function, such that they would want to tile the universe with something. This kind of model is the most likely to take extreme actions to avoid changing its objective function. 

I expect each of these levels to be a very small subset of the previous bucket, such that it would take a lot of optimization power to progress from one to the next. It’s hard to imagine deceptive alignment emerging before the third level of goal-directed behavior, and it is most likely at the fourth level. The fourth level seems more likely to develop for applications with a narrow goal, such as optimizing profit, than for applications that try to imitate fuzzy things like human values. 

Optimizing over a long goal horizon and across episodes

This is a factor that has frequently been discussed elsewhere. If the model doesn’t care about its goals over long time horizons, or about how much reward it will get in future episodes after its current prompt and activations have been cleared, then it has no incentive to become deceptively aligned. 

Conceptualization of the base goal

In order to be deceptive, the model needs to have some understanding of the difference between its proxies and the training objective. Otherwise, it won’t know how to target the training objective rather than its proxies during training.

Awareness that it is a model in training

In order for a model to be deceptively aligned, it has to understand that it is in a model training process. Furthermore, it has to understand that it won’t be able to accomplish its current long-term goals if it allows gradient updates to move its parameters away from those goals. 

Ways the order of development for foundational properties might affect likelihood of deceptive alignment

Before reading on, I encourage readers to think about how likely each order of development of these foundational properties is for TAI. Doing so first may help avoid motivated reasoning toward particular conclusions about how likely deceptive alignment is. 

Conceptualization of the base goal vs. goal-directed behavior

If the model is able to conceptualize the base goal before it becomes significantly goal-directed, then deceptive alignment is unlikely. A key idea of deceptive alignment is that the model starts by learning a proxy that becomes its internal goal, with the implication that the model optimizes hard for that one proxy in a coherent way. But if the model understands the base goal significantly before it has a mesa-objective, then, assuming differential adversarial training, there will be a strong incentive to optimize for the base goal rather than for a proxy. If an understanding of the base goal has already been computed, there is no reason to create and maintain a separate proxy goal when the model could simply point at the base goal. 

I don’t think this conclusion relies on the understanding of the base goal being especially complete and specific. Rather, the model would need a representation of what it is trying to do at a high level (e.g., “Satisfy what the human wants me to do within strict moral boundaries”). The details of what goes into that, such as the nature of the moral boundaries, probably don’t need to be perfect and could likely update over time. It might look more like a corrigibly aligned model than a robustly aligned model. 

If conceptualization of the base goal and goal-directed behavior emerge at essentially the same time, deceptive alignment also seems unlikely. If a model converged on a solid representation of the base goal, and became goal-directed in parallel, then differential adversarial examples would provide an incentive for gradient descent to optimize for the base goal. 

The most dangerous case is when the model develops a proxy objective before it understands the base goal. I think there’s virtually no chance of deceptive alignment if the model hasn’t reached the third level of goal directedness described above, and I'm quite uncertain whether deceptive alignment would emerge in the vicinity of the third level. The fourth level is the most dangerous by far. 

I’ve seen some arguments that rely on the idea that deceptive alignment will occur not in the parameters but in the activations. This doesn't change the conclusion of this section. If the model figured out what you wanted it to do in the activations before it became significantly goal-directed, gradients could point toward parameters that accomplish those objectives once they have been inferred. The model can only act on its activations through its parameters, and the activations are themselves computed from the inputs via the parameters. 

Even if this all happened through a set of parameters that use very general reasoning, the model would likely have to have some way of representing what the training process wanted and would act on that based on the parameters. If it understood the base goal via the activations, gradient descent would have an incentive to make the parameters care more about doing that, assuming that the model has not yet become deceptive. 

Conceptualization of the base goal vs. optimizing across long-term goal horizons and situational awareness

If a model has long-term goal horizons, then it must be goal-directed. However, it’s possible for the model to develop foundational properties in the following order: 

  1. Goal-directed behavior 
  2. Conceptualization of the base goal
  3. Optimizing across episodes/long-term goal horizons

It’s also possible for the model to develop situational awareness in between goal-directed behavior and long-term, cross-episode goals. In either case, there will be a period of time when the model wants to optimize for a flawed proxy and understands the base goal but does not yet have the capabilities necessary for deceptive alignment. Because the model is not yet deceptive, differential adversarial examples in this part of training should directly modify the model’s internal objective. If this period lasts for a significant number of differential adversarial examples, the model should be able to update its goals to point at its conception of the base objective.  

The most dangerous case, which underpins the arguments described in the deceptive alignment literature, is if situational awareness and long-term, cross-episode goals develop before or around the same time as an understanding of the base goal. 

The argument presented in this section still holds if the model is gaining these properties from the activations. The moment at which the model can figure out what you’re trying to get it to do from the activations, the moment it can be prompted to have long-term goals, and the moment it can conceptualize that it is a model in training from the activations may not happen at the same time.

Research directions that may be useful if these considerations are important

Adversarial examples from the start

Providing training inputs that make the model fail, from the very start of training, seems promising. These inputs should correlate with differential adversarial examples. Such examples could speed up conceptualization of the base goal and delay goal-directed behavior; both would make deceptive alignment less likely. 
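One way to operationalize "adversarial examples from the start" is to reserve a fixed fraction of every training batch for known failure-inducing inputs, rather than introducing them only in later phases. The sketch below is a minimal, hypothetical batching scheme; the data pools and the fraction chosen are illustrative assumptions, not a prescription.

```python
import random

# Hedged sketch: mix failure-inducing ("adversarial") inputs into every batch
# from the first training step onward. The pools here are placeholder integers
# standing in for real training examples.

def make_batch(ordinary_pool, adversarial_pool, batch_size=8, adv_fraction=0.25):
    # Reserve a fixed fraction of each batch for adversarial inputs so the
    # model encounters them throughout training, not only once it is capable.
    n_adv = max(1, int(batch_size * adv_fraction))
    batch = random.sample(adversarial_pool, n_adv)
    batch += random.sample(ordinary_pool, batch_size - n_adv)
    random.shuffle(batch)
    return batch

# Hypothetical data: ordinary inputs are 0-99, adversarial inputs are 100-119.
ordinary_pool = list(range(100))
adversarial_pool = list(range(100, 120))
batch = make_batch(ordinary_pool, adversarial_pool)
assert any(x >= 100 for x in batch)  # every batch contains adversarial inputs
```

The design choice being illustrated is only the scheduling: adversarial pressure is constant from step one, so there is no early window in which the model is trained purely on inputs where proxy and base goals coincide.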

Use non-maximizing targets like following human directions while adhering to key non-consequentialist ethical considerations

Non-maximization targets should make a model less likely to become an aggressive maximizer for a simple proxy and are probably closer to what we want anyway. Maximization is inherently dangerous. 

Find ways to get the model to understand the base goal better in earlier phases of the process 

In the same way a very sophisticated model may be able to gain situational awareness from the prompt, a less-sophisticated model may be able to gain a good understanding of the base goal from the prompt. Pushing the model to understand the base goal early in training by using tools like prompt engineering is a promising way to reduce the likelihood of deceptive alignment. 

Delay optimizing long-term goal horizons and across episodes

If a model doesn't have long-term or cross-episode goals, it will have no reason to act deceptively. Delaying this will reduce the likelihood of deceptive alignment. 

Conclusion
The model attributes described in this post and the order in which they appear are important for the likelihood of TAI becoming deceptively aligned. In particular, the model is much more likely to become deceptive if all of the following are true: 

  1. The model becomes significantly goal-directed before it understands what the goal of the training process is.
  2. The model develops situational awareness before or around the same time as it understands the base goal.
  3. The model develops long-term, cross-episode goals before or around the same time as it understands the base goal.

This insight yields important considerations for evaluating the likelihood of deceptive alignment and ideas for reducing the risk. 
