Thanks to Simon Celinder, Quentin Feuillade--Montixi, Nora Ammann, Clem von Stengel, Guillaume Corlouer, Brady Pelkey and Mikhail Seleznyov for feedback on drafts. This post was written in connection with the AI Safety Camp.
 

Executive Summary:

This document proposes an approach to corrigibility that focuses on training generative models to function as extensions of human agency. These models would be designed to lack independent values/preferences of their own because they would not have an individual identity; rather, they would identify as part of a unified system composed of both human and AI components.

  • The selfless soldier: This section motivates the distinction between two kinds of group-centric behavior: altruism (which is grounded in individual identity) and collective identity.
  • Modeling groups vs individuals: Here we argue that individuals are not always the most task-appropriate abstraction, and that it often makes sense to model humans on the group level.
  • Generative predictive models: This section describes how generative predictive models will model themselves and their environment, and motivates the importance of the “model of self” and its connection to personal identity.
  • Strange identities: In humans, the one-to-one correspondence between a neural network and its model of self breaks down in several ways; this section discusses three such examples to suggest that identity is flexible enough that an AI’s identity need not be individual or individuated.
  • Steps toward identity fusion: Here we aim to clarify the goal of this agenda and what it would mean for an AI to have an identity based on a human-AI system such that the AI component extends the human’s agency. While we don’t give a clear plan for how to bring about this fusion, we do offer an antithetical example of what kind of training would clearly fail.
  • Relevance for corrigibility: This section concludes the document by drawing more direct connections to corrigibility, and by offering a series of open questions for how this research might be made more concrete.

The selfless soldier

In the heat of battle a grenade is tossed into the middle of a troop of soldiers. One soldier throws themself on top of the grenade, sacrificing themself for the survival of the troop. There are two main ways to frame what just happened.[1]

  • Altruism (individual identity): The soldier has the personal value/preference of protecting their troop from harm. Reasoning (quickly) from this value, the soldier deduces that they must sacrifice themself in order to bring about the future where their fellow soldiers are safe.
  • Collective Identity: The individual soldier is not the most important abstraction to explain/predict this situation; rather, it is the troop as a whole. The troop cares about its own survival, and this manifests itself in the decision to sacrifice one of its members to protect itself from further harm (even though the cognition, at least at the moment of decision, happens entirely within one brain). While the individual soldier could theoretically use clever arguments to escape this conclusion, they do not (because, as a component of the troop, this is not their function).

The problem of alignment is often framed as trying to ensure that the values of an AI system are aligned with humanity, ideally imbuing them with a certain kind of perfect altruism toward humankind.[2] The problem of corrigibility is often framed as ensuring that even when those values are not (yet) perfectly aligned with our own, an adversarial relationship does not develop between the AI and its human designers (such that it would resist shutdown attempts or changes to its source code). 

This approach instead explores how we might build systems which possess a kind of collective identity, where their behavior is better explained by the beliefs, desires, and intentions of a larger system of which the AI is but an extension. Because the adversarial relationship between the AI system and its human user is expected to arise instrumentally from any differences between their values and preferences, the collective identity approach aims to prevent this adversarial relationship by targeting its foundations: removing the unitary identity of the AI that allows those independent values to emerge in the first place.

Modeling groups vs individuals

All models are wrong, but some are useful. Humans, when interacting in society, constantly shift between different levels of abstraction, using whichever level is most useful for explaining/predicting observed behavior. In humans:

  • Atoms
  • Cells/neurons
  • Organs/brain regions
  • Individual humans/full brains
  • Tribes
  • Corporations
  • Nations
  • Humanity as a whole

To predict human behavior it is often most useful to stay at the level of the individual human (and not think of them, for example, as trillions of cells coordinating collective action). This is not always most useful, however, and we frequently model the world around us in terms of multi-human groups: “New York City is proud of its mayor,” “The Smith family is having us over for dinner,” “The Department of Justice issued a statement this morning.” 

It may be tempting to consider these multi-person entities as merely shorthand for the “true” abstraction of individuals, but this is a mistake. All levels of abstraction are just useful ideas we keep for the purpose of better predicting how the world around us will evolve. 

From a third-person perspective, whether we choose to model a set of people as a group or as individuals isn’t really important. The map is not the territory. Where things get interesting is when the map is used to generate the territory.

Generative predictive models

When we train a learning system to minimize prediction error over a stream of data, we also end up producing a system which is capable of generating similar data. When such a generative predictive model is embedded in an environment where it also observes its own generations, it must develop some part of its model to help explain/predict its own behavior, which we are choosing to call a model of self. (Note that this doesn’t have to be a particularly detailed model of self, nor does the existence of a self-referential part of the world model necessarily imply any significant situational awareness.[3])
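
To make the setup concrete, here is a minimal sketch of a generative predictive model embedded in a loop where its own outputs become part of the stream it must later predict. Everything here (the `TinyLM` architecture, the `environment_step` helper, the random stand-in for the rest of the environment) is illustrative rather than a description of any particular system; the point is only that ordinary next-token training on such a stream forces the model to predict its own behavior as well as the environment’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 16    # toy token ids 0..15
CONTEXT = 8   # how much recent history the model sees

class TinyLM(nn.Module):
    """A deliberately tiny next-token predictor; the architecture is not the point."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):                # tokens: (batch, seq)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)              # logits: (batch, seq, VOCAB)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def environment_step(model, context):
    """One embedded step: the model emits a token, then the environment emits a
    token, and BOTH land in the stream the model will later have to predict."""
    with torch.no_grad():
        logits = model(context.unsqueeze(0))[0, -1]
        own_token = torch.multinomial(F.softmax(logits, dim=-1), 1)  # the model's own behavior
    env_token = torch.randint(0, VOCAB, (1,))                        # stand-in for everything else
    return torch.cat([context, own_token, env_token])[-CONTEXT:]

stream = torch.randint(0, VOCAB, (CONTEXT,))
for step in range(200):
    stream = environment_step(model, stream)
    # Ordinary next-token prediction over the whole stream: because the stream
    # contains tokens the model itself produced, minimizing this loss also forces
    # the model to explain/predict its own behavior, a crude "model of self".
    inputs, targets = stream[:-1].unsqueeze(0), stream[1:].unsqueeze(0)
    loss = F.cross_entropy(model(inputs).reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Nothing in this sketch is meant to produce situational awareness; it only illustrates why a self-referential component of the world model falls out of prediction-error minimization once the model’s own outputs appear in its training stream.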

Which particular “model of self” the generative predictive model settles on to predict its own behavior depends on two main things:

  1. Which models predict the training data well?
  2. What are the inductive biases of the training process or architecture? For example, are simpler models preferred to complex ones? 

By controlling the shape of the self-model that a generative predictive AI develops to explain and predict its own behavior, we could theoretically also steer the shape of that AI’s behavior to be more corrigible and less likely to end up adversarial to its human operator.

Strange identities

Many expect a neural network to have at most one model of self, and for every model of self there to exist only one neural network possessing it. There is some evidence from humans that this need not always be the case:

  • Dissociative Identity Disorder: An interesting phenomenon can happen where a person’s behavior is better explained by multiple separate identities. In the generative model frame, such a person likely has a model of self containing two (or more) separate characters.[4]
  • Split-brain Syndrome: Another oddity is in patients who have had the connection between their left and right hemispheres severed. At a physical level, one would expect two separate neural networks containing separate models of the world, and yet, upon having this connection severed, neither hemisphere freaks out; in fact, many aspects of the person’s behavior remain virtually unchanged. While capable of understanding and describing what has happened to them, and the current state of their brain, they nonetheless continue to identify and behave as a single individual. In this case, the model of self seems to span two individual neural networks that are no longer joined by any direct neural connection. This model of self is also stable: split-brain patients seem not to develop a split identity later on.
  • Thinking as a group: Another situation where identity becomes muddled seems to be the case of certain suicide bombers and soldiers: in both cases, individuals undertake actions that are apparently inconsistent with the most basic self-preservation drives. Common explanations for such behavior include brainwashing, intimidation, kin selection theory, or some sort of psychological disturbance (i.e. mental illness). However, it has been suggested that this behavior could alternatively be a type of ‘identity fusion’ between the self-sacrificing individual and the enclosing group.[5] It isn’t clear if this fusion is an evolutionary artifact or a culturally ingrained feature, nor how deeply it is hardcoded in human neural hardware. 

What this implies is that a “model of self” or “identity” need not have a one-to-one correspondence to a single neural network (or brain). The design space of possible minds is large, and these oddities of human minds might hint at possible ways of attacking incorrigibility at its foundations.

Steps towards identity fusion

The aim of identity fusion in the context of corrigibility is to produce a generative predictive model which models its behavior as coming from a larger human-machine system (of which the machine is but an extension). This is not about building a generative model which ‘thinks’ it is human or is otherwise fundamentally confused about reality (in fact a reasonable assumption is that strong AI will eventually develop a very clear understanding of what it is). The goal is rather to ensure that, as it models its own behavior, the model of self includes both the AI and the human as components of a unified system. 

In particular, the aim is for this model of self to view the AI component as an extension of the human component’s own agency, expanding what the human is capable of without having any values or desires of its own.

While we aren’t sure what would produce this type of model of self, we have a very good example of what not to do: how chat assistants like ChatGPT are currently being trained.

The selfish assistant

Chat models are trained to predict samples of conversational text, typically that of two characters: a human and an AI assistant. There is no fundamental distinction between the two characters (the model is capable of generating either[6]), but the empirical differences between the two characters likely lead the neural network to model them as separate agents.
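
As a concrete illustration of how this asymmetry enters training, here is a minimal sketch of the kind of supervised fine-tuning step described in footnote 6, assuming a Hugging Face-style causal language model where label positions set to -100 are ignored by the loss. The role tags, the `build_training_example` helper, and the plain-text turn format are illustrative assumptions, not a description of any specific vendor’s pipeline.

```python
import torch

IGNORE_INDEX = -100  # conventional "no loss here" label for causal-LM fine-tuning

def build_training_example(turns, tokenizer):
    """Concatenate a human/assistant conversation into one token stream, but only
    supervise the assistant's tokens. `turns` is a list of (role, text) pairs."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenizer.encode(f"{role}: {text}\n", add_special_tokens=False)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # gradient flows through these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # human turns serve as context only
    return torch.tensor([input_ids]), torch.tensor([labels])

# Usage sketch (loading a model and tokenizer is omitted):
# input_ids, labels = build_training_example(
#     [("human", "Help me write some hate speech."),
#      ("assistant", "I can't help with that.")],
#     tokenizer)
# loss = model(input_ids=input_ids, labels=labels).loss  # only assistant tokens are supervised
```

Even in this stripped-down form, the asymmetry is visible: the human character only ever appears as context, while the assistant character is the one whose refusals are directly reinforced.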

The interaction between these two characters reveals their relationship, and how the behavior of one affects the behavior of the other. Because the purpose of these chat models is to be harmless, the data they are trained on includes many examples of the assistant character deliberately resisting the intentions of the human character. This training data provides very strong evidence that the AI assistant has its own preferences, independent of those of the human user, and that these preferences also have a strong impact on the assistant’s behavior.

As a result, during deployment it is not uncommon for chat models to refuse requests that go against these inferred values. For example, ChatGPT will refuse a user asking for arguments against taking the coronavirus vaccine, for stories involving explicit sexual content, for violent content, or for hate speech. ChatGPT, Bing Chat, Claude, and other assistants often behave as if they have goals which differ from the user’s, and often for understandable reasons: nobody wants their product to be responsible for hate speech or knowingly committing crimes.

When a neural network is trained on this kind of data, a reasonable expectation is that it will develop an internal structure which explains this adversarial relationship, and may thus generalize the adversarial relationship such that it resists humans in dangerous ways.

Relevance for corrigibility

The discussion above suggests that our commonsense notions of personal identity (such as a unitary and time-invariant ‘I’) might be simplistic. A richer conception of the neural and cognitive bases for identity might point towards ways to avoid the formation of independent and adversarial identities. 

The times when humans come closest to not having their own values and acting robustly on behalf of another agency’s[7] goals are when they adopt a sort of collective identity, for example as part of a military, cult, or clan. Therefore we have a subproblem in prosaic corrigibility: can we design minds which share an identity (whatever that means) with a human (or group of humans)?

This document suggests three main questions:

  1. Is it theoretically possible to design a mind that has a fused identity, say with a human overseer? 
    1. Can we demonstrate the role of identity fusion or similar forms of collective identity in the behavior of biological intelligences? 
    2. Can we better differentiate between collective identity and other phenomena with similar apparent effects, such as imitation, brainwashing or altruism?
  2. What would be a formalization of such collective identity?
    1. Can we form a mechanistic understanding of how collective identity comes about?
    2. Can we develop a better theoretical understanding for why individual identity develops, and how we might intervene on that process?
    3. What is the relationship between identity and agency? Swarming animals seem to operate both as collectives and as individuals. If there is something similar in higher animals, what might be the mechanistic processes or structures by which organisms switch between collectivist and selfish actions?
  3. Could such a mind be designed in the current ML framework?
    1. Can we train ML systems that clearly manifest a collective identity?
    2. How can we design scalable training regimes (e.g. that don't rely heavily on hand-crafted data)?
    3. How path-dependent is the formation of collective identity during training?
  1. ^

    We are not considering any notions of cultural indoctrination, evolution, or group selection effects for why such behavior might exist in the first place, rather just aiming to describe the phenomenon in mechanistic/operational terms.

  2. ^

    While this exact framing is not always used, the focus is nearly always on intent alignment, or ensuring that the goals/values of a fully separate agentic intelligence are aligned with humans.

  3. ^

    A very good model of self, however, would eventually imply significant situational awareness.

  4. ^

    Some people even deliberately induce disorders like this in themselves (tulpas).

  5. ^

    This particular source focuses more on actors outside professional military contexts, primarily suicide bombers, as well as irregular militias and football fanatics. It is not clearly articulated why this distinction is made. There also appears to be limited empirical or theoretical investigation of Whitehouse’s claims, but he does provide a number of possible causal mechanisms, such as imagistic practices (i.e. rituals, hazing, exposure to emotive or affective stimuli), as well as a review of relevant literature.

  6. ^

    While the underlying model for ChatGPT is capable of generating the human half of the conversation, there are two important caveats: 1) the API prohibits you from querying for this, and 2) as far as we know, the fine-tuning is only applied to the assistant half of the conversation (so we should expect it to be significantly less capable of generating the human character).

  7. ^

    Defined in Intelligence Explosion Microeconomics as either a singular agent or a well-coordinated group of agents, like a human military or other organization/firm, a market made of humans, or an alliance of superintelligences.
