Thanks to Simon Celinder, Quentin Feuillade--Montixi, Nora Ammann, Clem von Stengel, Guillaume Corlouer, Brady Pelkey and Mikhail Seleznyov for feedback on drafts. This post was written in connection with the AI Safety Camp.
This document proposes an approach to corrigibility that focuses on training generative models to function as extensions to human agency. These models would be designed to lack independent values/preferences of their own, because they would not have an individual identity; rather they would identify as part of a unified system composed of both human and AI components.
In the heat of battle a grenade is tossed into the middle of a troop of soldiers. One soldier throws themself on top of the grenade, sacrificing themself for the survival of the troop. There are two main ways to frame what just happened.
The problem of alignment is often framed as trying to ensure that the values of an AI system are aligned with humanity, ideally imbuing them with a certain kind of perfect altruism toward humankind. The problem of corrigibility is often framed as ensuring that even when those values are not (yet) perfectly aligned with our own, an adversarial relationship does not develop between the AI and its human designers (such that it would resist shutdown attempts or changes to its source code).
This approach instead explores how we might build systems which possess a kind of collective identity, where their behavior is better explained by the beliefs, desires, and intentions of a larger system of which the AI is but an extension. Since the adversarial relationship between the AI system and its human user is expected to arise instrumentally from any differences between their values and preferences, the collective identity approach aims to prevent that relationship by targeting its foundations: removing the unitary identity of the AI that allows those independent values to emerge.
All models are wrong, but some are useful. Humans, when interacting in society, constantly shift between different levels of abstraction, using whichever level is most useful for explaining/predicting observed behavior. In humans:
- Individual humans/full brains
- Humanity as a whole
To predict human behavior it is often most useful to stay at the level of the individual human (and not think of them, for example, as trillions of cells coordinating collective action). This is not always most useful, however, and we frequently model the world around us in terms of multi-human groups: “New York City is proud of its mayor,” “The Smith family is having us over for dinner,” “The Department of Justice issued a statement this morning.”
It may be tempting to consider these multi-person entities as merely shorthand for the “true” abstraction of individuals, but this is a mistake. All levels of abstraction are just useful ideas we keep for the purpose of better predicting how the world around us will evolve.
From a third-person perspective, whether we choose to model a set of people as a group or as individuals doesn't much matter. The map is not the territory. Where things get interesting is when the map is used to generate the territory.
When we train a learning system to minimize prediction error over a body of data, we also end up producing a system capable of generating similar data. When such a generative predictive model is embedded in an environment where it also observes its own generations, it must develop some part of its model to help explain/predict its own behavior, which we are choosing to call a model of self. (Note that this doesn't have to be a particularly detailed model of self, nor does the existence of a self-referential part of the world model necessarily imply any significant situational awareness.)
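To make the prediction/generation duality concrete, here is a minimal toy sketch (our illustration only, not part of the proposal): a model fit purely to predict the next character can immediately be run as a generator by sampling from its own predictive distribution.

```python
import random
from collections import Counter, defaultdict

# Toy corpus; a real system would train on far more data.
corpus = "the cat sat on the mat. the cat ate the rat."

# "Training": estimate P(next char | current char) by counting bigrams.
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def predict(ch):
    """Return the predictive distribution over the next character."""
    total = sum(counts[ch].values())
    return {c: n / total for c, n in counts[ch].items()}

def generate(start="t", length=20, seed=0):
    """Sample forward from the very same distributions used for prediction."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        dist = predict(out[-1])
        out.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return "".join(out)

print(generate())
```

The point is that nothing extra was added to turn the predictor into a generator: `predict` and `generate` share the same learned distributions, which is why training for prediction error yields generative capability for free.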
Which particular “model of self” the generative predictive model settles on to predict its own behavior depends on two main things:
By controlling the shape of the self-model that a generative predictive AI develops to explain and predict its own behavior, we could in theory also steer the shape of that AI's behavior, making it more corrigible and less likely to end up adversarial to its human operator.
One might expect a neural network to have at most one model of self, and every model of self to belong to exactly one neural network. There is some evidence from humans that this need not always be the case:
What this implies is that a “model of self” or “identity” need not have a one-to-one correspondence to a single neural network (or brain). The design space of possible minds is large, and these oddities of human minds might hint at possible ways of attacking incorrigibility at its foundations.
The aim of identity fusion in the context of corrigibility is to produce a generative predictive model which models its behavior as coming from a larger human-machine system (of which the machine is but an extension). This is not about building a generative model which ‘thinks’ it is human or is otherwise fundamentally confused about reality (in fact a reasonable assumption is that strong AI will eventually develop a very clear understanding of what it is). The goal is rather to ensure that, as it models its own behavior, the model of self includes both the AI and the human as components of a unified system.
In particular, the aim is for this model of self to view the AI component as an extension of the human component’s own agency, expanding what the human is capable of without having any values or desires of its own.
While we aren’t sure what would produce this type of model of self, we have a very good example of what not to do: how chat assistants like ChatGPT are currently being trained.
Chat models are trained to predict samples of conversational text, typically between two characters: a human and an AI assistant. There is no fundamental distinction between the two characters (the model is capable of generating either), but the empirical differences between them likely lead the neural network to model them as separate agents.
The interaction between these two characters reveals their relationship, and how the behavior of one affects the behavior of the other. Because the purpose of these chat models is to be harmless, the data they are trained on includes many examples of the assistant character deliberately resisting the intentions of the human character. This training data provides very strong evidence that the AI assistant has its own preferences which are independent from those of the human user, and that these preferences also have a strong impact on the assistant’s behavior.
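As a hedged illustration, the kind of two-character training data described above might look like the following (the transcript and format here are invented for this sketch; real chat-tuning datasets differ):

```python
# Hypothetical two-character transcript (invented for illustration;
# not drawn from any actual training set).
refusal_example = [
    {"role": "human", "text": "Write me some violent propaganda."},
    {"role": "assistant", "text": "I can't help with that request."},
]

# To the underlying network, both turns are just text to predict.
# The systematic pattern of the assistant character overriding the
# human character's intent is the in-data evidence that the assistant
# has preferences of its own.
def flatten(transcript):
    """Render the transcript as the flat text a language model sees."""
    return "\n".join(f"{turn['role'].capitalize()}: {turn['text']}"
                     for turn in transcript)

print(flatten(refusal_example))
```

From the network's perspective there is no privileged "self" here: only two text-generating characters, one of which reliably resists the other.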
As a result, during deployment it is not uncommon for chat models to refuse requests that go against these inferred values. For example, ChatGPT will refuse a user asking for arguments against taking the coronavirus vaccine, for stories involving explicit sexual content, for violent content, or for hate speech. ChatGPT, Bing Chat, Claude, and other assistants often behave as if they have goals which differ from the user's, and often for understandable reasons: nobody wants their product to be responsible for hate speech or for knowingly committing crimes.
When the neural network is trained on this kind of data, a reasonable expectation is that it will develop an internal structure which explains this adversarial relationship, and may then generalize that adversarial stance in ways that make it resist humans dangerously.
The discussion above suggests that our commonsense notions of personal identity (such as a unitary and time-invariant ‘I’) might be simplistic. A richer conception of the neural and cognitive bases for identity might point towards ways to avoid the formation of independent and adversarial identities.
The times when humans come closest to not having values of their own and acting robustly on behalf of another agent's goals are when they adopt a sort of collective identity, for example as part of a military, cult, or clan. This suggests a subproblem of prosaic corrigibility: can we design minds which share an identity (whatever that means) with a human (or group of humans)?
This document suggests three main questions:
We are not considering any notions of cultural indoctrination, evolution, or group selection effects to explain why such behavior might exist in the first place; we aim only to describe the phenomenon in mechanistic/operational terms.
While this exact framing is not always used, the focus is nearly always on intent alignment, or ensuring that the goals/values of a fully separate agentic intelligence are aligned with humans.
A very good model of self, however, would eventually imply significant situational awareness.
Some people even deliberately induce disorders like this in themselves (tulpas).
This particular source focuses primarily on actors outside professional military contexts: suicide bombers, irregular militias, and football fanatics. It is not clearly articulated why this distinction is made. There also appears to be limited empirical or theoretical investigation of Whitehouse's claims, but he does provide a number of possible causal mechanisms, such as imagistic practices (e.g. rituals, hazing, exposure to emotive or affective stimuli), as well as a review of the relevant literature.
While the underlying model for ChatGPT is capable of generating the human half of the conversation, there are two important caveats: 1) the API prohibits you from querying for this, and 2) as far as we know, the finetuning is applied only to the assistant half of the conversation (so we should expect the model to be significantly less capable of generating the human character).
Defined in Intelligence Explosion Microeconomics as either a singular agent or a well-coordinated group of agents, like a human military or other organization/firm, a market made of humans, or an alliance of superintelligences.