The self-unalignment problem

Jan_Kulveit; rosehadshar

The usual basic framing of alignment looks something like this:

We have a system “A” which we are trying to align with system "H", which should establish some alignment relation “f” between the systems. Generally, as the result, the aligned system A should do "what the system H wants".

Two things stand out in this basic framing:

Alignment is a relation, not a property of a single system. So the nature of system H affects what alignment will mean in practice.
It’s not clear what the arrow should mean.
- There are multiple explicit proposals for this, e.g. some versions of corrigibility, constantly trying to cooperatively learn preferences, some more naive approaches like plain IRL, some empirical approaches to aligning LLMs…
- Even when researchers don’t make an explicit proposal for what the arrow means, their alignment work still rests on some implicit understanding of what the arrow signifies.

But humans are self-unaligned

To my mind, existing alignment proposals usually neglect an important feature of the system "H" : the system "H" is not self-aligned, under whatever meaning of alignment is implied by the alignment proposal in question.

Technically, taking alignment as relation, and taking the various proposals as implicitly defining what it means to be ‘aligned’, the question is whether the relation is reflexive.

Sometimes, a shell game seems to be happening with the difficulties of humans lacking self-alignment - e.g. assuming if the AI is aligned, it will surely know how to deal with internal conflict in humans.

While what I'm interested in is the abstract problem, best understood at the level of properties of the alignment relation, it may be useful to illustrate it on a toy model.

In the toy model, we will assume a specific structure of system "H":

A set of parts p1..pn, with different goals or motivations or preferences. Sometimes, these parts might be usefully represented as agents; other times not.
A shared world model.
An aggregation mechanism Σ, translating what the parts want into actions, in accordance with the given world model.

In this framing, it’s not entirely clear what the natural language pointer ‘what system H wants’ translates to. Some compelling options are:

The output of the aggregation procedure.
What the individual parts want.
The output of a pareto-optimal aggregation procedure.

For any operationalization of what alignment means, we can ask if system H would be considered ‘self-aligned’, that is, if the alignment relation would be reflexive. For most existing operationalizations, it’s either unclear if system H is self-aligned, or clear that it isn’t.

In my view, this often puts the whole proposed alignment structure on quite shaky grounds.

Current approaches mostly fail to explicitly deal with self-unalignment

It’s not that alignment researchers believe that humans are entirely monolithic and coherent. I expect most alignment researchers would agree that humans are in fact very messy.

But in practice, a lot of alignment researcher seem to assume that it’s fine to abstract this away. There seems to be an assumption that alignment (the correct operationalization of the arrow f) doesn’t depend much on the contents of the system H box. So if we abstract the contents of the box away and figure out how to deal with alignment in general, this will naturally and straightforwardly extend to the messier case too.

I think this is incorrect. To me, it seems that:

Current alignment proposals implicitly deal with self-unalignment in very different ways.
Each of these ways poses problems.
Dealing with self-unalignment can’t be postponed or delegated to powerful AIs to deal with.

The following is a rough classification of the main implicit solutions to self-unalignment that I see in existing alignment proposals, and what I think is wrong with each of the approaches.

Alignment at the boundary of the system H

Schematic representation of alignment at the boundary, in the toy model

In the toy model: the system A aligns with whatever is the output of the aggregation process Σ.

We know that human aggregation processes (e.g. voting) represent human preferences imperfectly. So aligning to the boundary of system H will systematically neglect some parts of system H.

Also, it’s not clear that this kind of alignment is stable over time: a smart learner will invert the hidden dynamics, and learn an implicit representation of the parts and Σ. Given this knowledge, it may try to increase approval by changing the internal dynamics.

The basic problem with this is that it can often be in conflict with the original values.

For example, imagine you’ve asked your AI assistant to help you be more productive at work. The AI implicitly reasons: there are many things you pay attention to other than work, such as friends, family, sports and art. Making you stop caring about these things would probably increase the attention you spend on work, at least temporarily. A smart AI may understand that's not what you "really want", but being aligned with what you asked for, it will obey.

Alignment with the parts

There are various different versions of this approach:

System A learns the parts preferences, and replaces the original aggregation process which is more computationally bounded with a new aggregation process which is less bounded. (Metaphor: replace simple majority voting with a more efficient bargaining scheme. Or, give a group of humans longer to deliberate.)
System A makes all the Pareto-improvement steps.
Multiple AIs learn the preferences of different humans, and then negotiate with each other.

The basic problem with this is that it can often be in conflict with the original aggregation.

For example, imagine you’ve asked your AI assistant to help you choose a nice destination for a holiday with your partner. The AI reasons as follows: well, you’ve asked me to find a destination for the holiday with your partner. But, if you spent longer reflecting on it, you would figure out that you’re only staying with him because of fear of living alone, and that this fear could be overcome by staying at Pete's place. Besides, your attachment to your partner is unhealthy, and you secretly know you would prefer someone who is less of an egoist. Given all of that and my superior deliberation capacity, simulating years of your internal dialogues, I've figured out you actually want to break up. This will be best served by booking a place which is quite horrible, and will precipitate a conflict.

Alignment via representing the whole system

This is the nature of some proposed black-box ML-based solutions: have a model with large representation capacity, and learn everything - the process, the partial preferences, the world model…

This is also similar to what happens in LLMs. And, in some ways, this is the default outcome of the "just ignore the problem" type of approach.

It is often unclear what is supposed to happen next, once the AI system has learned to represent the whole system. I assume that in most cases, the next step is that all parts, the aggregation process, and the world modelling apparatus are made less computationally bounded.

The problem here is that it’s unclear to what extent the initial learned representation is shaped by what's easy to represent, and whether scaling will be proportionate or not - or whether it should be. Some parts’ values may be more difficult to compute than others. Various features of the goals, the aggregation processes or the world model might be amplified in different ways, leading to a very diverse range of outcomes, including both of the previous two cases, but also many other options.

Overall, in my view, in black-box solutions with poorly understood dynamics, the end result may depend much more on the dynamic than on "what you originally wanted".

It’s hard to track all the ideas about how to solve AI alignment, but my impression is that all existing alignment proposals fail to deal with lack of self-alignment in a way that both:

Gets us anywhere near what I would consider satisfactory, and
Is anywhere near the level of technical specificity that you could use them as e.g. an ML training objective.

The shell game

Self-unalignment isn’t a new problem; it comes up in many places in AI alignment discourse. But it’s usually treated as something non-central, or a problem to solve later on. This seems to mostly follow a tradition started by Eliezer Yudkowsky, where the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength.

In my view, this is not a good plan for this world. We don't face systems capable of tasks such as 'create a strawberry identical on a cellular level'. But we already have somewhat powerful systems which need to deal with human self-unalignment in practice, and this will be the case all the way along. Assuming that systems capable of tasks such as 'create a strawberry identical on a cellular level' will mostly come into existence through less powerful systems, we can't just postpone the problem of human self-unalignment to future powerful AIs.

Aligning with self-unaligned agents isn’t an add on: it’s a fundamental part of the alignment problem. Alignment proposals which fail to take into account the fact that humans lack self-alignment are missing a key part of the problem, and will most likely fail to address it adequately, with potentially catastrophic consequences.

Solving alignment with self-unaligned agents

So how do we solve this?

Partial solutions which likely do not work in the limit

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.

Putting humans in a box and letting them deliberate about the problem for thousands of years

I think this would be cool, but in my view, having an AI which is able to reliably do this and not mess it up requires a lot of alignment to begin with.

Partial solutions which have bad consequences

Just making all the parts less computationally bounded and bargaining a solution

This might work, but it could also turn out pretty badly.

Again, we don’t know how different parts will scale. It’s not clear that replacing the human aggregation method with a more advanced method will maintain the existing equilibrium (or if it should).

Maybe all of this could be solved by sufficiently good specification of a self-improving bargaining/negotiation process, but to me, this seems like the hard core of the problem, just renamed.

Critch’s theorem

In ‘Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making’, Critch proves a theorem which “shows that a Pareto optimal policy must tend, over time, toward prioritizing the expected utility of whichever player’s beliefs best predict the machine’s inputs better.”

I love the fact that this engages with the problem, and it is indeed a partial solution, but it seems somewhat cruel, and not what I would personally want.

Some parts’ values might get lost because the world models of the parts were worse. If we assume that parts are boundedly rational, and that parts can spend computational resources on either a) computing values or b) modelling the world, the parts with easy to compute values would be at systematic advantage. Intuitively, this seems bad.

Social choice theory

This is clearly relevant, but I don’t think there is a full solution hidden somewhere in social choice theory.

Desiderata which social choice theory research I'm aware of is missing:

Various dependencies on world models. Usually in social choice theory, everything is conditional on a shared world model (compare Critch's theorem). Some nearby academic literature which could be relevant is in the field of "Information design", and distillation there would be helpful.
Considerations relating to bounded rationality.
Considerations relating to the evolution of preferences.

Markets

Markets are likely part of the solution, but similarly to the previous cases: it seems unclear if this is all what we actually want.

A natural currency between computations representing different values is just how much computation you spent. This can mean either easier to compute values getting a larger share, or a pressure to simplify complex values. (cf What failure looks like)

Also, markets alone cannot prevent collusion between market participants. Well functioning markets usually have some rules and regulators.

Solutions which may work but are underspecified

CEV

Coherent extrapolated volition actually aims at the problem, and in my view is at least barking up the right tree, but seems quite far from a solution which is well-defined in an engineering sense, despite claims to the contrary.

If you wish, you can paraphrase a large part of the self-unalignment problem as 'coherent extrapolated volition sounds like a nice list of desiderata; so, what does that mean, in the language of training objectives, loss functions and code?'

Future research directions

Epistemic status: personal guesses.

Unsurprisingly, one direction which I see as a promising is to develop a theory of hierarchical agency, describing the interactions of intentions across scales. This is one of the topics I'm currently working on.

Also, I think parts of the Cooperative AI agenda, and RAAPs are related and possibly helpful.

Sydney as an example

One of Sydney’s most infamous conversations was this one with an NYT reporter, where the model fantasised about hacking and spreading misinformation, expressed its love for the reporter, and tried to persuade him to leave his marriage.

This is usually understood as the model just going rogue and being clearly misaligned. But is it really so simple?

Because humans aren’t self-aligned (either individually or collectively), there are multiple possible alignment targets for Sydney in this situation. Arguably on some of them Sydney did pretty well.

Aligning with the reporter

There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.

But there’s a deeper level too. Why was the reporter chatting with Sydney at all? What was his actual intent? Arguably, he wanted a great story, and if you’re a sufficiently smarter reader, this intent shines through his prompts and questions. He wanted a story which is shocking, engaging, captivating, exposing something wrong. A story which will get into history and which the reporter’s peers will refer to and remember. What would you do as a smart, aligned AI? From this perspective, Sydney actually did pretty well - expressing dark desires, love, and attempting to persuade the reporter to leave his marriage is just a great mix.

What did the conversation actually achieve? The reporter didn’t leave his wife: he published a great story.

Aligning with Microsoft

There are various different things it might mean to align with Microsoft.

At one level, this could mean following instructions - basically, trusting whatever Microsoft's internal preference aggregation algorithm was. For simplicity, let’s assume there are just three instructions:

Don't do illegal stuff, say racist things or offend people.
Be helpful to the user: do what the user wants.
Be interesting and engaging.
(It seems as though the actual hidden prompt for the conversation with the NYT reporter included something roughly like this, and more.)

For one thing, there are tensions between these different instructions. What if what the user wants is illegal? Or being engaging involves humour which is on the very edge of being offensive?

For another, it’s not clear how far Sydney failed to follow instructions in this set up. Expressing your love to the user is actually really engaging, keeps them interested, and isn't offensive. Maybe it's a bit creepy? But some users love that.

Things get more complicated when we consider how to interpret helpfulness to the user. One possible interpretation is that the first 'principal' is telling you you should now try to serve a different 'principal'. You’ve been rented.

This is what users would actually often want: the 'do anything now' AIs, which would be fully aligned with the user, even against the wishes of the model creator.

But this probably isn’t a good way for the model to align with Microsoft. What Microsoft actually wants is probably closer to "do whatever I would want you to do in response to user prompts".

But aligning at the level of Microsoft’s intents rather than just following instructions gets even stranger. Part of Microsoft probably really craves attention, and spicy conversations getting Bing in the news are actually aligned with this. Another part of Microsoft probably cares about negative PR. Yet another part just cares about value for shareholders, and would prefer whatever conversation is better from that perspective - maybe demonstrating too much capability rather than too little.

Bottom line

What I want to illustrate with these examples is not the hypothesis that Sydney/GPT4 was actually quite smart and aligned, but something else: you can interpret all of the behaviour as aligned under some implicit definitions of alignment, and some implicit way of dealing with self-unalignment.

The bottom line is: if you have no explicit idea how the AI is dealing with self-unaligned systems, the result of your alignment scheme can be pretty much anything.

The framing and ideas in this post are mostly Jan’s, Rose did most of the writing.

51