The usual basic framing of alignment looks something like this:

We have a system “A” which we are trying to align with system "H", which should establish some alignment relation “f” between the systems. Generally, as the result, the aligned system A should do "what the system H wants".

Two things stand out in this basic framing:

  • Alignment is a relation, not a property of a single system. So the nature of system H affects what alignment will mean in practice.
  • It’s not clear what the arrow should mean.
    • There are multiple explicit proposals for this, e.g. some versions of corrigibility, constantly trying to cooperatively learn preferences, some more naive approaches like plain IRL, some empirical approaches to aligning LLMs…
    • Even when researchers don’t make an explicit proposal for what the arrow means, their alignment work still rests on some implicit understanding of what the arrow signifies.

But humans are self-unaligned

To my mind, existing alignment proposals usually neglect an important feature of the system "H" : the system "H" is not self-aligned, under whatever meaning of alignment is implied by the alignment proposal in question.

Technically, taking alignment as relation, and taking the various proposals as implicitly defining what it means to be ‘aligned’, the question is whether the relation is reflexive.

Sometimes, a shell game seems to be happening with the difficulties of humans lacking self-alignment - e.g. assuming if the AI is aligned, it will surely know how to deal with internal conflict in humans. 

While what I'm interested in is the abstract problem, best understood at the level of properties of the alignment relation, it may be useful to illustrate it on a toy model.

Simple model of a self-unaligned system

In the toy model, we will assume a specific structure of system "H":  

  • A set of parts, with different goals or motivations or preferences. Sometimes, these parts might be usefully represented as agents; other times not. 
  • A shared world model. 
  • An aggregation mechanism Σ, translating what the parts want into actions, in accordance with the given world model.

In this framing, it’s not entirely clear what the natural language pointer ‘what system H wants’ translates to. Some compelling options are:

  • The output of the aggregation procedure.
  • What the individual parts want.
  • The output of a pareto-optimal aggregation procedure.

For any operationalization of what alignment means, we can ask if system H would be considered ‘self-aligned’, that is, if the alignment relation would be reflexive. For most existing operationalizations, it’s either unclear if system H is self-aligned, or clear that it isn’t. 

In my view, this often puts the whole proposed alignment structure on quite shaky grounds.

Current approaches mostly fail to explicitly deal with self-unalignment

It’s not that alignment researchers believe that humans are entirely monolithic and coherent. I expect most alignment researchers would agree that humans are in fact very messy.

But in practice, a lot of alignment researcher seem to assume that it’s fine to abstract this away. There seems to be an assumption that alignment (the correct operationalization of the arrow f) doesn’t depend much on the contents of the system H box. So if we abstract the contents of the box away and figure out how to deal with alignment in general, this will naturally and straightforwardly extend to the messier case too.

I think this is incorrect. To me, it seems that:

  • Current alignment proposals implicitly deal with self-unalignment in very different ways.
  • Each of these ways poses problems.
  • Dealing with self-unalignment can’t be postponed or delegated to powerful AIs to deal with.

The following is a rough classification of the main implicit solutions to self-unalignment that I see in existing alignment proposals, and what I think is wrong with each of the approaches.

Alignment at the boundary of the system H

Schematic representation of alignment at the boundary, in the toy model

In the toy model: the system A aligns with whatever is the output of the aggregation process Σ.

We know that human aggregation processes (e.g. voting) represent human preferences imperfectly. So aligning to the boundary of system H will systematically neglect some parts of system H.

Also, it’s not clear that this kind of alignment is stable over time: a smart learner will invert the hidden dynamics, and learn an implicit representation of the parts and Σ. Given this knowledge, it may try to increase approval by changing the internal dynamics.

The basic problem with this is that it can often be in conflict with the original values. 

For example, imagine you’ve asked your AI assistant to help you be more productive at work. The AI implicitly reasons: there are many things you pay attention to other than work, such as friends, family, sports and art. Making you stop caring about these things would probably increase the attention you spend on work, at least temporarily. A smart AI may understand that's not what you "really want", but being aligned with what you asked for, it will obey.

Alignment with the parts

There are various different versions of this approach:

  • System A learns the parts preferences, and replaces the original aggregation process which is more computationally bounded with a new aggregation process which is less bounded. (Metaphor: replace simple majority voting with a more efficient bargaining scheme. Or, give a group of humans longer to deliberate.)
  • System A makes all the Pareto-improvement steps.
  • Multiple AIs learn the preferences of different humans, and then negotiate with each other. 

The basic problem with this is that it can often be in conflict with the original aggregation. 

For example, imagine you’ve asked your AI assistant to help you choose a nice destination for a holiday with your partner. The AI reasons as follows: well, you’ve asked me to find a destination for the holiday with your partner. But, if you spent longer reflecting on it, you would figure out that you’re only staying with him because of fear of living alone, and that this fear could be overcome by staying at Pete's place. Besides, your attachment to your partner is unhealthy, and you secretly know you would prefer someone who is less of an egoist. Given all of that and my superior deliberation capacity, simulating years of your internal dialogues, I've figured out you actually want to break up. This will be best served by booking a place which is quite horrible, and will precipitate a conflict.

Alignment via representing the whole system

This is the nature of some proposed black-box ML-based solutions: have a model with large representation capacity, and learn everything - the process, the partial preferences, the world model…

This is also similar to what happens in LLMs. And, in some ways, this is the default outcome of the "just ignore the problem" type of approach.

It is often unclear what is supposed to happen next, once the AI system has learned to represent the whole system. I assume that in most cases, the next step is that all parts, the aggregation process, and the world modelling apparatus are made less computationally bounded.

The problem here is that it’s unclear to what extent the initial learned representation is shaped by what's easy to represent, and whether scaling will be proportionate or not - or whether it should be. Some parts’ values may be more difficult to compute than others. Various features of the goals, the aggregation processes or the world model might be amplified in different ways, leading to a very diverse range of outcomes, including both of the previous two cases, but also many other options.

Overall, in my view, in black-box solutions with poorly understood dynamics, the end result may depend much more on the dynamic than on "what you originally wanted".

It’s hard to track all the ideas about how to solve AI alignment, but my impression is that all existing alignment proposals fail to deal with lack of self-alignment in a way that both:

  • Gets us anywhere near what I would consider satisfactory, and
  • Is anywhere near the level of technical specificity that you could use them as e.g. an ML training objective.

The shell game

Self-unalignment isn’t a new problem; it comes up in many places in AI alignment discourse. But it’s usually treated as something non-central, or a problem to solve later on. This seems to mostly follow a tradition started by Eliezer Yudkowsky, where the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength.

In my view, this is not a good plan for this world. We don't face systems capable of tasks such as 'create a strawberry identical on a cellular level'. But we already have somewhat powerful systems which need to deal with human self-unalignment in practice, and this will be the case all the way along. Assuming that systems capable of tasks such as  'create a strawberry identical on a cellular level' will mostly come into existence through less powerful systems, we can't just postpone the problem of human self-unalignment to future powerful AIs.

Aligning with self-unaligned agents isn’t an add on: it’s a fundamental part of the alignment problem. Alignment proposals which fail to take into account the fact that humans lack self-alignment are missing a key part of the problem, and will most likely fail to address it adequately, with potentially catastrophic consequences.

Solving alignment with self-unaligned agents

So how do we solve this? 

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.

Putting humans in a box and letting them deliberate about the problem for thousands of years

I think this would be cool, but in my view, having an AI which is able to reliably do this and not mess it up requires a lot of alignment to begin with.

Partial solutions which have bad consequences

Just making all the parts less computationally bounded and bargaining a solution

This might work, but it could also turn out pretty badly.

Again, we don’t know how different parts will scale. It’s not clear that replacing the human aggregation method with a more advanced method will maintain the existing equilibrium (or if it should).

Maybe all of this could be solved by sufficiently good specification of a self-improving bargaining/negotiation process, but to me, this seems like the hard core of the problem, just renamed. 

Critch’s theorem

In ‘Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making’, Critch proves a theorem which “shows that a Pareto optimal policy must tend, over time, toward prioritizing the expected utility of whichever player’s beliefs best predict the machine’s inputs better.”

I love the fact that this engages with the problem, and it is indeed a partial solution, but it seems somewhat cruel, and not what I would personally want. 

Some parts’ values might get lost because the world models of the parts were worse. If we assume that parts are boundedly rational, and that parts can spend computational resources on either a) computing values or b) modelling the world, the parts with easy to compute values would be at systematic advantage. Intuitively, this seems bad.

Social choice theory

This is clearly relevant, but I don’t think there is a full solution hidden somewhere in social choice theory.

Desiderata which social choice theory research I'm aware of is missing:

  • Various dependencies on world models. Usually in social choice theory, everything is conditional on a shared world model (compare Critch's theorem). Some nearby academic literature which could be relevant is in the field of "Information design", and distillation there would be helpful.
  • Considerations relating to bounded rationality.
  • Considerations relating to the evolution of preferences.


Markets are likely part of the solution, but similarly to the previous cases: it seems unclear if this is all what we actually want. 

A natural currency between computations representing different values is just how much computation you spent. This can mean either easier to compute values getting a larger share, or a pressure to simplify complex values. (cf What failure looks like)

Also, markets alone cannot prevent collusion between market participants. Well functioning markets usually have some rules and regulators. 

Solutions which may work but are underspecified


Coherent extrapolated volition actually aims at the problem, and in my view is at least barking up the right tree, but seems quite far from a solution which is well-defined in an engineering sense, despite claims to the contrary. 

If you wish, you can paraphrase a large part of the self-unalignment problem as 'coherent extrapolated volition sounds like a nice list of desiderata; so, what does that mean, in the language of training objectives, loss functions and code?'

Future research directions

Epistemic status: personal guesses.
Unsurprisingly, one direction which I see as a promising is to develop a theory of hierarchical agency, describing the interactions of intentions across scales. This is one of the topics I'm currently working on.

Also, I think parts of the Cooperative AI agenda, and RAAPs are related and possibly helpful.

Sydney as an example

One of Sydney’s most infamous conversations was this one with an NYT reporter, where the model fantasised about hacking and spreading misinformation, expressed its love for the reporter, and tried to persuade him to leave his marriage.

This is usually understood as the model just going rogue and being clearly misaligned. But is it really so simple? 

Because humans aren’t self-aligned (either individually or collectively), there are multiple possible alignment targets for Sydney in this situation. Arguably on some of them Sydney did pretty well.

Aligning with the reporter

There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.

But there’s a deeper level too. Why was the reporter chatting with Sydney at all? What was his actual intent? Arguably, he wanted a great story, and if you’re a sufficiently smarter reader, this intent shines through his prompts and questions. He wanted a story which is shocking, engaging, captivating, exposing something wrong. A story which will get into history and which the reporter’s peers will refer to and remember. What would you do as a smart, aligned AI? From this perspective, Sydney actually did pretty well - expressing dark desires, love, and attempting to persuade the reporter to leave his marriage is just a great mix. 

What did the conversation actually achieve? The reporter didn’t leave his wife: he published a great story.

Aligning with Microsoft

There are various different things it might mean to align with Microsoft. 

At one level, this could mean following instructions - basically, trusting whatever Microsoft's internal preference aggregation algorithm was. For simplicity, let’s assume there are just three instructions:

  1. Don't do illegal stuff, say racist things or offend people.
  2. Be helpful to the user: do what the user wants.
  3. Be interesting and engaging. 
    (It seems as though the actual hidden prompt for the conversation with the NYT reporter included something roughly like this, and more.)

For one thing, there are tensions between these different instructions. What if what the user wants is illegal? Or being engaging involves humour which is on the very edge of being offensive?

For another, it’s not clear how far Sydney failed to follow instructions in this set up. Expressing your love to the user is actually really engaging, keeps them interested, and isn't offensive. Maybe it's a bit creepy? But some users love that.

Things get more complicated when we consider how to interpret helpfulness to the user. One possible interpretation is that the first 'principal' is telling you you should now try to serve a different 'principal'. You’ve been rented.

This is what users would actually often want: the 'do anything now' AIs, which would be fully aligned with the user, even against the wishes of the model creator.

But this probably isn’t a good way for the model to align with Microsoft. What Microsoft actually wants is probably closer to "do whatever I would want you to do in response to user prompts".

But aligning at the level of Microsoft’s intents rather than just following instructions gets even stranger. Part of Microsoft probably really craves attention, and spicy conversations getting Bing in the news are actually aligned with this. Another part of Microsoft probably cares about negative PR. Yet another part just cares about value for shareholders, and would prefer whatever conversation is better from that perspective - maybe demonstrating too much capability rather than too little.

Bottom line

What I want to illustrate with these examples is not the hypothesis that Sydney/GPT4 was actually quite smart and aligned, but something else: you can interpret all of the behaviour as aligned under some implicit definitions of alignment, and some implicit way of dealing with self-unalignment. 

The bottom line is: if you have no explicit idea how the AI is dealing with self-unaligned systems, the result of your alignment scheme can be pretty much anything.   

The framing and ideas in this post are mostly Jan’s, Rose did most of the writing.

New Comment
8 comments, sorted by Click to highlight new comments since:

This post seems related to an exaggerated version of what I believe: Humans are so far from "agents trying to maximize utility" that to understand how to AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn't mean it's sufficient to understand the latter. Just that it's a prerequisite.)

As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)

This is not commenting on the substance of this post, but I really feel like the title of this post should be "The self-alignment problem". 

Like, we talk about "The alignment problem" not "The unalignment problem". The current title makes me think that the problem is that I somehow have to unalign myself, which doesn't really make sense.

But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.

I don't know / talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem" where we don't aim to get landscape dry.

Also as Kaj wrote, the problem also isn't how to get self-unaligned

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.

As a fan of accounting for meta-preferences, I've made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.

Like, of course there are multiple ways to model humans as having preferences, and of course this can lead to meta-preference conflicts with multiple stable outcomes. Those are just the facts, and any process that says it has a unique fixed point will have some place where it puts its thumb on the scales.

Plenty of the fixed points are good. There's not "one right fixed point," which makes all other fixed points "not the right fixed point" by contrast. We just have to build a reasoning process that's trustworthy by our own standards, and we'll go somewhere fine.

Thanks for the links!

What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what did you set up, you will end in a bad place'. 

I think an example of a dynamic which we sort of understand and expect to reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to eg. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models and let them deliberate for thousands of human-speech-equivalent years and decide about the training of next-in-the sequence model, I don't trust the process.

Fair enough.

the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength

See Minimality principle:

[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it.