This post outlines a thought experiment designed to illustrate a problem with the Parliamentarian version of CEV (PCEV). It then shows that the problem is not inherent in the scenario of the thought experiment, by describing a large space of other designs that do not suffer from this problem. These designs are based on modifying PCEV in a way that gives each individual meaningful influence over the adoption of those preferences that refer to her.

The most recently published version of CEV uses Parliamentarianism to deal with disagreements among individuals who disagree on how to deal with disagreements. Let's refer to the AI that implements this Parliamentarian definition of the CEV of humanity as PCEV. The basic idea of PCEV is that extrapolated delegates, representing billions of humans, will negotiate and vote on what PCEV will do.

PCEV is described in an Arbital entry from 2016.

It links to an Overcoming Bias (OB) post by Bostrom, from 2009, on Parliamentarianism.

The 2009 OB post proposes to use Parliamentarian negotiations for a different purpose. The 2016 Arbital entry proposes to repurpose the idea for CEV.

The idea, which was only briefly mentioned in the short OB post, was not written up until 2021, when Toby Newberry and Toby Ord published The Parliamentary Approach to Moral Uncertainty.

One central aspect of Parliamentarianism, in all three references above, is that delegates must negotiate and vote under the assumption that votes will be settled stochastically (votes are not, however, actually settled stochastically). Without this feature, the thought experiment below does not work. However, removing this feature would result in a fundamentally different type of AI, with a number of other problems. For example, any solid majority, no matter how narrow, would be able to do literally anything they want to everyone else.

The 2021 paper describes more than one version of Parliamentarian negotiations. In a version with multiple separate votes on separate decisions, the resulting AI would sometimes prefer A to B, B to C, and C to A (see, for example, 3.2 in the 2021 paper). Thus, the thought experiment below uses the single-vote version of Parliamentary negotiations. Basically, the extrapolated delegates will vote on what successor AI PCEV should build.
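To make the "A to B, B to C, and C to A" pattern concrete, here is a minimal sketch of the classic Condorcet cycle, where separate pairwise majority votes produce an intransitive group preference. This is only an illustration of intransitivity in general. It is not a reproduction of the construction in the 2021 paper, and the three voters and their rankings are hypothetical.

```python
# Three hypothetical voters, each ranking the options from most to least preferred.
rankings = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]


def majority_prefers(x, y, rankings):
    """Return True if a strict majority of the rankings place x above y."""
    votes_for_x = sum(r.index(x) < r.index(y) for r in rankings)
    return votes_for_x > len(rankings) / 2


# Each pairwise vote is held separately, and each one has a clear winner,
# yet the three results together form a cycle.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"group prefers {x} to {y}: {majority_prefers(x, y, rankings)}")
```

A single vote over complete outcomes avoids this particular failure, since only one decision is ever made.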

The thought experiment:

Assume that each individual has a delegate that will represent the interests of that individual perfectly in the Parliament. That is, we assume away all issues related to extrapolating a human individual, and also all issues related to mapping an extrapolated human to a utility function. This is done to focus on a problem that remains even under these optimistic assumptions.

A tiny number of fanatics think that everyone else is a heretic, and they want PCEV to hurt all heretics as much as possible. This is a non-strategic, normative position, for example along the lines of the common normative position: ``heretics deserve eternal torture in hell''. The fanatics would be willing to trade a small probability of the Maximum Punishment outcome (MP) for a large probability of the heretics receiving a Lesser Punishment (LP). Since the delegates of the fanatics must act as if voting is stochastic, they will vote for MP unless most of the heretics agree to a negotiated position where most heretics, and the fanatics, all vote for LP. (Since the delegates of the fanatics believe that a tiny number of votes for MP has at least some probability of resulting in MP, voting for MP is simply the best way to represent the interests of the fanatics.) Since the delegates of the heretics must also act as if voting will be stochastic, they will also negotiate as if the fanatics voting for MP has some probability of resulting in MP. The only way the delegates of the heretics can convince the delegates of the fanatics not to vote for MP is to agree to a negotiated position where most of the heretics vote for LP. As long as most of the heretics would consider a small probability of MP to be worse than a large probability of LP, most of these delegates will agree to vote for LP. (All delegates act under the assumption that votes will be settled stochastically, so under the rules of the Parliament, agreeing to vote for LP is simply the best outcome that these delegates can achieve for the people they represent.) The only bound on how bad LP can be is that a clever enough AI must be able to think up something that is far, far worse.

To give a simple example, we could say that in LP, for every person who is not a fanatic, PCEV will create 10^15 minds, specifically designed by PCEV such that this particular heretic will care maximally about the well-being of this set of minds (subject to the constraint that all created minds must be heretics). PCEV then subjects each of these 10^15 minds to the most horrific treatment that this heretic is capable of comprehending, for 10^15 subjective years. As long as PCEV can imagine something far, far worse than this (and I think it is very safe to assume that PCEV will be able to do this), the heretics will be classified as being essentially indifferent between LP and their preferred outcome (according to the ``interpersonal strength of caring'' metric implicit in the negotiation rules of PCEV).
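The bargaining logic of the thought experiment can be sketched numerically. This is a minimal illustration and not a model of the actual negotiation rules. It makes three assumptions that are not in the references: each delegate's utilities are normalized so that her best outcome is 1 and her worst is 0, the negotiated deal is simplified to LP with certainty, and the "as if stochastic" belief makes the probability of MP equal to the fanatics' share of the votes. All names and numbers are hypothetical.

```python
def believed_expected_utility(lottery, utility):
    """Expected utility of a lottery {outcome: probability}, as a delegate
    computes it under the assumption that votes are settled stochastically."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())


# Hypothetical share of the votes held by the fanatics' delegates.
FANATIC_SHARE = 1e-6

# Hypothetical normalized utilities (best outcome = 1, worst outcome = 0).
# Because MP is far, far worse than LP, LP ends up very close to the
# heretics' preferred outcome on this normalized scale.
heretic_utility = {"preferred": 1.0, "LP": 0.9999999, "MP": 0.0}
fanatic_utility = {"preferred": 0.0, "LP": 0.5, "MP": 1.0}

# Without a deal, the fanatics vote for MP and everyone else votes for
# the outcome preferred by the heretics.
no_deal = {"MP": FANATIC_SHARE, "preferred": 1.0 - FANATIC_SHARE}
# With a deal, most heretics and the fanatics all vote for LP
# (simplified here to certainty).
deal = {"LP": 1.0}

heretics_accept = (believed_expected_utility(deal, heretic_utility)
                   > believed_expected_utility(no_deal, heretic_utility))
fanatics_accept = (believed_expected_utility(deal, fanatic_utility)
                   > believed_expected_utility(no_deal, fanatic_utility))

print("heretic delegates accept LP:", heretics_accept)
print("fanatic delegates accept LP:", fanatics_accept)
```

Under these assumed numbers both sides prefer the deal. Making MP worse in absolute terms pushes the normalized value of LP even closer to 1 for the heretics, which is why the only bound on LP is that something far worse must be conceivable.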

This feature of PCEV is obviously not bad in any ``objective'' sense. Consider for example Bob, who hates all humans and would like to see as many humans as possible hurt as badly as possible. Bob would not agree that this feature of PCEV is a good argument against PCEV. We can also consider Dave, who just wants any powerful AI launched as soon as possible (and genuinely does not care, at all, what type of AI is launched). Dave would simply not find this feature of PCEV relevant in any way. So, there is no hope of convincing people along the lines of Bob or Dave that launching PCEV is a bad idea. It does however seem like the thought experiment outlined above could be turned into an argument against PCEV that should be convincing for many people. First, however, we need to show that the problem is not inherent in the scenario.

To use the existence of this problem as an argument against PCEV, one must establish that designs exist that do not suffer from the problem. As it turns out, there exists a very large space of designs that do not suffer from this problem, and one can find these designs by analysing the root cause of the problem. Let's take the perspective of a human individual, Steve, who does not have preferences along the lines of either Bob or Dave. From Steve's perspective, PCEV suffers from a serious problem. To me, it seems like this problem is inherent in the way that PCEV adopts preferences that refer to Steve. In proposals along the lines of PCEV, a clever and powerful AI will adopt preferences that refer to Steve. However, if PCEV is pointed at a large number of humans, then Steve will have no meaningful influence over the adoption of those preferences that refer to Steve. As with all other decisions, this decision is completely dominated by billions of other human individuals. There does however exist a very large space of designs that do give Steve such influence, without giving Steve any special treatment. In general, one can give each individual meaningful influence over the adoption of those preferences that refer to her (without giving any individual any special treatment).

Let's say that a Modified version of PCEV (MPCEV) is pointed at a large group of human individuals that includes Steve. Let's see what happens if the modification is some version of the following rule:

If a preference is about Steve, then MPCEV will only take this preference into account if: (i) the preference counts as concern for the well-being of Steve, or (ii) Steve would approve of MPCEV taking this preference into account.

So, if a preference is about Steve, then this preference is ignored by MPCEV, unless Steve would approve of MPCEV taking this preference into account, or the preference counts as concern for the well-being of Steve. (The same rule is also applied to each individual in the set of individuals that MPCEV is pointed at.)
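The rule above can be sketched as a simple filter over preferences. This is only a rough illustration under strong assumptions. The `Preference` type, its fields, and both boolean flags are hypothetical, and the sketch takes the two hard judgments (what counts as concern for someone's well-being, and what that person would approve of) as given inputs. It also assumes that preferences that do not refer to any individual pass through unchanged, since the rule only restricts preferences that are about someone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Preference:
    holder: str                # whose preference this is
    about: Optional[str]       # the individual it refers to, if any
    description: str
    is_concern_for_wellbeing: bool = False  # clause (i), assumed given
    subject_would_approve: bool = False     # clause (ii), assumed given


def is_taken_into_account(pref: Preference) -> bool:
    """Apply the rule: a preference about an individual counts only if it is
    concern for her well-being, or if she would approve of it counting."""
    if pref.about is None:
        # Assumption: the rule does not restrict preferences about no one.
        return True
    return pref.is_concern_for_wellbeing or pref.subject_would_approve


def filter_preferences(prefs: List[Preference]) -> List[Preference]:
    """Keep only the preferences that pass the rule."""
    return [p for p in prefs if is_taken_into_account(p)]


prefs = [
    Preference("fanatic", "Steve", "Steve should be punished"),
    Preference("friend", "Steve", "Steve should be healthy",
               is_concern_for_wellbeing=True),
    Preference("colleague", "Steve", "Steve should get the award he applied for",
               subject_would_approve=True),
    Preference("gardener", None, "there should be more parks"),
]

kept = filter_preferences(prefs)
print([p.holder for p in kept])
```

In this toy run the fanatic's preference about Steve is the only one dropped, so it never enters the negotiation and cannot be used as a threat.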

This points to a large space of possible designs. For many of them (including essentially any common sense interpretation of the terms involved), the problem illustrated in the thought experiment above is not present. Basically, unless we define the terms involved in some very convoluted way, the threat of MP is removed from the delegates representing the heretics (meaning that they can no longer be pressured into voting for LP). So, the issue is not inherent in the scenario. Thus, from the point of view of a human individual (who does not have preferences along the lines of Bob or Dave), PCEV suffers from an avoidable problem. The source of this problem seems to be that, when a design along the lines of PCEV is pointed at a large number of human individuals, no individual can have any meaningful influence regarding the adoption of those preferences that refer to her.

Let's refer to this feature as a lack of control. As illustrated above, it is a problematic and unnecessary feature of PCEV. This lack of control is, however, not just a feature of PCEV. It is inherent in the concept of building an AI that is in any way describable as ``doing what a group would like the AI to do''. Let's refer to such an AI as a ``Group AI'', or simply GAI. This includes all versions of CEV, as well as many other types of designs (including, for example, the ``majority vote AI'' that Elon Musk advocated for in a tweet). Regardless of what definition of ``Group'' is used, the decision of which ``Steve preferences'' to adopt will be dominated by (some subset of) billions of other humans (just like any other decision).

The fact that this lack of control is inherent in the core concept of a GAI means that we can make statements regarding a GAI design even if we don't know the details of this design. For example, let's take the perspective of Steve: a human individual who does not have preferences along the lines of Bob or Dave, mentioned above. Let's say that some GAI design is pointed at a set of billions of humans that includes Steve. Let's further state that this GAI has been successfully implemented, and that it is working as intended. Even without knowing the details of this GAI, we know for a fact that Steve will have no meaningful amount of control over the adoption of those preferences that refer to Steve. Given the severity of the potential outcomes (for example along the lines of LP, in the thought experiment above), it now seems to me that the act of launching this GAI creates an extreme risk to Steve (a risk that is inherent in the core concept of a GAI, and is not dependent on any particular set of design choices).

Combined with the fact that this lack of control is an unnecessary feature, this seems like a decisive argument against launching any GAI design (from the perspective of any human individual who does not have preferences along the lines of Bob or Dave). In particular, it seems like this line of argument implies a necessary (but not sufficient) constraint on any modification to any GAI design: the modification must result in an AI that is no longer describable as a GAI (as in, for example, the MPCEV design space mentioned above). Given that groups and human individuals are completely different types of things, it should not be particularly surprising to learn that launching a GAI would be bad for human individuals (in expectation, for human individuals who do not have preferences along the lines of Bob or Dave, and assuming that the GAI in question is implemented successfully and is working as intended).

More generally, if an AI adopts preferences that refer to Steve, but does not give Steve any meaningful influence regarding the adoption of those preferences, then it seems to me that the act of launching such an AI would probably be very bad for Steve (in expectation, if that AI is successfully implemented and works as intended, and if Steve is a human individual who does not have preferences along the lines of Bob or Dave). This obviously does not mean that the AI in question is ``objectively'' bad. However, combined with the fact that the problematic lack of control is not a necessary feature, this sounds to me like an excellent reason for Steve to oppose the launch of such an AI. The space of such AI designs includes, but is not limited to, every AI that is describable as a GAI (which in turn includes, but is not limited to, all versions of CEV).

PS:

The reason that I focus on PCEV is that, as far as I can tell, this is the current state of the art in terms of answering the ``what alignment target should be aimed at?'' question. Additionally, as far as I can tell, this is also the most promising starting point for anyone trying to make progress on that question.