Currently, there exists no viable proposed answer to the ``what alignment target should be aimed at?'' question. There also exists no reliable way of distinguishing a good alignment target from an alignment target such that successfully hitting it would result in an outcome far, far worse than extinction (see my previous post: A problem with the most recently published version of CEV). This means that making progress on the ``what alignment target should be aimed at?'' question is urgent by default. There also exists no reliable way of preventing a scenario where someone successfully hits an alignment target without properly analysing it, for example because the alignment target in question seems like ``the obviously correct thing to aim at'', and because no one finds any problems with it in time. There exist many different proposals for how to buy time. None of those proposals are guaranteed to work. There is also no particular reason to believe that there would be enough time, even if we were to reason from the assumption that at least one such proposal will be successful. See my previous post Making progress on the ``what alignment target should be aimed at?'' question, is urgent, for a longer discussion of proposals that are designed to buy time (and for an explanation of why such proposals do not remove the urgency of making progress on the ``what alignment target should be aimed at?'' question).

In other words: many plausible paths exist that eventually end in an alignment target being successfully hit, resulting in an outcome that is far, far worse than extinction. No existing proposal for how to reduce AI related risks removes this risk (and it seems likely to me that some versions of some of these proposals actively increase the risk that is the focus of the present text). The probability of many of these paths can be reduced by progressing far enough on the ``what alignment target should be aimed at?'' question before it is too late. There is no way of knowing how much time there will be to work on this question. There is also no way of knowing how long it will take to progress to the needed point. Thus, the question is urgent by default.

The present post is responding to a class of arguments against urgency that refer to ideas along the lines of the ``last judge off switch'' proposal. The general idea can be described as giving an Extrapolated Person an Off switch, so let's write EPO for short (to emphasise the fact that, if the proposal is successfully implemented, the off switch is actually handed to whatever mind happens to be implied by the specific, human defined, extrapolation procedure that has been chosen). Just as my previous post did not argue against the idea of a PHAI, the present text is not arguing against the idea of an EPO. It is instead arguing that no existing EPO proposal removes the urgency of making progress on the ``what alignment target should be aimed at?'' question (just as no existing PHAI proposal removes this urgency).

One simple way of making the argument of the present text is to note that any given implemented EPO might fail to trigger the off switch due to an implementation failure. Thus, failure to find, in time, an unexamined implicit assumption that the alignment target being aimed at is built on top of, might lead to an outcome that is far, far worse than extinction. If the launch only happened because the design team was confident in their EPO, then the EPO idea was directly responsible for this outcome.

Another obvious scenario is that an EPO might be removed by reckless people who have an answer to the ``what alignment target should be aimed at?'' question that no one has been able to find any flaws with, and who think that their answer is ``the objectively correct thing to do''. (One relevant fact here is that not all plausible paths lead to situations where critical decisions are in the hands of the people who figured out how to hit alignment targets.)

Yet another scenario is that sensible people, taking a well calibrated risk as a response to a genuine danger, launch an AI with a disabled EPO, after that EPO keeps triggering the off switch (for example due to a reasonable, but inaccurate, estimate that the EPO is probably triggering the off switch due to an implementation failure). These sensible people might, for example, have legitimate fears that a rival team will successfully launch an AI that is aiming for an alignment target that has been shown to be far, far worse than extinction. This could make sensible people take the calculated risk of forcing a launch, aiming at an alignment target that no one has found a problem with so far. (Avoiding this type of race dynamics sounds like a very good idea. However, there is no particular reason to think that an attempt to avoid this type of race dynamics will actually succeed. Thus, it is simply not possible to base plans on the assumption that this type of race dynamics will be successfully avoided.) If progress on the ``what alignment target should be aimed at?'' question advances to the point where it is possible to see that all available alignment targets are in fact bad, then the sensible design team mentioned above might not have any good options. But such a finding would at least give them a chance to try to find some solution that could actually work (for example opting for some alternative strategy for preventing the launch of the rival AI, even if that strategy is less likely to work, or comes with high costs).

Each of these three scenarios is individually sufficient for showing that the EPO idea does not remove the urgency of making progress on the ``what alignment target should be aimed at?'' question. The present post will however take a step further, and focus on a different issue. Basically, the present post will establish that even a successfully implemented EPO might result in a clever and powerful AI using enormous resources to hurt the ``last judge'' as much as possible (in other words: in this thought experiment, the result of an extrapolation procedure will approve of the launch of an AI that wants to hurt the person that the EPO extrapolation process was originally pointed at). This explanation is longer, but I think the conclusion will be seen as more decisive by a larger set of people. In other words: the conclusion is harder to establish. But if it is accepted, it seems likely to me that it will impact the behaviour of people whose behaviour would not be impacted by the types of examples mentioned above. One technical point that is important to understand fully is that there exists a very, very large number of mappings that map human individuals to other minds. None of these mappings is ``objectively correct''. This means that there is simply no way for a clever AI to find ``the objectively correct mapping'', no matter how smart this AI is.

Let's look closer at one specific failure mode for an EPO (while keeping in mind that it is one specific issue amongst a large number of possible failure modes). Specifically: we will consider a scenario where an unexamined implicit assumption regarding what type of thing a human mind is turns out to be wrong, in a way that makes the off switch not trigger. Let's say that Steve will be extrapolated, and that the result of this extrapolation will be given a veto over the launch of an AI. The extrapolation procedure is built on top of an unexamined implicit assumption regarding what type of thing a human mind is. This assumption happens to be wrong in an important way. The attempt to define ``Steve'' (the person that the extrapolation is supposed to be pointed at) does however refer to some Thing. In other words: due to the incorrect assumption, an attempt to point the extrapolation procedure at Steve fails. It does however not fail completely, and the extrapolation procedure is pointed at some Thing (for example: the details of this Thing are defined by the details of Steve's neurones. But this Thing is importantly different from Steve). Let's refer to the result of this extrapolation as an Extrapolated Thing (ET). ET has successfully been made smart enough to understand everything that is going on, and is completely aware of what has happened. ET is fully aware of the fact that the original Thing that was extrapolated was not human. And this is very relevant information for ET. In fact, this makes ET very uncomfortable with the idea of interfering with any human initiated process. Thus, ET will approve of any extrapolation procedure, and also approve of any alignment target (since both were initiated by actual humans). ET is fully aware of the misunderstanding that led to this alignment target being selected. And ET is fully aware of the misunderstanding that led to this extrapolation procedure being selected. (Any conceivable extrapolation process, or alignment target, designed by unextrapolated humans will be based on some set of misunderstandings. So this event is not in itself surprising, or particularly informative. The event where a given extrapolation procedure, or alignment target, is approved of despite the fact that it is based on a misunderstanding can thus not be seen as a red flag.) The fact that ET was not based on a human is however making ET deeply uncomfortable with any scenario where ET is vetoing a human initiated process. And this happens to outweigh the discomfort that the alignment target creates. Steve does not feel this way. But ET feels this way.

In this scenario, Steve does not understand why the alignment target is bad (if he did, he would not have participated in an AI project that is aiming for this alignment target). Steve also does not understand what the problem with the EPO extrapolation procedure is (if he did, he would not have participated in an AI project that is using this extrapolation procedure for the EPO). So, who exactly would trigger the off switch? Steve will not (even if we assume that Steve is also given a veto). And ET will not.

The AI will also not see this situation as problematic. Let's say that the alignment target being aimed at was PCEV, and that PCEV would decide to hand control to a tiny group of hateful fanatics (for a discussion of a scenario where PCEV hands power to a tiny group of hateful fanatics, see my previous post: A problem with the most recently published version of CEV). The set of definitions that led to this situation (power to fanatics, an EPO extrapolation procedure that results in ET, who in turn decides to let this happen, etc.) is what defines the goal of the AI. For the same reason that no AI can be counted on to realise that it has been given ``the wrong goal'', it is simply not possible to rely on this AI deciding that ``I have been given the wrong goal''. The AI will understand exactly what is going on. But the AI has no reason whatsoever to see this situation as a problem.

Leaving this specific type of unexamined implicit assumption, and speaking more generally: there is simply no particular reason to feel confident that the result of a (human defined) extrapolation procedure would protect Steve. This is true even if the extrapolation procedure started with Steve. It is even more true for Bill, who is just one person amongst billions, and who is not being given any special treatment. It is certainly possible that a given EPO will offer meaningful protection for Bill. The EPO could for example protect Bill from an implementation failure. It could also be that the output of an extrapolation procedure that was pointed at Steve will agree with Bill about the fact that the alignment target being aimed at is bad. So, an EPO might save Bill from both implementation failures and bad alignment targets. It is however also possible that an EPO will prevent implementation failures, allowing some set of designers to eventually hit an alignment target that would be far, far worse than extinction (especially if the EPO is combined with a PHAI, and is implemented by people who do not approach the ``what alignment target should be aimed at?'' question with sufficient caution).

Perhaps it is possible to build a convincing argument that an EPO would, in some sense, be net positive for Bill (for example based on the fact that certain types of implementation failures could also be far worse than extinction). If the alignment target in question gives Bill meaningful influence over the adoption of those preferences that refer to Bill, an EPO might be a good idea from Bill's perspective (see my previous post A problem with the most recently published version of CEV for a discussion of the importance of this type of influence). (The idea that the addition of an EPO would be a net positive for Bill, when the design team in question is aiming for an alignment target that would give Bill such influence, is something that sounds (to me) like it might be true. Whether or not it is, in fact, true in some specific class of scenarios is obviously not a settled question.)

If, on the other hand, the alignment target will result in an AI that adopts preferences that refer to Bill, but Bill will have no meaningful control over the adoption of those preferences that refer to Bill (and no privileged connection to the EPO), then it is difficult to see how the addition of the EPO could possibly be a net positive for Bill (even if we assume perfect implementation of the EPO, and even if we acknowledge the existence of scenarios where such an EPO would be very good for Bill, for example due to implementation failures of the AI). It would seem that the most important danger, from Bill's perspective, would be that the EPO might help designers hit an alignment target that is extremely bad for Bill. By protecting against implementation failure, and protecting against cases where ``the wrong tiny group of fanatics'' is given control over the AI, the EPO could enable the creation of a clever AI that has preferences that refer to Bill. But Bill has no meaningful influence regarding the adoption of these preferences that refer to Bill. Thus, Bill has no reason whatsoever to think that this AI will want to help him, as opposed to wanting to hurt him. It is difficult to see what possible positive effects could ever be of comparable magnitude to this risk.
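(As a purely illustrative toy sketch of the structure of this argument: the numbers and the three-outcome model below are made up, and are not meant as estimates of anything. The only point is that shifting probability mass away from ``implementation failure'' and towards ``some alignment target is successfully hit'' can easily be net negative for Bill, if the target that gets hit is far worse for Bill than the failure mode.)

```python
# Toy model, purely illustrative: three coarse outcomes from Bill's perspective.
# Utilities are arbitrary placeholder numbers, chosen only to encode the ordering
# "bad target successfully hit" << "implementation failure" << "good outcome".
UTILITY = {
    "good_outcome": 1.0,
    "implementation_failure": -1.0,    # e.g. an extinction-level failure
    "bad_target_hit": -100.0,          # far, far worse than extinction (for Bill)
}

def expected_utility(p_failure, p_bad_target, p_good):
    assert abs(p_failure + p_bad_target + p_good - 1.0) < 1e-9
    return (p_failure * UTILITY["implementation_failure"]
            + p_bad_target * UTILITY["bad_target_hit"]
            + p_good * UTILITY["good_outcome"])

# Without an EPO: launches often fail outright, and rarely hit the (unvetted) target.
without_epo = expected_utility(p_failure=0.6, p_bad_target=0.1, p_good=0.3)

# With a (well implemented) EPO: implementation failures are mostly caught,
# so more launches end with some target actually being hit.
with_epo = expected_utility(p_failure=0.1, p_bad_target=0.4, p_good=0.5)

print(f"Expected utility for Bill without EPO: {without_epo:.1f}")
print(f"Expected utility for Bill with EPO:    {with_epo:.1f}")
# With these made up numbers the EPO is strongly net negative for Bill,
# despite reducing the probability of implementation failure.
```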

It is worth going into a bit more detail regarding the idea that an AI might notice that a given extrapolation procedure ``is wrong''. The short answer is that it is not possible to rely on an AI to notice that ``the wrong extrapolation procedure'' has been used for an EPO, for the same reason that it is not possible to rely on an AI to notice that it has been given ``the wrong goal''.

The specific details of the extrapolation procedure determine the goal of the AI. Changing one detail could completely change the goal (a slight difference in willingness to compromise, in the delegates representing a small number of people, could, for example, result in a different tiny subgroup being in charge of PCEV (meaning that a small change in a single detail could lead to an arbitrarily large change in outcome. This is covered more below)). Thus, suggesting that it might be possible to rely on an AI to notice that a given extrapolation is ``wrong'' is equivalent to suggesting that it might be possible to rely on an AI to notice that it has been given the ``wrong goal''. In other words: it is nonsense. The definition of an extrapolation procedure is arbitrary. Any such procedure will necessarily be built on top of a large number of constraints. The choice of which set of constraints to use is fundamentally arbitrary (in the sense that different minds, with different goals, will prefer different sets of such constraints. There is no level of cleverness above which all minds will agree on the objectively correct set of constraints (just as there is no level of cleverness above which all minds will agree that some specific goal is the objectively correct goal)). Let's illustrate a deeper, and more foundational, problem by examining one specific constraint that one might choose to add to an extrapolation procedure: the constraint that the result of an extrapolation procedure must approve of the extrapolation procedure.

Consider Dave, a potential outcome of an extrapolation procedure. Dave will approve of an extrapolation procedure being applied to a given mind iff the outcome of that procedure is Dave (for example because Dave is outcome oriented, and because Dave's definition of ``correct'' contains references to Dave). Minds along the lines of Dave are very strongly favoured by the constraint in question. In other words: the way that the constraint in question acts within a larger architecture is to dramatically steer influence towards minds along the lines of Dave. It is entirely reasonable for a mind to categorically reject any extrapolation procedure that strongly favours minds along the lines of Dave. Let's now examine the effect on the outcome of an extrapolation of the human individual Steve, when we add this constraint to an alternative extrapolation process that did not previously contain it. It would not be particularly shocking to learn that any reasonable way of extrapolating Steve will result in a mind that categorically rejects any extrapolation procedure that strongly favours minds along the lines of Dave. The addition of the constraint in question can thus result in Steve's interests being represented by a mind along the lines of Dave, specifically because Steve does not like minds along the lines of Dave. (One obvious alternative outcome is that Steve is extrapolated in such a way that the resulting mind does not understand the concept of extrapolation at a level where Dave related issues are understood.) The idea that one might rely on the approval of unextrapolated minds of a given procedure is silly (it seems likely that there does not exist any human that understands the concept at a level that would allow informed choices. There is no particular reason to think that it is even possible in principle for a human to understand the concept at a level such that acceptance of a specific procedure would mean much). So, who exactly would object when Steve is represented by Dave? The extrapolation procedure is defining the goal of the AI, so the AI will not object. Steve would approve of the procedure, because he does not understand the problem. And Dave is actively happy with the situation. (Asking what an idealised / informed / extrapolated / modified / etc. version of Steve would say is circular.)
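(The selection effect of this constraint can be sketched as a toy filter over candidate extrapolation outcomes. Everything below, including the candidate names, the approval rules, and the idea of representing an extrapolation as a short list of candidates, is a hypothetical simplification introduced here purely to show the structure: a ``must approve of the procedure that produced you'' filter mechanically favours self-endorsing minds along the lines of Dave, and removes minds that reject procedures favouring Dave.)

```python
# Toy sketch of how a "the output must approve of the procedure" constraint
# filters candidate extrapolation outcomes. All names and rules are hypothetical.

def dave_approves(procedure, produced):
    # Dave approves of a procedure iff that procedure produced Dave.
    return produced == "Dave"

def steve_plus_approves(procedure, produced):
    # A Steve-like outcome that categorically rejects any procedure whose
    # approval constraint strongly favours Dave-like (self-endorsing) minds.
    return "self_approval_constraint" not in procedure["constraints"]

candidates = {
    "Dave": dave_approves,
    "Steve_plus": steve_plus_approves,
}

def admissible_outcomes(procedure):
    """Return the candidate outcomes that the procedure's own constraint admits."""
    if "self_approval_constraint" not in procedure["constraints"]:
        return list(candidates)  # no filter: both outcomes remain possible
    # With the constraint: keep only candidates who approve of this procedure,
    # given that the procedure produced them.
    return [name for name, approves in candidates.items()
            if approves(procedure, produced=name)]

without_constraint = {"constraints": []}
with_constraint = {"constraints": ["self_approval_constraint"]}

print(admissible_outcomes(without_constraint))  # ['Dave', 'Steve_plus']
print(admissible_outcomes(with_constraint))     # ['Dave']  (Steve_plus is filtered out)
```

Nothing in this filter looks like a bug from the inside: it does exactly what the constraint specifies, and it still ends up handing Steve's veto to a mind that Steve would reject.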

Dave related issues are just one specific example of a deeper problem connected to the constraint in question (and this particular constraint is, in turn, just one example of the types of constraints that one would have to pre specify. For example: the ET issue mentioned above is not caused by this particular constraint). Let's examine one more, highly specific, example that is also related to the constraint that the outcome of an extrapolation procedure must approve of the procedure. This issue arises from the fact that the constraint in question strongly selects against minds that understand a problem that is shared by the extrapolation procedure of the EPO and the extrapolation procedure that is used to define the alignment target. Even if the designers are not using the same extrapolation procedure for the EPO as for specifying the alignment target, it could still be that both extrapolation procedures are built on top of the same, problematic, unexamined implicit assumption (this problematic assumption might, in turn, be completely unrelated to the constraint in question). Now, consider an alternative EPO extrapolation procedure that does not contain the constraint in question. The result of this alternative EPO extrapolation procedure might trigger the off switch, due to being able to understand an underlying problem that also affects the goal definition (because the goal definition is built on top of an extrapolation procedure that is in turn built on top of the same unexamined implicit assumption that causes the problems with the EPO extrapolation procedure, which in turn makes the outcome of the EPO extrapolation procedure disapprove of the EPO extrapolation procedure). Adding the constraint in question means that the outcome of the resulting EPO extrapolation procedure might decide not to shut things down, because the EPO extrapolation procedure will now actively select for minds that do not understand the problem with the goal definition extrapolation procedure (because the same problem is also present in the EPO extrapolation procedure).

One additional technical issue that is relevant in this scenario is that there exists no ``objectively correct extrapolation distance''. There is no red flag associated with ``insufficient extrapolation distance''. Any human defined extrapolation procedure will be built on top of multiple, inaccurate, unexamined implicit assumptions. Many risks associated with such inaccurate assumptions will become more severe at larger extrapolation distances. So, it is not reasonable to simply assume that something is wrong due to ``insufficient extrapolation distance''. Extrapolating a mind until a specific problem is understood might lead to a general change in overall perspective, resulting in a mind that is incomprehensible to the original mind. In general: if a specific insight would lead to a mind that takes positions that are incomprehensible to the original mind, then choosing to avoid this insight can thus not be treated as a red flag. The constraint in question introduces a bias, favouring minds that are incapable of understanding the specific types of problems that are associated with the unexamined implicit assumptions that extrapolation procedures are built on top of (and those unexamined implicit assumptions are exactly the types of assumptions that the outcome of the extrapolation procedure should be able to evaluate (since the goal definition is also built on top of them)).

(Again: this is, solely, an argument against relying on an EPO for safety. It is, of course, not an argument against the idea of an EPO, or against the particular constraint in question (if one wanted to evaluate this specific constraint, then one would have to (as a first step on a very long journey) try to find problematic scenarios involving minds that actively oppose the extrapolation procedure that was used to generate them. Such an analysis would however be very far outside the scope of the present text). One of the factors that might make the presence of an EPO a net negative is designers thinking that it offers full safety. This is not the slightest bit unusual. Any idea can be dangerous if it results in a false sense of security. An analogy would be that a seatbelt would be extremely dangerous if a driver thinks that a seatbelt offers complete protection against all possible negative effects of crashes. Such a driver would be much better off if someone removed the seatbelt. A sane designer might however be better off with an EPO, in the same way that a sane driver might be better off with a seatbelt. A separate issue, that can also make the addition of an EPO a net negative, can be illustrated with a different car analogy. Brakes might be a net negative if they allow a driver to prevent car crashes, thus allowing the driver to navigate to a location where something happens that is far worse than a car crash. This scenario can, of course, not be used as a general argument against brakes. Many car related plans, especially those involving sane drivers, are, of course, improved by the presence of brakes.)

It might make sense to illustrate the sensitivity of the outcome to small modifications in the details of an extrapolation procedure, with a thought experiment. Let's modify the scenario in my previous post: A problem with the most recently published version of CEV. Let's relax the assumption of ``perfect delegates'', perfectly representing the interests of individuals. Let's also add another, slightly larger, group of fanatics: F2. The fanatics in F2 are deeply conflicted regarding the permissibility of negotiating with heretics. Different holy laws point in different directions on the issue of the permissibility of negotiation. That is: one holy law states that any world that is the result of negotiations with heretics would be an abomination. Another holy law states that everything must be done to ensure that your children are brought up in the right religion (and in the situation that the delegates find themselves in, negotiating with heretics happens to be the only way of arriving at non tiny probabilities of this happening). Any world where the upbringing of their children was forfeited, or even gambled with, due to an unwillingness to negotiate, would be completely unacceptable. If the extrapolation leads to delegates representing F2 that are unwilling to negotiate (in other words, delegates that will vote for MP no matter what), then there is no change in outcome (since F2, in this scenario, will not be the largest voting bloc). If, however, extrapolation leads to F2 delegates that are willing to negotiate, then F2 will end up in charge. In other words: the AI will either want to organise the universe in a way that is demanded by the religion of the original fanatics, or by the religion of F2 (in addition to, in both cases, using some resources to subject all heretics to a negotiated form of ``limited punishment''). The difference between these two religions could be arbitrarily large. Basically, the fate of everything hinges on what the details of the chosen extrapolation procedure imply regarding the willingness to negotiate, when interpreting the conflicting religious commandments of the religion of the tiny number of fanatics in F2 (in a situation where they are presented with very dramatic new information that completely demolishes their entire world model).
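(As a purely illustrative sketch of this sensitivity, using made up group sizes and a deliberately crude stand-in for the negotiation dynamic, the toy function below takes a single detail of the hypothetical extrapolation procedure as input: whether the extrapolated F2 delegates end up willing to negotiate. Flipping that one bit changes which religion the resulting AI organises the universe around.)

```python
# Toy model of the thought experiment above. Group sizes and the negotiation
# logic are made up for illustration; the only point is that one detail of the
# extrapolation procedure flips the entire outcome.

def pcev_style_outcome(f2_willing_to_negotiate: bool) -> str:
    # Hypothetical bloc sizes (arbitrary placeholders).
    bloc_sizes = {
        "F1": 1_000,            # original tiny group of fanatics
        "F2": 1_500,            # slightly larger group of fanatics
        "everyone_else": 8_000_000_000,
    }

    # Crude stand-in for the negotiation dynamic: among the groups that act as
    # unified, negotiating voting blocs, the largest one ends up effectively
    # in charge. "Everyone else" is assumed (as in the thought experiment)
    # not to act as a single unified bloc, so it is not included below.
    negotiating_blocs = {"F1": bloc_sizes["F1"]}
    if f2_willing_to_negotiate:
        negotiating_blocs["F2"] = bloc_sizes["F2"]

    winner = max(negotiating_blocs, key=negotiating_blocs.get)
    return f"universe organised according to the religion of {winner}"

print(pcev_style_outcome(f2_willing_to_negotiate=False))  # ... religion of F1
print(pcev_style_outcome(f2_willing_to_negotiate=True))   # ... religion of F2
```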

It might finally make sense to address the similar structure of the argument in the present post and the argument in my previous post: Making progress on the ``what alignment target should be aimed at?'' question, is urgent. In both cases, the basic argument is that there is simply no reliable way of preventing an outcome that is far, far worse than extinction, as a result of an alignment target getting successfully implemented because that target sounds like ``the obviously correct thing to aim at'', and because no one finds any problem with it in time. Making progress on the ``what alignment target should be aimed at?'' question increases the probability that someone will see the flaw in time, and is thus one way of reducing the probability of this scenario. There is no way of knowing how long it will take to progress to the needed level of understanding. Thus, progress will remain urgent. The only way of removing this urgency would be to present a plan that will reliably prevent this scenario. Such a proposal would have to be, among many other things, certain to be successfully implemented. Additionally, successful implementation would have to lead to reliable prevention of the case where a bad alignment target is hit. The reliable prevention of all scenarios where a bad alignment target is hit would have to be achieved at a stage when no one has any idea of how to reliably tell a good alignment target from an alignment target that would be far, far worse than extinction (if the proposal does not have this feature, then the proposal implicitly assumes that sufficient progress on the ``what alignment target should be aimed at?'' question will be achieved in time). Both of my posts that are arguing for urgency are of the form: ``this specific proposal does not satisfy those criteria''. It seems obvious to me that reliably ensuring that a good alignment target will be hit, at the current level of understanding of the ``what alignment target should be aimed at?'' question, is simply not possible. Each post dealing with a specific proposal follows a simple formula, based on the underlying logic outlined in the present paragraph.

 

PS:

It might make sense to explicitly state something, even though some readers will find it obvious. The proposal of ``let's try to make progress on the `what alignment target should be aimed at?' question'' is obviously not a reliable way of avoiding the types of horrific outcomes that feature in my various thought experiments. A serious attempt to make progress would reduce the probability of some paths leading to such outcomes. But there is no guarantee that such investigations will have any impact at all. Even if we were to reason from the assumption that the first attempt to hit an alignment target will succeed, there is still no guarantee that the proposed investigative effort would have any impact whatsoever. Even a serious attempt to make progress might fail to make sufficient progress before it is too late (even a large number of very clever people, fully focused on this issue for an extended period of time, might simply fail to make sufficient progress). Alternatively, even if the attempt does produce sufficient insights, it might not be possible to communicate these insights in time to the people deciding what to aim at. There is, for example, no particular reason to think that a problem with some specific unexamined implicit assumption can be illustrated with a simple thought experiment. It is also true that in some scenarios (that can not be reliably avoided), the people deciding what to aim at are not the people that figured out how to hit alignment targets. (This lack of a guarantee of success also means that it would be trivial to construct the various corresponding counter-counterarguments. One could, for example, easily show that the proposal to try to make progress on the ``what alignment target should be aimed at?'' question does not constitute a viable argument against an EPO, or against a PHAI, or against any other proposed way of reducing the probability of horrific outcomes. Any given, specific, version of an EPO proposal, or a PHAI proposal, or some other class of proposals, might of course be pointless, or might have a strongly negative expected impact. But if so, then this should be demonstrated in a way that does not rely on the proposal to try to make progress on the ``what alignment target should be aimed at?'' question. There is nothing particularly unusual, or complex, about this general situation, when viewed at this level of abstraction. It is similar to seatbelts and brakes in a car (or an estimate of probable landmine locations). The proposal to attempt to add one of these features to a car that currently lacks them (or the proposal to try to build a map of likely landmine locations) does not remove the usefulness of trying to add the other feature (both because success is not guaranteed, and because even complete engineering success in adding one of these features does not imply full safety. A flawlessly implemented seatbelt, that is guaranteed to work as intended, does not remove the usefulness of brakes, or the usefulness of a map of probable landmine locations. There are connections: brakes might for example make it possible to navigate to a location that has lots of landmines. But it simply does not make any sense to treat any of these features as a substitute for any other feature. Which is the generalised version of the argument that the present post is trying to make.))
