Different views of alignment have different consequences for imperfect methods

Stuart_Armstrong

Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI’s optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might be sufficient. But as the AI’s power rises, the alignment methods must be better and better. Alignment is thus a dam that has to be tall enough and sturdy enough. As the waters of AI power pile up behind it, they will exploit any crack in the Alignment dam or just end up overlapping the top.

So assume is an Alignment method that works in environment E (where “environment” includes the physical setup, the AI’s world-model and knowledge, and the AI’s capabilities in general). Then we expect that there is an environment $E^{'}$ where $A$ fails – when the AI is so capable that it is no longer constrained by $A$ .

Now, maybe there is a better alignment method $A^{'}$ that would work in $E^{'}$ , but there’s likely another environment $E^{''}$ where $A^{'}$ fails. So unless $A$ is “almost perfect”, there will always be some environment when it will fail.

So we need A to be “almost perfect”. Furthermore, in the conventional view, you can’t get there by combining imperfect methods – “belt and braces” doesn’t work. So if $U$ is a utility function that partially defines human flourishing but is missing some key elements, and if $B$ is a box that contains the AI so that it can only communicate with humans via text interfaces, then $U + B$ is not much more of constraint than $U$ and $B$ individually. Most AIs that are smart enough to exploit $U$ and break $B$ , can get around $U + B$ .

A consequence of that perspective is that imperfect methods are of little use for alignment. Since you can’t get “almost perfect” by adding up imperfect methods, there’s no point in developing imperfect methods.

Concept extrapolation/model splintering^[1] has a different dynamic^[2]. Here the key idea is to ensure that the alignment method extends safely across an environment change. So if the AI starts with alignment method $A$ in environment $E$ , and then moves to environment $E^{'}$ , we must ensure that $A$ transitions to $A^{'}$ , where $A^{'}$ ensures that the AI doesn’t misbehave in environment $E^{'}$ . Thus the key is some `meta alignment’ method $M A$ that manages^[3] the transition to the new environment^[4].

An important difference with the standard alignment picture is that imperfect methods are not necessarily useless. If method $M A_{1}$ helps manage the transition between $E$ and $E^{'}$ , it may also work for transitioning between $E^{1, 000, 000}$ and $E^{1, 000, 001}$ . And “belts and braces” may work: even if neither $M A_{1}$ nor $M A_{2}$ work between $E^{1, 000, 000}$ and $E^{1, 000, 001}$ , maybe $M A_{1} + M A_{2}$ can work.

As an example of the first phenomena, Bayes’ theorem can be demonstrated in simple diagrams, but continues to be true in very subtle and complicated situations. The case of $M A_{1} + M A_{2}$ can be seen by considering them as subsets of human moral principles; then there are some situations where $M A_{1}$ is enough to get an answer (“a good thing for two beings is worth twice as much as a good thing for only one being”), some where $M A_{2}$ is enough (“the happiness of a human is worth 10 times that of a dog”), and some where both sets are needed (“compare the happiness of two humans versus 19 times that happiness for a dog”).

So the requirements for successful contribution to concept extrapolation approaches are lower than for more classical alignment methods: less perfection may be required.

In this post, I won’t argue that concept extrapolation is necessary or sufficient or even useful; I’m just noting the different dynamics that it has. ↩︎
Note that alignment methods which use “moral learning” have some similarities with concept extrapolation, so this is not a strict dichotomy. ↩︎
In my research, $A^{'}$ (and $M A$ ) will both generally be goals or utility functions, but they need not be for the present argument. ↩︎
An important note: it’s clear that if there is a “change of environment detector” in the AI, then this detector will become a hacking target of the AI. So MA must be integrated smoothly into the AI’s learning process, no something that it calls upon in specific situations. ↩︎