You want to align an AGI. You just started your first programming course in the Computer Science faculty and you have a bright idea: let's decompose the problem into simpler subproblems!

What you want is to reach a world state that is not a dystopia. Let D be a predicate on world states: D(S) is true if state S is dystopian and false otherwise.

You would like to implement the following algorithm:

You start at a random state S.
If D(S): 
    Change S a little in a non-dystopian direction.
Else:
    You aligned AI!

Of course, everything here is vaguely defined. Still, I wonder whether it makes sense to pose my questions before doing the defining work. You might interpret them under the most reasonable (?) formalization of the concepts above, if such a formalization is possible at all and is not already alignment-complete.
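To make the shape of the loop concrete, here is a minimal sketch under heavy assumptions. The names `search_non_dystopia`, `toy_D` and `toy_step` are hypothetical, world states are stood in for by plain numbers, and the sketch assumes the "if D(S)" check is meant to be repeated until it fails. It says nothing about how a real D or a real "non-dystopian direction" could be specified.

```python
from typing import Callable, Tuple

State = float  # toy stand-in for a world state (an assumption, not a proposal)


def search_non_dystopia(
    s: State,
    is_dystopia: Callable[[State], bool],
    step: Callable[[State], State],
    max_steps: int = 10_000,
) -> Tuple[State, bool]:
    """Repeat 'if D(S): change S a little' until D(S) is false or the budget runs out."""
    for _ in range(max_steps):
        if not is_dystopia(s):
            return s, True  # the 'Else' branch of the pseudo-algorithm
        s = step(s)  # 'change S a little in a non-dystopian direction'
    return s, not is_dystopia(s)


# Toy assumptions: a state is dystopian while it is below 10,
# and the 'non-dystopian direction' is simply upward.
def toy_D(s: State) -> bool:
    return s < 10.0


def toy_step(s: State) -> State:
    return s + 0.5


final_state, ok = search_non_dystopia(0.0, toy_D, toy_step)
```

In this toy setting all of the difficulty is hidden inside `toy_D` and `toy_step`, which is exactly what the questions below ask about.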

So, we have multiple questions:

1. Is specifying anything of the above (e.g., D) alignment-complete?

2. Is evaluating D(S) alignment-complete?

3. Is changing S a little in a non-dystopian direction alignment-complete?

4. How would you actually use that pseudo-algorithm though?

Now, there's a concept slightly different from alignment-completeness (I'd guess): "constructing a state U such that D(U) = false". As far as I know, this hasn't been done yet, and every proposal so far tends to fail horribly.

So the questions above are worth repeating in this form:

1. Is specifying anything of the above (e.g., D) easier than constructing U?

2. Is evaluating D(S) easier than constructing U?

3. Is changing S a little in a non-dystopian direction easier than constructing U?

4. How would you actually use that pseudo-algorithm though?
 

3 comments
    Change S a little in a non-dystopian direction.

This step is interesting. Most actions change something random and unimportant. S is changing through all sorts of processes outside the AI's control.

It's not really a binary D or U; it's a multidimensional space, and humans classify each state by how much D and U they consider it to have. The problem is that this classification may be neither monotonic nor smooth, so gradient descent ("this is slightly closer to U, let's do that and repeat") doesn't work.
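A toy illustration of the non-monotonicity worry, with everything in it assumed for the sake of the example: states are positions 0 to 9, `LANDSCAPE` is a made-up "how dystopian is this" score (lower is better), and `greedy_descent` is a hypothetical helper that always moves to the better neighbor.

```python
# Hypothetical dystopia scores for states 0..9 (lower is better).
# There is a local minimum at state 2 and the global minimum at state 8.
LANDSCAPE = [5, 4, 3, 4, 5, 6, 4, 2, 0, 1]


def dystopia_score(x: int) -> int:
    return LANDSCAPE[x]


def greedy_descent(x: int) -> int:
    """Move to the best neighboring state until no neighbor is strictly better."""
    while True:
        neighbors = [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]
        best = min(neighbors, key=dystopia_score)
        if dystopia_score(best) >= dystopia_score(x):
            return x  # no strictly better neighbor: stop here
        x = best


stuck_at = greedy_descent(0)    # ends in the local minimum
reached = greedy_descent(5)     # happens to start in the right basin
```

Starting from state 0 the greedy walk stops at state 2, which still has a positive score, while starting from state 5 it reaches state 8. Where you start decides whether "slightly closer, repeat" gets anywhere.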

3. Yes (modulo local optima).

To focus on verification only, use:

You start at a random state S.
If D(S): 
    Change S a little.
Else:
    You aligned AI!
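A rough sketch of that verification-only variant, again under toy assumptions: states live on a ring of 20 positions, only position 13 is non-dystopian, and the change to S is an undirected random step. `blind_search`, `toy_D` and `RING_SIZE` are hypothetical names introduced for the example.

```python
import random
from typing import Callable, Tuple

RING_SIZE = 20  # toy state space: positions 0..19 arranged in a ring


def toy_D(s: int) -> bool:
    # Assumption for the example: every state except 13 is dystopian.
    return s != 13


def blind_search(
    s: int,
    is_dystopia: Callable[[int], bool],
    rng: random.Random,
    max_steps: int = 100_000,
) -> Tuple[int, bool, int]:
    """Undirected random changes to S; all of the steering comes from evaluating D."""
    checks = 0
    for _ in range(max_steps):
        checks += 1
        if not is_dystopia(s):
            return s, True, checks
        s = (s + rng.choice((-1, 1))) % RING_SIZE  # 'change S a little'
    checks += 1
    return s, not is_dystopia(s), checks


final, found, checks = blind_search(0, toy_D, random.Random(0))
```

Here the step needs no notion of direction at all, so the whole burden falls on D, and `checks` counts how many evaluations of D the search needed.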