I had a discussion with Paul Christiano about his Iterated Amplification and Distillation scheme. We had a disagreement, one that I believe points to something interesting, so I'm posting it here.

It's a disagreement about the value of the concept of "preserving alignment". To vastly oversimplify Paul's idea, the AI A[n] will check that A[n+1] is still aligned with human preferences; meanwhile, A[n-1] will be checking that A[n] is still aligned with human preferences, all the way down to A[0] and an initial human H that checks on it.

Intuitively, this seems doable - A[n] is "nice", so it seems that it can reasonably check that A[n+1] is also nice, and so on.

But, as I pointed out in this post, it's very possible that A[n] is "nice" only because it lacks power/can't do certain things/hasn't thought of certain policies. So niceness - in the sense of behaving sensibly as an autonomous agent - does not go through the inductive step in this argument.

Instead, Paul confirmed that "alignment" means "won't take unaligned actions, and will assess the decisions of a higher agent in a way that preserves alignment (and preserves the preservation of alignment, and so on)".

This concept does induct properly, but seems far less intuitive to me. It relies on humans, for example, being able to ensure that A[0] will be aligned, that any more powerful copies it assesses will be aligned, that any more powerful copies those copies assess are also aligned, and so on.

Intuitively, for any concept C of alignment for H and A[0], I expect one of four things will happen, with the first three being more likely:

- The C does not induct.
- The C already contains all of the friendly utility function; induction works, but does nothing.
- The C does induct non-trivially, but is incomplete: it's very narrow, and doesn't define a good candidate for a friendly utility function.
- The C does induct in a non-trivial way, the result is friendly, but only one or two steps of the induction are actually needed.

Hopefully, further research should clarify if my intuitions are correct.

I agree that power and capability concerns seem important here. Even if we accept that A[n+1] is not going to fail on C by virtue of being more powerful than A[n], it seems likely to me that A[n] will not be capable enough to assess A[n+1] well enough to give us good guarantees that C holds for A[n+1].

If we look at the probability of C holding over the whole chain of induction, I think things look even worse. Let's say there is a 99.9% likelihood that C will hold on any single step of induction. Then, as we multiply these probabilities over many iterations, the probability that C holds over the whole process falls, and it falls further the more iterations are needed.
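To make the compounding concrete, here is a minimal sketch. It assumes, purely for illustration, that each step fails independently and with the same probability; neither assumption is established in the discussion above, and the function names are hypothetical. Only the 99.9% figure comes from the comment.

```python
def chain_success_probability(p_step: float, n_steps: int) -> float:
    """Probability that C holds on every one of n_steps.

    Assumption (illustrative only): steps are independent and each
    preserves C with the same probability p_step.
    """
    return p_step ** n_steps


def steps_until_below(p_step: float, threshold: float) -> int:
    """Smallest number of steps after which the chance that C still
    holds drops below threshold, under the same independence assumption."""
    prob = 1.0
    n = 0
    while prob >= threshold:
        prob *= p_step
        n += 1
    return n


if __name__ == "__main__":
    p = 0.999  # per-step figure taken from the comment above
    for n in (10, 100, 1000):
        print(n, round(chain_success_probability(p, n), 3))
    # Number of steps before failure somewhere in the chain is more likely than not
    print(steps_until_below(p, 0.5))
```

Under these assumptions, the chance that C survives is still about 90% after 100 steps, but only about 37% after 1000, and failure somewhere in the chain becomes more likely than not after 693 steps. If failures are correlated across steps, the numbers would differ, so this is only a rough picture of the worry.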

Taken together, this suggests a serious challenge. To minimize the risk from power differentials between successors, and so increase the chance of C holding at each step, we would want many small iterations. But many iterations increase the risk that small probabilities of failure in each iteration will compound, such that it eventually becomes likely that C does not hold.

To me, this harkens back to a lesson we learned long ago in engineering: the more moving parts in your system, the more likely it is to fail.

I'm curious what you're imagining here - I don't really know why this would happen or what it would look like. Is it something like "this agent makes a successor that is fully friendly and powerful given resource constraints"?

I'm thinking something like "this utility function is friendly, once we have solved these n specific problems; let's create a few levels of higher intelligence, to solve these specific problems, using certain constraints (physical or motivational) to prevent things going wrong during this process".

If I imagine each level A[n] as maximizing the expected value of some simple utility function, I agree that it would be surprising if the result was not one of your first three cases. Intuitively, either we already have all of the friendly utility function and didn't need induction (case 2), or we didn't and bad things happen (cases 1 and 3).

But it seems like one of the main points of iterated amplification is that at least the initial levels need not be maximizing the expected value of some simple utility. In that case, there seems to be a much wider space of possible designs.

For example, we could have a system that has the epistemic state of wanting to help humans but knowing that it doesn't know how best to do that, and so asking humans for feedback and deferring to them when appropriate. Such a system with amplification might eventually learn the friendly utility function and start maximizing that, but it seems like there could be many iterations before that point, during which it is corrigible in the sense of deferring to humans and not maximizing its current conception of what is best.

I don't have a strong sense at the moment what would happen, but it seems plausible that the induction will go through and will have "actually mattered".