TL;DR: HRH places a theoretical bound on the best reflection process achievable by us. Notably, this is not necessarily guaranteed to converge to human values, nor is it something that is actually implementable in practice (analogously to HCH). In particular, this is intended to argue against claims that it is insufficient to model CEV as the output of some kind of reflection process. One can also think of HRH as an operationalization of the idea of a long reflection.
Thanks to Connor Leahy, Tasmin Leake, and AI_WAIFU for discussion.
One thing we might want to do is to define our CEV (i.e., the ultimate thing we actually want our AGI to optimize for the rest of time) as the output of some long-running deliberation process (a long reflection). This would be extremely convenient; one could imagine having an AGI that lets the reflection process run untampered with, and then implements whatever it decides on. However, one might worry that this could be impossible: perhaps there are kinds of moral progress that can't be captured in the frame of a reflection process, and that require some more sophisticated formalism to capture.
However, consider the reflection process defined as follows. At each iteration, it takes in information from the real world and the utility function output by the previous iteration of reflection, has a human deliberate for a while, and then outputs an improved utility function and, crucially, also an improved reflection process to be used in the next iteration. (You can also tack on the ability for it to interact with the world, which enables it to run experiments, consult a computer, build more powerful aligned AGIs to help, etc.) Let's call this Humans Reflecting on HRH (HRH). This process essentially covers any process we can use to come up with better theories of what our CEV is.
(If it bothers you that it "modifies" itself, you can think of it as a fixed function that takes in an initial program and optional additional inputs, and outputs a new program, and the function just evals the program internally at every step. The higher level algorithm of "have each step determine the algorithm used in the next step" remains constant, and that's the thing I refer to.)
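To make the "fixed function that evals a program at every step" framing concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than part of the proposal: the names `run_hrh`, `naive_step`, and `careful_step` are made up, observations and outcomes are plain floats, and the "human deliberating" is replaced by trivial arithmetic. The only thing it is meant to show is that the outer loop never changes, while each step chooses the step that runs next.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical types for this sketch: a utility function scores an outcome,
# and a reflection step maps (observation, current utility) to
# (improved utility, the step to run next).
Utility = Callable[[float], float]
Step = Callable[[float, Utility], Tuple[Utility, Any]]


def run_hrh(step: Step, utility: Utility, observations: Sequence[float]) -> Utility:
    """The fixed outer algorithm: run the current step, then switch to
    whatever successor step it returned. This loop itself never changes."""
    for obs in observations:
        utility, step = step(obs, utility)
    return utility


def careful_step(obs: float, utility: Utility) -> Tuple[Utility, Step]:
    """Toy stand-in for an improved process: folds the observation into
    the utility function and keeps itself as the next step."""
    def new_utility(outcome: float) -> float:
        return utility(outcome) + obs * outcome
    return new_utility, careful_step


def naive_step(obs: float, utility: Utility) -> Tuple[Utility, Step]:
    """Toy stand-in for the initial process: leaves the utility function
    unchanged, but decides the next iteration should use careful_step."""
    return utility, careful_step


final_utility = run_hrh(naive_step, lambda outcome: 0.0, [1.0, 2.0, 3.0])
print(final_utility(1.0))
```

In this toy run the first iteration only hands off to `careful_step`, so the first observation is never folded in and the final utility scores the outcome `1.0` as `5.0` (from the observations `2.0` and `3.0`). Starting from a worse step costs one iteration here, which is the flavor of overhead the self-replacement construction is meant to tolerate.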
I claim that HRH is the best achievable reflection process up to constant factors. Suppose you could come up with a better reflective process. Then the version of you in HRH would come up with that process too, and replace the next iteration of HRH with that process. A similar argument applies to the choice of who to put in the reflection process; if you can think of a better person to put in the process, then you could have thought of that within HRH and updated the reflection process to use that person instead. A similar argument also applies to how you ensure that the initial conditions of the reflective process are set up in the best way possible, or to ensure that the reflective process is robust to noise, etc, etc.
This construction may feel like cheating, but it exploits the core property that whatever reflection process we come up with, we are using our current reflective process to come up with it.
I expect some people to look at HRH and say "of course it would be aligned; the hard part is that we literally can't implement that", and others to find it woefully inadequate and mutter "have you ever met a human?" Unfortunately, there is a fundamental limitation on what CEVs we can come up with, due to the fact that we bootstrap from humans. There might exist "better" CEVs that we could never think of, even with all the assistance and recursively self-improved reflection processes we can construct for ourselves.
Some remaining difficulties with HRH that I haven't figured out yet:
- Can we extract intermediate outputs from HRH? Possibly using logical inductors to get credences over the final output? Can we somehow put some kind of convergence guarantee on these intermediates?
- How the hell do you actually implement anything remotely like HRH (with or without the logical inductors) in the real world?
- How do we firm up the formalism for cases where the reflection process interacts with the real world? Real-world interactions potentially break things by making the reflection process not a pure function.
A point that doesn't seem to be in the water supply is that even superintelligences won't have (unerringly accurate estimates of) results of CEV to work with. Any predictions of values are Goodhart-cursed proxy values. Predictions that are not value-laden are even worse. So no AGIs that would want to run a CEV would be utility maximizers, and AGIs that are utility maximizers are maximizing something that isn't the CEV of anything, including that of humanity.
Thus utility maximization is necessarily misaligned, not just very hard to align, until enough time has already passed for CEV to run its course, to completion and not merely in foretelling. Which likely never actually happens (reflection is unbounded), so utility maximization can only be approached with increasingly confident mild optimization. And there is currently mostly confusion on what mild optimization does as decision theory.
I agree that in practice you would want to point mild optimization at it, though my preferred resolution (for purely aesthetic reasons) is to figure out how to make utility maximizers that care about latent variables, and then make it try to optimize the latent variable corresponding to whatever the reflection converges to (by doing something vaguely like logical induction). Of course the main obstacles are how the hell we actually do this, and how we make sure the reflection process doesn't just oscillate forever.
This assumes that the output is a utility function. Gotta be careful with that kind of assumption when defining a process meant to capture "the best we could do"; it may turn out that we could, on reflection, come up with a better output than a utility function.