TL;DR: HRH places a theoretical bound on the best reflection process achievable by us. Notably, this is not necessarily guaranteed to converge to human values, nor is it something that is actually implementable in practice (analogously to HCH). In particular, this is intended to argue against claims that it is insufficient to model CEV as the output of some kind of reflection process. One can also think of HRH as an operationalization of the idea of a long reflection.
Thanks to Connor Leahy, Tamsin Leake, and AI_WAIFU for discussion.
One thing we might want to do is to define our CEV (i.e., the ultimate thing we actually want our AGI to optimize for the rest of time) as the output of some long-running deliberation process (a long reflection). This would be extremely convenient; one could imagine having an AGI that lets the reflection process run untampered with, and then implements whatever it decides on. However, one might worry that this could be impossible: perhaps there are kinds of moral progress that can't be captured in the frame of a reflection process, and that require a more sophisticated formalism to capture.
However, consider the reflection process defined as follows. Each iteration takes in information from the real world and the utility function output by the previous iteration of reflection, has a human deliberate for a while, and then outputs an improved utility function and, crucially, an improved reflection process to be used in the next iteration. (You can also tack on the ability for it to interact with the world, which enables it to run experiments, consult a computer, build more powerful aligned AGIs to help, etc.) Let's call this Humans Reflecting on HRH (HRH). This process essentially covers any process we can use to come up with better theories of what our CEV is.
(If it bothers you that it "modifies" itself, you can think of it as a fixed function that takes in an initial program and optional additional inputs, and outputs a new program, and the function just evals the program internally at every step. The higher level algorithm of "have each step determine the algorithm used in the next step" remains constant, and that's the thing I refer to.)
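To make the "fixed outer function" framing concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the names `run_hrh`, `naive_step`, and `careful_step` are hypothetical, a utility function is stood in for by a dict of scores, and "a human deliberates for a while" is replaced by a trivial update rule. The only part meant to mirror the text is the outer loop, which never changes while each step chooses its own successor.

```python
from typing import Callable, Dict, Iterable, Tuple

# Toy stand-ins (assumptions, not part of the original construction):
# a "utility function" is a dict of outcome -> score, and a reflection
# step maps (observation, utility) to (improved utility, next step).
Utility = Dict[str, float]
Step = Callable[[str, Utility], Tuple[Utility, "Step"]]


def run_hrh(initial_step: Step,
            initial_utility: Utility,
            observations: Iterable[str]) -> Tuple[Utility, Step]:
    """Fixed outer loop: each step determines the step used next."""
    step, utility = initial_step, dict(initial_utility)
    for obs in observations:
        utility, step = step(obs, utility)
    return utility, step


def careful_step(obs: str, utility: Utility) -> Tuple[Utility, Step]:
    """Hypothetical 'better' process: weighs each observation by 2."""
    updated = dict(utility)
    updated[obs] = updated.get(obs, 0.0) + 2.0
    return updated, careful_step


def naive_step(obs: str, utility: Utility) -> Tuple[Utility, Step]:
    """Hypothetical initial process: weighs each observation by 1."""
    updated = dict(utility)
    updated[obs] = updated.get(obs, 0.0) + 1.0
    # Toy trigger: on seeing "insight", the deliberator thinks of a
    # better process and installs it for the next iteration.
    next_step = careful_step if obs == "insight" else naive_step
    return updated, next_step


final_utility, final_step = run_hrh(naive_step, {}, ["a", "insight", "a"])
```

In this toy run the process starts with `naive_step`, swaps itself out for `careful_step` after the second observation, and `run_hrh` itself is never modified. Nothing here addresses the hard parts (real deliberation, real-world interaction); it only illustrates the shape of the recursion.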
I claim that HRH is the best achievable reflection process up to constant factors. Suppose you could come up with a better reflection process. Then the version of you in HRH would come up with that process too, and replace the next iteration of HRH with that process. A similar argument applies to the choice of who to put in the reflection process; if you can think of a better person to put in the process, then you could have thought of that within HRH and updated the reflection process to use that person instead. A similar argument also applies to how you ensure that the initial conditions of the reflection process are set up in the best way possible, how you ensure that the reflection process is robust to noise, etc.
This construction may feel like cheating, but it exploits the core property that whatever reflection process we come up with, we are using our current reflective process to come up with it.
I expect some people to look at HRH and say "of course it would be aligned; the hard part is that we literally can't implement that", and others to find it woefully inadequate and mutter "have you ever met a human?" Unfortunately, there is a fundamental limitation on what CEVs we can come up with, due to the fact that we bootstrap from humans. There might exist "better" CEVs that we could never think of, even with all the assistance and recursively self-improved reflection processes we can construct for ourselves.
Some remaining difficulties with HRH that I haven't figured out yet:
- Can we extract intermediate outputs from HRH? Possibly using logical inductors to get credences over the final output? Can we somehow put some kind of convergence guarantee on these intermediates?
- How the hell do you actually implement anything remotely like HRH (with or without the logical inductors) in the real world?
- How do we firm up the formalism for cases where the reflection process interacts with the real world? Real-world interactions potentially break things by making the reflection process not a pure function.