This is a very loose idea.
In Stable Pointers to Value II, I pointed at a loose hierarchy of approaches in which you try to get rid of the wireheading problem by revising the feedback loop to remove the incentive to wirehead. Each revision seems to change the nature of the problem (perhaps to the point where we don't want to call it a wireheading problem, and instead would put it in a more general perverse-instantiation category), but not eliminate problems entirely.
When I talked with Lawrence Chan today, he described a way of solving problems by "going meta" (a strategy which he was mostly suspicious of, in the conversation). His example was: you can't extract human values by specifying the task as a learning problem, because of severe identifiability problems. However, it is not entirely implausible that we can "learn to learn human values": have humans label examples of other humans trying to do things, indicating what values are being expressed in the scenario.
If this goes wrong, you can try to iterate the operation again, learning to learn to learn...
This struck me as similar to the hierarchy I had constructed in my older post.
My interpretation of what Lawrence meant by "going meta" here is this: machine learning research "eats" other research fields by using automated learning to solve problems which were previously being solved by the process of science, i.e., hand-crafting hypotheses and testing them. AI alignment research is full of cases where this doesn't seem like a very good approach. However, one attitude we can take to such cases is to do the operation again: propose to learn how humans would solve this sticky problem.
This is not at all like other learning to learn approaches which merely seek to speed up normal learning. The idea is that our object-level loss function is insufficient to point out the behavior we really want. We want new normative feedback to come in at the meta-level, telling us more about which ways of solving the object-level problem are desirable and which are undesirable.
The idea I'm about to describe seems like a fairly hopeless idea, but I'm interested in seeing how it would go regardless.
What is the fixed point of this particular "go meta" operation?
The intuition is this: any utility function we try to write down has perverse instantiations, so that we don't really want to optimize it fully. Searching over a big space leads to Goodhart and optimization daemons. Unfortunately, search is more or less the only way to produce intelligent behavior that we know of.
However, it seems like we can often improve on this situation by providing more human input to check what was really wanted. Furthermore, it seems like we generally get more by doing this on the meta level -- we don't just want to refine the estimated utility function; we want to refine our notion of safely searching for good options (avoiding searches which Goodhart on looks-good-to-humans by manipulating human psychology, for example), refine our notion of what learning the utility function even means, and so on.
Every stage of going meta introduces a need for yet another search, which brings back the problems all over again. But, maybe we can do something interesting by jumping up all the meta levels here, so that each search is itself governed by some feedback, except when we bottom out in extremely simple operations which we trust.
(This feels conceptually similar to some intuitions in HCH/IDA, but I don't see that it is exactly the same.)
"Recursive quantilization" is an attempt to make the idea a little more formal. I don't think it quite captures everything I would want from the "fixed point of the meta operation Lawrence Chan was suspicious of", but it has the advantage of being slightly more concrete.
Quantilizers are a way of optimizing a utility function when you're suspicious that it isn't the "true" utility function you should be optimizing, but you do think that its average difference from the true utility function is low when sampling things from a known background distribution. Intuitively, you don't want to move too far from the background distribution where your utility estimates are accurate, but you do want to optimize in the direction of high utility somewhat.
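To make this concrete, here is a minimal sketch of a sampling-based quantilizer. It assumes we can draw from the background distribution and can afford to score a finite batch of samples; the names `quantilize`, `sample_base`, `q`, and `n` are mine, chosen for illustration, and the finite batch only approximates the idealized "top q fraction of the distribution".

```python
import random


def quantilize(sample_base, utility, q=0.1, n=1000, rng=None):
    """Approximate q-quantilizer (illustrative sketch, not a reference implementation).

    sample_base: function rng -> candidate, drawing from the background distribution.
    utility:     function candidate -> float, the (possibly mistaken) proxy utility.
    q:           fraction of the best-scoring samples to keep, 0 < q <= 1.
    n:           number of samples drawn from the background distribution.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    # Draw a batch from the background distribution we trust.
    candidates = [sample_base(rng) for _ in range(n)]
    # Rank the batch by proxy utility, best first.
    candidates.sort(key=utility, reverse=True)
    # Keep the top q fraction (at least one candidate) and pick uniformly,
    # rather than returning the single argmax.
    keep = max(1, int(q * n))
    return rng.choice(candidates[:keep])


# Hypothetical usage: plans are real numbers, the background distribution is a
# standard normal, and the proxy utility is the identity.
rng = random.Random(0)
plan = quantilize(lambda r: r.gauss(0.0, 1.0), utility=lambda x: x, q=0.1, n=1000, rng=rng)
print(plan)
```

With `q = 1` this just samples from the background distribution, and as `q` shrinks toward `1/n` it approaches picking the best sample in the batch, so `q` controls how far we are willing to move away from the distribution where we trust the utility estimates.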
What if we want to quantilize, and we expect that there is some background distribution which would make us have a decent amount of trust in the accuracy of the given utility function, but we don't know what that background distribution is?
We have to learn the "safe" background distribution.
Learning is going to require a search for hypotheses matching whatever feedback we get, which re-introduces Goodhart, etc. So, we quantilize that search. But we need a background distribution which we expect to be safe. And so on.
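The regress can be sketched as a recursion that bottoms out in a simple sampler we are willing to trust. This is only a toy under strong assumptions: a level-(k+1) object is taken to be a sampler of level-k objects, `scores[k]` stands in for whatever feedback we have at level k, and every name here (`recursive_quantilize`, `Gaussian`, `trusted_sampler`, `depth`) is hypothetical.

```python
import random
from dataclasses import dataclass


def quantilize(sample_base, score, q, n, rng):
    """Draw n items from the base, keep the top q fraction by score, pick one uniformly."""
    candidates = sorted((sample_base(rng) for _ in range(n)), key=score, reverse=True)
    return rng.choice(candidates[:max(1, int(q * n))])


def recursive_quantilize(level, depth, trusted_sampler, scores, q, n, rng):
    """Return one level-`level` object.

    At the top level (`depth`) the background distribution is the trusted
    sampler. Below that, the background distribution is itself the result of
    a quantilized search one level up.
    """
    if level == depth:
        base = trusted_sampler
    else:
        base = recursive_quantilize(level + 1, depth, trusted_sampler, scores, q, n, rng)
    return quantilize(base, scores[level], q, n, rng)


@dataclass(frozen=True)
class Gaussian:
    """Toy level-1 object: a distribution over plans (plans are real numbers)."""
    mean: float

    def __call__(self, rng):
        return rng.gauss(self.mean, 1.0)


def trusted_sampler(rng):
    # Assumed-safe simple operation: propose a plan distribution with a random mean.
    return Gaussian(mean=rng.uniform(-10.0, 10.0))


scores = [
    lambda plan: plan,              # level 0: proxy utility over plans
    lambda dist: -abs(dist.mean),   # level 1: stand-in for feedback on distributions
]

rng = random.Random(0)
plan = recursive_quantilize(0, 1, trusted_sampler, scores, q=0.1, n=200, rng=rng)
print(plan)
```

Setting `depth = 0` recovers an ordinary quantilizer over the trusted sampler. Each extra level needs its own feedback function in `scores`, which is where the difficulty of getting meaningful human feedback at higher meta levels shows up.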
There are a lot of potential concerns here, but the one which is most salient to me is that humans will have a lot of trouble providing feedback about non-fishy ways of solving the problems at even slightly high meta levels.
Object level: Plans for achieving high utility.
Meta 1: Distributions containing only plans which the utility function evaluates correctly.
Meta 2: Distributions containing only distributions-on-plans which the first-meta-level learning algorithm can be expected to evaluate correctly.
How do you analyze a distribution? Presumably you have to get a good picture of its shape in the highly multidimensional space -- look at examples of more and less typical members, and be convinced that the examples you looked at were representative. It's also important that you go into its code and check that it isn't intelligently optimizing for some misaligned goal.
It seems to me that a massive advance in transparency or informed oversight would be needed in order for humans to give helpful feedback at higher meta-levels.