There's a bit of an analogy between communication and value learning that I've been thinking about recently.
By "Gricean communication," I mean Paul Grice's model of communication as actions based on recursive modeling.
A classic example is me flashing my headlights to communicate that you need to check yours.
Why do I take this weird action of flashing my headlights? Well, because I expect you to check yours. But I don't expect this because I think you have some evolved, unconscious reflex to check your headlights when you see a flashing light; I expect you to model my behavior and deduce that I expect you to check your headlights.
So this simple act invokes several layers of recursive agent-shaped models:
The layer 1 model is "Charlie thinks your lights should be checked, so he flashes his."
Layer 2 is "You know that I'm flashing my lights because I think you should check yours."
Layer 3 is "Charlie knows that you'll interpret the flash as indicating checking your lights."
Layer 4 is "You know that I'm flashing my lights because I expect you to interpret that as a signal about what I'm thinking."
If we could only reason at layer 1, we would only be capable of reasoning about other people as stimulus-response machines - things in the environment with complicated behaviors that are worth learning about, but are more or less black boxes. As we increase the depth of these recursive models of our partner in communication, we're committing ourselves to more informative priors about how we expect people to reason and behave.
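To make the layering concrete, here is a minimal sketch of the tower as a nested data structure. Everything in it is an illustrative assumption rather than part of Grice's or Dennett's account: the `Belief` class, the `build_tower`, `order`, and `describe` helpers, and the listener name "Driver" are all made up for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Belief:
    """`holder` believes `content`; content is a plain claim or another Belief."""
    holder: str
    content: Union[str, "Belief"]


def build_tower(claim: str, depth: int,
                speaker: str = "Charlie", listener: str = "Driver") -> Belief:
    """Wrap `claim` in `depth` alternating layers; the innermost belongs to the speaker."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    holders = (speaker, listener)
    belief = Belief(holders[0], claim)
    for layer in range(1, depth):
        # Each new layer is an agent modeling the previous layer as a sub-model.
        belief = Belief(holders[layer % 2], belief)
    return belief


def order(belief: Belief) -> int:
    """Count the agent-shaped layers in the model."""
    inner = belief.content
    return 1 if isinstance(inner, str) else 1 + order(inner)


def describe(belief: Belief) -> str:
    """Render the nested model as a sentence, outermost layer first."""
    inner = belief.content
    inner_text = inner if isinstance(inner, str) else describe(inner)
    return f"{belief.holder} believes that {inner_text}"


tower = build_tower("the lights need checking", depth=4)
print(order(tower))      # 4
print(describe(tower))
```

The point of the sketch is only structural: each layer contains the previous one as a strict sub-model, so a deeper tower commits to more assumptions about how the other agent reasons.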
(See also previous similar discussion on LW)
Now, if we wanted, we could keep adding layers in an infinite tower. But in practice we rarely need many layers in the real world, and so we don't use them. As Dennett puts it in an excellent paper about Gricean communication among monkeys:
A fourth-order system might want you to think it understood you to be requesting that it leave. How high can we human beings go? "In principle," forever, no doubt, but in fact I suspect that you wonder whether I realize how hard it is for you to be sure that you understand whether I mean to be saying that you can recognize that I can believe you to want me to explain that most of us can keep track of only about five or six orders, under the best of circumstances.
What's the relation to value learning? Well, human values aren't written down on a stone tablet somewhere; they're inside humans. But our values aren't written down in plain FORTRAN on the inside of our frontal lobes, either - they can be pointed to only as elements of some model of humans and the surrounding environment.
If values are supposed to live in a model of humans, this raises the question "which model of humans?" And the answer is "the one humans think is best" - and now we're going to need those recursive agent-shaped models.
And why stop there? Why not locate the best way to model humans as having preferences about models of humans, by consulting a model of humans with preferences over preferences over models?
There are a couple of different directions to go from here. One way is to try to collapse the recursion: find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to real humans by some metric), and use the preferences of that model. This is what I see as the endgame of the imitation / bootstrapping research.
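As a toy picture of what "a fixed point of model ratification" could mean, here is a sketch under heavy assumptions: a "model" is squashed down to a single number, and `toy_ratify` is a made-up map that says which model a given model's preferences would endorse. Neither corresponds to any real proposal; the sketch only shows the shape of the iteration.

```python
from typing import Callable


def find_fixed_point(ratify: Callable[[float], float], start: float,
                     tol: float = 1e-9, max_steps: int = 1000) -> float:
    """Repeatedly ask the current model which model it endorses, until it endorses itself."""
    model = start
    for _ in range(max_steps):
        endorsed = ratify(model)
        if abs(endorsed - model) < tol:
            return endorsed
        model = endorsed
    raise RuntimeError("ratification did not converge")


def toy_ratify(model: float) -> float:
    """Hypothetical ratification map; a contraction whose fixed point is 0.4."""
    return 0.5 * model + 0.2


print(round(find_fixed_point(toy_ratify, start=0.0), 6))  # 0.4
```

The toy map converges only because it is a contraction. Nothing guarantees that a real ratification process has this property, and a map like `lambda m: 2 * m + 1` makes the same loop give up with an error.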
Another way might be to imitate communication, and find a way to use recursive models such that we can stop the recursion early without much loss in effectiveness. In communication, the innermost layer of the model can be quite simplistic, and then the next is more complicated by virtue of taking advantage of the first, and so on. At each layer you can do some amount of abstracting away of the details of previous layers, so by the time you're at layer 4 maybe it doesn't matter that layer 1 was just a crude facsimile of human behavior.
The main problem with both of these is the difficulty of converting human behavior, at any layer of recursion, into a format worthy of being called preferences. Humans are inconsistent at the object level, and incapable of holding entire models of themselves in their heads in order to have specific thoughts about them.
Of course, it's fine if "preferences" aren't a big lookup table, but are instead some computation that takes in options and outputs responses. But we don't have good enough ways to think about what happens when the outputs are inconsistent.
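Here is a small sketch of one kind of inconsistency such a computation can produce: pairwise choices that form a cycle. The `intransitive_triples` helper and the drink examples are invented for illustration. The sketch only detects the cycles; it says nothing about how to resolve them, which is the part we lack good tools for.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def intransitive_triples(options: Sequence[str],
                         prefers: Callable[[str, str], bool]
                         ) -> List[Tuple[str, str, str]]:
    """Return every triple (x, y, z) where x beats y, y beats z, and z beats x."""
    cycles = []
    for a, b, c in combinations(options, 3):
        # An unordered triple can form a cycle in two orientations; check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefers(x, y) and prefers(y, z) and prefers(z, x):
                cycles.append((x, y, z))
    return cycles


# Hypothetical observed choices, stored as (chosen, rejected) pairs.
cyclic_choices = {("tea", "coffee"), ("coffee", "water"), ("water", "tea")}
consistent_choices = {("tea", "coffee"), ("coffee", "water"), ("tea", "water")}
drinks = ["tea", "coffee", "water"]

print(intransitive_triples(drinks, lambda a, b: (a, b) in cyclic_choices))
# [('tea', 'coffee', 'water')]
print(intransitive_triples(drinks, lambda a, b: (a, b) in consistent_choices))
# []
```

Here `prefers` is exactly the "computation that takes in options and outputs responses": the checker never looks inside it, so the same test applies whether the preferences come from a lookup table or from something far more complicated.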
The best humans can do, it often feels like, is to espouse mostly-consistent general principles that are insufficient to pin down any one preference exactly, and might change day to day. If something can do complicated ratification of models of human preferences about models, that's so not-human-like that I worry that something even more inhuman is going on under the hood.
This is why I'm interested in the analogy to Gricean communication, where you just need a few layers of models and they get less complicated the more recursive they are. It sort of seems like what I do when I think about metaethics.
The analogy isn't all that exact, though, because in communication the actual communicator contains the recursive model as a part of itself, and the model contains the deeper layers as strict sub-models. But it feels like in our current picture of meta-preferences, the actual preferrer is contained inside the meta-preferrer, which is contained inside the meta-meta-preferrer, and the recursion explodes rather than converging.
So yeah. If anyone else out there is thinking about technical ways to talk about meta-preferences, I'm interested in how you're trying to think about the problem.