I proposed a way around Goodhart's curse. Essentially, it reduces to properly accounting for all of our uncertainty about our values, including some meta-uncertainty about whether we've properly accounted for all our uncertainty.
Wei Dai had some questions about the approach, pointing out that it seemed to have a problem similar to corrigibility's: once the AI has resolved all uncertainty about our values, there's nothing left. I responded by talking about fuzziness rather than uncertainty.
We have a human H, who hasn't yet dedicated any real thought to population ethics. We run a hundred "reasonable" simulations where we introduce H to population ethics, varying the presentation a bit, and ultimately ask for their opinion.
In 45 of these runs they endorsed total utilitarianism; in 15, average utilitarianism; and in 40, some compromise system (say, the one I suggested here).
That's it. There is no more uncertainty; we know everything there is to know about H's potential opinions on population ethics. What we do with this information - how we define H's "actual" opinion - is up to us (neglecting, for the moment, the issue of H's meta-preferences, which likely suffer from a similar type of ambiguity).
We could round these preferences to "total utilitarianism". That would be the sharpest option.
We could normalise those three utility functions, then add them with the 45-15-40 relative weights.
Or we could do a similar normalisation but, mindful of the fragility of value, either move the three options to equal 1-1-1 weights, or stick with 45-15-40 but use some smooth minimum when combining them. These would be the fuzzier choices.
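To make these options concrete, here's a minimal sketch. The utility numbers, the particular normalisation scheme, and the temperature parameter are all illustrative assumptions, not anything fixed by the argument above:

```python
import numpy as np

def normalise(u):
    """Rescale a utility vector to mean zero and unit range
    (one simple normalisation scheme among many)."""
    u = np.asarray(u, dtype=float)
    return (u - u.mean()) / (u.max() - u.min())

def linear_mix(utilities, weights):
    """The sharper option: normalise each theory's utility,
    then combine linearly with the given weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.stack([normalise(u) for u in utilities])

def smooth_min_mix(utilities, weights, temp=0.1):
    """The fuzzier option: a weighted smooth minimum,
    -temp * log(sum_i w_i * exp(-u_i / temp)).
    Small temp approaches the hard minimum over theories;
    large temp approaches linear_mix."""
    U = np.stack([normalise(u) for u in utilities])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return -temp * np.log(w @ np.exp(-U / temp))

# Made-up utilities of the three theories over four example worlds.
total = [10.0, 2.0, 5.0, 7.0]
average = [3.0, 8.0, 4.0, 6.0]
compromise = [6.0, 5.0, 5.0, 6.5]
weights = [45, 15, 40]  # the simulation frequencies from the post

print(linear_mix([total, average, compromise], weights))
print(smooth_min_mix([total, average, compromise], weights))
```

The smooth minimum never exceeds the linear mix (by Jensen's inequality), which is what makes it the more conservative, fuzzier combination: it penalises worlds that any of the component theories strongly dislikes.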
All of these options are valid, given that we haven't defined any way of resolving ambiguous situations like this. And note that fuzziness looks a lot like uncertainty: a high-fuzziness mix looks like the utility function you'd have if you were very uncertain. But, unlike uncertainty, acquiring more information doesn't "resolve" the fuzziness. That's why Jessica's critique of corrigibility doesn't apply to this situation.
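The formal similarity, and the key difference, can be seen in a small sketch. All the numbers here, including the likelihoods, are made up purely for illustration:

```python
import numpy as np

# Hypothetical utilities of three moral theories over two example worlds.
U = np.array([
    [10.0, 4.0],  # total utilitarianism
    [6.0, 8.0],   # average utilitarianism
    [8.0, 7.0],   # a compromise system
])
w = np.array([0.45, 0.15, 0.40])  # the 45-15-40 weights

def mixture(weights, utilities):
    """A weighted mix of utilities: the same formula whether the
    weights encode uncertainty or fuzziness."""
    return weights @ utilities

# Treated as *uncertainty*, evidence updates the weights by Bayes' rule
# (these likelihoods are invented for the example)...
likelihood = np.array([0.90, 0.05, 0.05])
posterior = w * likelihood / (w * likelihood).sum()

# ...and the mixture collapses towards a single theory:
before = mixture(w, U)
after = mixture(posterior, U)

# Treated as *fuzziness*, the weights are part of the definition of the
# values themselves: no observation updates them, so `before` is final.
```

The decision-relevant object is identical in both readings; what differs is whether new information is allowed to move the weights.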
(And note also that we could introduce fuzziness for different reasons: because we believe it is a genuinely good way of resolving competing values, to cover uncertainty that would be too dangerous to have the AI resolve, or to avoid potential Goodhart problems, without believing that the fuzziness is "real".)
The picture where we have 45-15-40 weights on well-defined moral theories is not a realistic starting point for establishing human values. We humans start mainly with partial preferences, or just lists of examples of correct and incorrect behaviour in a narrow span of circumstances.
Extrapolating from these examples to a weighting on moral theories is a process that is entirely under human control. We decide how to do so, thus incorporating our meta-preferences implicitly in the process and its outcome.
Consider the supervised learning task of separating photos of dogs from photos of non-dogs. We hand the neural net a bunch of labelled photos, and tell it to go to work. It now has to draw a conceptual boundary around "dog".
What is the AI's concept of "dog" ultimately grounded on? It's obviously not just on the specific photos we handed it - that way lies overfitting and madness.
But nor can we generate every possible set of pixels and have a human label them as dog or non-dog. Take for example the following image:
That, apparently, is a cat, but I've checked with people at the FHI, and we consistently misidentified it as a dog. However, a sufficiently smart AI might be able to detect some implicit cat-like features that aren't salient to us, and correctly label it as non-dog.
Thus, in order to correctly identify the term "dog", defined by human labelling, the AI has to disagree with... human labelling. There are more egregious non-dogs that could get labelled as "dogs", such as a photo of a close friend with a sign that says "Help! They'll let me go if you label this image as a dog".
When we program a neural net to classify dogs, we make a lot of choices: the size of the neural net, the activation functions and other hyperparameters, the size and contents of the training, test, and validation sets, whether to tweak the network after the first run, whether to publish the results or bury them, and so on.
Some of these choices can be seen as exactly the "fuzziness" I defined above: some options determine whether the boundary is drawn tightly or loosely around the examples of "dog", and whether ambiguous options are pushed to one category or allowed to remain ambiguous. But other choices - such as methods for avoiding sampling biases, or for handling adversarial examples like a panda being classified as a gibbon - are much more complicated than just "sharp versus fuzzy". I'll call these "extrapolation choices", as they determine how the AI extrapolates from the examples we have given it.
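As a toy illustration of one "sharp versus fuzzy" choice, a single temperature hyperparameter in a softmax over class scores controls how tightly the boundary is drawn around ambiguous inputs. The scores and temperatures below are arbitrary:

```python
import numpy as np

def label_distribution(logits, temp=1.0):
    """Softmax over class scores. Low temp gives a sharp, near one-hot
    boundary; high temp gives a fuzzy one that preserves ambiguity."""
    z = np.asarray(logits, dtype=float) / temp
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = [2.1, 1.9]  # dog vs non-dog logits for an ambiguous image

print(label_distribution(scores, temp=0.05))  # sharp: ~[0.98, 0.02]
print(label_distribution(scores, temp=5.0))   # fuzzy: ~[0.51, 0.49]
```

Extrapolation choices, by contrast, don't live on any one-dimensional dial like this; they are structural decisions about how to generalise at all.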
The same will apply to AIs estimating human preferences. So we have three types of things here:

- Uncertainty: facts about the world and about human behaviour, which more information can resolve.
- Fuzziness: how sharply or loosely the boundaries of our values are drawn, which more information does not resolve.
- Extrapolation choices: the methods used to go from partial preferences and examples to a full preference system.
So when I wrote that to avoid Goodhart problems "The important thing is to correctly model my uncertainty and overconfidence", I can now refine that into: the important thing is to correctly model my fuzziness.
Neat and elegant! However, to make it more applicable, I unfortunately need to extend it in a less elegant fashion: the important thing is to correctly model my fuzziness, and to extrapolate from my partial preferences according to my extrapolation desiderata.
Note that there is no longer any deep need to model "my" uncertainty. It is still important to model uncertainty about the real world correctly, and if I'm mistaken about the real world, this may be relevant to what I believe my extrapolation desiderata are. But modelling my uncertainty is merely instrumentally useful, whereas modelling my fuzziness is a terminal goal if we want to get this right.
As a minor example of the challenge here, consider that such a process would have needed to detect that adversarial examples were problematic, before anyone had conceived of the idea.
I won't develop this too much more here, as the ideas will be included in my research agenda whose first draft should be published here soon.
Nice post. I suspect you'll still have to keep emphasizing that fuzziness can't play the role of uncertainty in a human-modeling scheme (like CIRL), and is instead a way of resolving human behavior into a utility function framework. Assuming I read you correctly.
I think that there are some unspoken commitments that the framework of fuzziness makes for how to handle extrapolating irrational human behavior. If you represent fuzziness as a weighting over utility functions that gets aggregated linearly (i.e. into another utility function), this is useful for the AI making decisions but can't be the same thing that you're using to model human behavior, because humans are going to take actions that shouldn't be modeled as utility maximization.
To bridge this gap from human behavior to utility function, what I'm interpreting you as implying is that you should represent human behavior in terms of a patchwork of utility functions. In the post you talk about frequencies in a simulation, where small perturbations might lead a human to care about the total or about the average. Rather than the AI creating a context-dependent model of the human, we've somehow taught it (this part might be non-obvious) that these small perturbations don't matter, and should be "fuzzed over" to get a utility function that's a weighted combination of the ones exhibited by the human.
But we could also imagine unrolling this as a frequency over time, where an irrational human sometimes takes the action that's best for the total and other times takes the action that's best for the average. Should a fuzzy-values AI represent this as the human acting according to different utility functions at different times, and then fuzzing over those utility functions to decide what is best?
I'm not basing this on behaviour (because that doesn't work, see: https://arxiv.org/abs/1712.05812 ), but on partial models.