The mathematical result is clear: you cannot deduce human preferences merely by observing human behaviour (even with simplicity priors).
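A minimal sketch of why this is so, under toy assumptions: the same observed behaviour can be produced by very different pairings of a planner (how preferences get turned into actions) and a reward (what is wanted). The action names, planner functions and reward tables below are all hypothetical illustrations, not part of the formal result.

```python
# Hypothetical toy setting: three possible actions and one observed choice.
ACTIONS = ["tea", "coffee", "water"]
observed_choice = "tea"

def rational_planner(reward):
    """Pick the action with the highest reward."""
    return max(ACTIONS, key=lambda a: reward[a])

def anti_rational_planner(reward):
    """Pick the action with the lowest reward."""
    return min(ACTIONS, key=lambda a: reward[a])

def constant_planner(reward):
    """Ignore the reward entirely and always pick the same action."""
    return "tea"

# Hypothetical reward tables.
likes_tea = {"tea": 1.0, "coffee": 0.0, "water": 0.0}
hates_tea = {a: -r for a, r in likes_tea.items()}
indifferent = {a: 0.0 for a in ACTIONS}

candidates = [
    ("rational + likes tea", rational_planner, likes_tea),
    ("anti-rational + hates tea", anti_rational_planner, hates_tea),
    ("constant + indifferent", constant_planner, indifferent),
    ("rational + hates tea", rational_planner, hates_tea),
]

# Keep every (planner, reward) pair that reproduces the observed behaviour.
compatible = [
    name for name, planner, reward in candidates
    if planner(reward) == observed_choice
]
print(compatible)
```

The first three pairs all reproduce the observation while disagreeing completely about what the agent wants; behaviour alone does not pick one of them out.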
Yet many people instinctively reject this result; even I found it initially counter-intuitive. And you can make a very strong argument that it's wrong. It would go something like this:
"I, a human H, can estimate what human K wants, just by observing their behaviour. And these estimations have evidence behind them: K will often agree that I've got their values right, and I can use this estimation to predict K's behaviour. Therefore, it seems I've done the impossible: go from behaviour to preferences."
This is how I interpret what's going on here. Humans (roughly) have empathy modules E, which allow them to estimate the preferences of other humans, and prediction modules P, which use the output of E to predict those other humans' behaviour. Since evolution is colossally lazy, these modules don't vary much from person to person.
So, for $h_K$ a history of human K's behaviour in typical circumstances, the modules of two humans H and J will give similar answers:

$$E_H(h_K) \approx E_J(h_K),$$

and similarly for the predictions made by $P_H$ and $P_J$.
Moreover, when humans turn their modules to their own behaviour, they get similar results. The human K will have privileged access to their own deliberations; so define $\check{h}_K$ as the internal history of K. Thus:

$$E_K(\check{h}_K) \approx E_H(h_K) \approx E_J(h_K).$$
This idea connects with partial preferences/partial models in the following way: $E_K(\check{h}_K)$ gives K access to their own internal models and preferences; so the approximately-equal symbols above mean that, by observing the behaviour of other humans, we have approximate access to their own internal models.
Then P just takes the results of E to predict future behaviour; since E and P have co-evolved, it's no surprise that P would have a good predictive record.
So, given E, it is true that a human can estimate the preferences of another human, and, given P, it is true they can use this knowledge to predict behaviour.
So, what are the problems here? There are three:
So both are correct: my result (without assumptions, you cannot go from human behaviour to preferences) and the critique (given these assumptions that humans share, you can go from human behaviour to preferences).
And when it comes to humans predicting humans, the critique is more valid: listening to your heart/gut is a good way to go. But when it comes to programming potentially powerful AIs that could completely transform the human world in strange and unpredictable ways, my negative result is more relevant than the critique is.
I've had some disagreements with people that boil down to me saying "without assuming A, you cannot deduce B", and them responding "since A is obviously true, B is true". I then go on to say that I am going to assume A (or define A to be true, or whatever).
At that point, we don't actually have a disagreement. We're saying the same thing (accept A, and thus accept B), with a slight difference of emphasis: I'm more "moral anti-realist" (we choose to accept A, because it agrees with our intuition), while they are more "moral realist" (A is true, because it agrees with our intuition). It's not particularly productive to dig further.
There are some interesting practical consequences to this analysis. Suppose, for example, that someone is programming a clickbait detector. They then gather a whole collection of clickbait examples, train a neural net on them, and fiddle with the hyperparameters till the classification looks decent.
But neither "gathering a whole collection of clickbait examples" nor "the classification looks decent" is a fact about the universe: both are judgements of the programmers. The programmers are using their own E and P modules to establish that certain articles are a) likely to be clicked on, but b) not what the clicker would really want to read. So the whole process is entirely dependent on programmer judgement. It might feel like "debugging", or "making reasonable modelling choices", but it's actually injecting the programmers' judgements into the system.
And that's fine! We've seen that different people have similar judgements. But there are two caveats: first, not everyone will agree, because there is not perfect agreement between the empathy modules. The programmers should be careful as to whether this is an area of very divergent judgements or not.
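One rough way to check whether a labelling task is an area of divergent judgements is to have two people label the same items and measure how much they agree beyond chance. The sketch below computes Cohen's kappa in plain Python; the function name and the two label lists are made-up illustrations, and kappa is just one possible agreement measure, not something the argument above depends on.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two labellers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeller assigned labels independently,
    # at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        # Both labellers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two programmers (1 = clickbait, 0 = not).
programmer_1 = [1, 1, 0, 0, 1, 0, 1, 0]
programmer_2 = [1, 1, 0, 0, 1, 0, 0, 1]

kappa = cohen_kappa(programmer_1, programmer_2)
print(kappa)  # 0.5
```

A kappa near 1 suggests the programmers' empathy modules mostly agree on this task; a value near 0 suggests the "ground truth" being fed to the system is largely one person's judgement.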
And second, these results will likely not generalise well to new distributions. That's because having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well. At all.
Hence we should expect poor generalisation from such methods, to other situations and (sometimes) to other humans. In my opinion, if programmers are more aware of these issues, their systems will generalise better.
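As a toy illustration of the generalisation problem (a sketch only, with invented headlines and a deliberately crude keyword detector standing in for the neural net), a detector fitted to the programmers' typical examples can look perfect on similar data and fail completely once the style of clickbait shifts:

```python
def learn_clickbait_words(examples):
    """Collect words seen in clickbait headlines but never in normal ones."""
    clickbait_words, normal_words = set(), set()
    for headline, is_clickbait in examples:
        words = set(headline.lower().split())
        (clickbait_words if is_clickbait else normal_words).update(words)
    return clickbait_words - normal_words

def predict(headline, clickbait_words):
    """Flag a headline if it contains any learned clickbait word."""
    return any(w in clickbait_words for w in headline.lower().split())

def accuracy(examples, clickbait_words):
    """Fraction of examples where the prediction matches the label."""
    hits = sum(predict(h, clickbait_words) == y for h, y in examples)
    return hits / len(examples)

# All headlines and labels below are invented for illustration.
train = [
    ("you won't believe what happened next", True),
    ("ten shocking secrets doctors hide", True),
    ("council approves new budget", False),
    ("rain expected across the region", False),
]
typical_test = [
    ("shocking budget secrets revealed", True),
    ("new budget approved", False),
]
shifted_test = [
    ("this one trick changes everything", True),
    ("what doctors recommend for flu season", False),
]

words = learn_clickbait_words(train)
typical_acc = accuracy(typical_test, words)
shifted_acc = accuracy(shifted_test, words)
print(typical_acc, shifted_acc)  # 1.0 0.0
```

The labels in both test sets are again programmer judgements; the detector only ever learned surface features that happened to track those judgements in the typical data.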
I'd consider the Star Trek universe to be much more typical than, say, 7th century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th century China will have very alien ways of thinking about society, hierarchy, good behaviour, and so on. But that is still very typical compared with the truly alien beings that can exist in the space of all possible minds. ↩︎
For instance, Americans will typically explain a certain behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstance (Miller, Joan G. "Culture and the development of everyday social explanation." Journal of personality and social psychology 46.5 (1984): 961). ↩︎
The problem with the maths is that it does not correlate 'values' with any real-world observable. You give all objects a property and say that this property is distributed according to a simplicity prior. You have not yet specified how these 'values' things relate to any real-world phenomenon in any way. Under this model, you could never see any evidence that humans don't 'value' maximizing paperclips.
To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don't have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, instead of simulating every transistor, you can think in terms of folders and files.
I can substitute words in the 'proof' that humans don't have values, and get a proof that computers don't have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate into a confidence that the two are uncorrelated.
Making a somewhat naive and not formally specified assumption along the lines of, "the real action taken optimizes human values better than most possible actions" will get you a meaningful but not perfect definition of 'values'. You still need to say exactly what a "possible action" is.
Making a somewhat naive and not formally specified assumption along the lines of, "the files are what you see when you click on the file viewer" will get you a meaningful but not perfect definition of 'files'. You still need to say exactly what a "click" is. And how you translate a pattern of photons into a 'file'.
We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.
I like this analogy. Probably not best to put too much weight on it, but it has some insights.