Is my result wrong? Maths vs intuition vs evolution in learning human preferences

The problem with the maths is that it does not tie 'values' to any real-world observable. You give all objects a property, and you say that this property is distributed according to a simplicity prior. You have not yet specified how these 'values' relate to any real-world phenomenon at all. Under this model, you could never see any evidence that humans don't 'value' maximizing paperclips.
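
To make the objection concrete, here is a minimal Bayesian sketch. It assumes a handful of hypothetical value hypotheses (`human_flourishing`, `maximize_paperclips`, `minimize_human_flourishing`) with made-up prior weights standing in for a simplicity prior. If every observation is equally likely under every hypothesis, which is what leaving 'values' unconnected to observables amounts to, then the posterior equals the prior no matter how much behaviour you observe.

```python
from fractions import Fraction

# Hypothetical value hypotheses. The weights are illustrative stand-ins
# for a simplicity prior, not computed from any real complexity measure.
PRIOR = {
    "human_flourishing": Fraction(1, 2),
    "maximize_paperclips": Fraction(1, 4),
    "minimize_human_flourishing": Fraction(1, 4),
}


def posterior(prior, likelihood, observations):
    """Bayes' rule: P(v | obs) is proportional to P(v) times the product of P(o | v)."""
    unnormalized = {}
    for values, p in prior.items():
        weight = p
        for obs in observations:
            weight *= likelihood(obs, values)
        unnormalized[values] = weight
    total = sum(unnormalized.values())
    return {v: w / total for v, w in unnormalized.items()}


def unlinked_likelihood(obs, values):
    # 'values' are never tied to anything observable, so every observation
    # gets the same probability whatever the values are.
    return Fraction(1, 2)


observations = ["helps a stranger", "ignores a paperclip", "builds a hospital"]
print(posterior(PRIOR, unlinked_likelihood, observations) == PRIOR)  # True
```

Any update away from the paperclip hypothesis would have to come from a likelihood that actually depends on the values, which is exactly the link the maths leaves unspecified.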

To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don't have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, then instead of simulating every transistor, you can think in terms of folders and files.

I can substitute words in the 'proof' that humans don't have values and get a proof that computers don't have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate descriptions into confidence that the two are uncorrelated. Making a somewhat naive, not formally specified assumption along the lines of "the action actually taken optimizes human values better than most possible actions" will get you a meaningful but imperfect definition of 'values'. You still need to say exactly what a "possible action" is.
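
One crude way to cash out the "better than most possible actions" assumption is sketched below, under loudly hypothetical choices: a hand-picked list of possible actions, hand-made scores for two candidate value functions, and a threshold of 0.5. A value function is kept only if the observed action scores strictly above more than half of the alternatives.

```python
# Hypothetical possible actions in one situation, and hypothetical value
# functions scoring them. None of these numbers come from real data.
POSSIBLE_ACTIONS = ["eat lunch", "call a friend", "make paperclips", "nap", "go for a walk"]

CANDIDATE_VALUES = {
    "enjoys_social_life": {
        "eat lunch": 3, "call a friend": 5, "make paperclips": 0, "nap": 2, "go for a walk": 4,
    },
    "maximize_paperclips": {
        "eat lunch": 0, "call a friend": 0, "make paperclips": 10, "nap": 0, "go for a walk": 0,
    },
}


def fraction_beaten(values, chosen, actions):
    """Fraction of the other possible actions that the chosen action scores strictly above."""
    others = [a for a in actions if a != chosen]
    beaten = sum(1 for a in others if values[chosen] > values[a])
    return beaten / len(others)


def consistent_values(candidates, chosen, actions, threshold=0.5):
    """Keep value functions under which the observed action beats more than `threshold` of the alternatives."""
    return [
        name
        for name, values in candidates.items()
        if fraction_beaten(values, chosen, actions) > threshold
    ]


print(consistent_values(CANDIDATE_VALUES, "call a friend", POSSIBLE_ACTIONS))
# ['enjoys_social_life']
```

Which candidates survive depends on what goes into `POSSIBLE_ACTIONS`, so the sketch inherits exactly the open question above: what counts as a possible action.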

Making a somewhat naive, not formally specified assumption along the lines of "the files are what you see when you click on the file viewer" will get you a meaningful but imperfect definition of 'files'. You still need to say exactly what a "click" is, and how you translate a pattern of photons into a 'file'.
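
The same dependence shows up for files. In this toy sketch (the byte layout and both viewer functions are hypothetical), the disk is just bytes, and which 'files' exist depends on which viewer you specify; the viewer plays the role of the unspecified "click".

```python
# A toy "disk": raw bytes with no intrinsic notion of files.
DISK = b"notes.txt\x00hello\x00todo.txt\x00buy milk\x00"


def viewer_null_separated_pairs(raw):
    """Hypothetical viewer: read alternating name/content fields split on NUL bytes."""
    fields = raw.split(b"\x00")[:-1]  # drop the empty field after the trailing NUL
    return {fields[i].decode(): fields[i + 1].decode() for i in range(0, len(fields) - 1, 2)}


def viewer_single_blob(raw):
    """A different hypothetical viewer: treat the whole disk as one unnamed file."""
    return {"disk.img": raw.decode("latin-1")}


print(viewer_null_separated_pairs(DISK))  # {'notes.txt': 'hello', 'todo.txt': 'buy milk'}
print(list(viewer_single_blob(DISK)))     # ['disk.img']
```

Neither viewer is wrong about the bytes; they disagree only about the high-level description, which is the part the assumption has to fix.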

So if you were running a quantum simulation of the universe, getting values out of a virtual human would be the same type of problem as getting files off a virtual computer.

I'd consider the Star Trek universe to be much more typical than, say, 7th-century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th-century China will have very alien ways of thinking about society, hierarchy, good behaviour, and so on. But even that is still very typical compared with the truly alien beings that can exist in the space of all possible minds.
For instance, Americans will typically explain a given behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstances (Miller, Joan G. "Culture and the development of everyday social explanation." Journal of Personality and Social Psychology 46.5 (1984): 961).

AI ALIGNMENT FORUM
AF

Evolution and empathy modules

The problems

A note on assumptions

In practice: debugging and injecting moral preferences