Practical consequences of impossibility of value learning

by Stuart Armstrong, 2nd Aug 2019

There is a No Free Lunch result in value-learning. Essentially, you can't learn the preferences of an agent from its behaviour unless you make assumptions about its rationality, and you can't learn its rationality unless you make assumptions about its preferences.

More importantly, simplicity/Occam's razor/regularisation don't help with this, unlike with most No Free Lunch theorems. Among the simplest explanations of human behaviour are:

  1. We are always fully rational all the time.
  2. We are always fully anti-rational all the time.
  3. We don't actually prefer anything to anything.
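As a toy sketch of why behaviour alone cannot separate these three explanations, consider an agent that always clicks. The planner and reward names below, and the two-action setup, are illustrative assumptions rather than part of the formal result. Each (planner, reward) pair predicts exactly the same observed choices.

```python
# Toy illustration (all names hypothetical): three (planner, reward) pairs
# that reproduce the same observed behaviour.

ACTIONS = ["click", "ignore"]

# Observed behaviour: the agent clicks in every situation.
observed = {"headline_a": "click", "headline_b": "click"}

def rational(reward, situation):
    """Pick the action with the highest reward."""
    return max(ACTIONS, key=lambda a: reward(situation, a))

def anti_rational(reward, situation):
    """Pick the action with the lowest reward."""
    return min(ACTIONS, key=lambda a: reward(situation, a))

def make_indifferent(policy):
    """Build a planner that ignores the reward and follows a fixed policy."""
    def planner(reward, situation):
        return policy[situation]
    return planner

def likes_clicking(situation, action):
    return 1.0 if action == "click" else 0.0

def hates_clicking(situation, action):
    return -likes_clicking(situation, action)

def no_preferences(situation, action):
    return 0.0

explanations = {
    "fully rational, likes clicking": (rational, likes_clicking),
    "fully anti-rational, hates clicking": (anti_rational, hates_clicking),
    "no preferences at all": (make_indifferent(observed), no_preferences),
}

def predicted_behaviour(planner, reward):
    """Behaviour the (planner, reward) pair predicts in each observed situation."""
    return {s: planner(reward, s) for s in observed}

for name, (planner, reward) in explanations.items():
    matches = predicted_behaviour(planner, reward) == observed
    print(f"{name}: matches observed behaviour = {matches}")
```

All three pairs fit the data perfectly, so choosing between them requires an assumption that the behaviour itself does not supply.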

That result, though mathematically valid, seems highly theoretical, and of little practical interest - after all, for most humans, it's obvious what other humans want, most of the time. But I'll argue that the result has strong practical consequences.

Identifying clickbait

Suppose that Facebook or some other corporation decides to cut down on the amount of clickbait on its feeds.

This shouldn't be too hard, the programmers reason. They start by selecting a set of clickbait examples and checking how people engage with them. They programme a neural net to recognise that kind of "engagement" on other posts, which nets a large number of candidate clickbait posts. They then go through the candidates, labelling the clear examples of clickbait and the clear non-examples, and add these to the training and test sets. They retrain and improve the neural net. A few iterations later, their neural net is well trained, and they let it run on all posts, occasionally auditing the results. To make the process more transparent, they run interpretability methods on the neural net, trying to isolate the key components of clickbait and to clear away some errors or over-fits - maybe the equivalent, for clickbait, of removing the "look for images of human arms" in the dumbbell identification nets.
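The loop described above might be sketched roughly as follows. This is a minimal illustration, not the programmers' actual system: a toy word-count scorer stands in for the neural net, and `seed_examples`, `human_label` and `n_rounds` are hypothetical names for inputs that the programmers themselves have to supply.

```python
from collections import Counter

def train(labelled):
    """Toy 'model': word score = times seen in clickbait minus times seen elsewhere."""
    scores = Counter()
    for text, is_bait in labelled:
        for word in text.lower().split():
            scores[word] += 1 if is_bait else -1
    return scores

def predict(model, text):
    """Flag a post as candidate clickbait if its words score positive overall."""
    return sum(model[w] for w in text.lower().split()) > 0

def labelling_loop(seed_examples, unlabelled, human_label, n_rounds):
    """Train, collect candidates, have a human label them, retrain, and repeat."""
    labelled = list(seed_examples)
    pool = list(unlabelled)
    model = train(labelled)
    for _ in range(n_rounds):
        candidates = [t for t in pool if predict(model, t)]
        if not candidates:
            break
        for text in candidates:
            labelled.append((text, human_label(text)))  # a human judgement call
            pool.remove(text)
        model = train(labelled)
    return model

# Made-up data, purely for illustration.
seed = [("you won't believe this trick", True),
        ("council publishes budget report", False)]
pool = ["this trick will shock you",
        "budget report delayed",
        "you won't believe the budget"]

def human_label(text):
    # Stand-in for a programmer deciding by eye what counts as clickbait.
    return "believe" in text or "shock" in text

model = labelling_loop(seed, pool, human_label, n_rounds=3)
print(predict(model, "you won't believe this"))
```

Note that the seed set, the labelling function and the number of rounds all enter as arguments; nothing inside the loop generates them.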

The central issue

Could that method work? Possibly. With enough data and enough programming effort, it certainly seems that it could. So, what's the problem?

The problem is that so many stages of the process require choices on the part of the programmers: the initial selection of clickbait examples; the labelling of candidates at the second stage; the number of cycles of iteration and improvement; the choice of explicit hyper-parameters and implicit ones (like how long to run each iteration); the auditing process; the selection of key components. All of these rely on the programmers being able to identify clickbait, or the features of clickbait, when they see them.

And that might not sound bad; if we wanted to identify photos of dogs, for example, we would follow a similar process. But there is a key difference. There is a somewhat objective definition of dog (though beware ambiguous cases). And the programmers, when making choices, will be approximating or finding examples of this definition. But there is no objective, semi-objective, or somewhat objective definition of clickbait.

Why? Because the definition of clickbait depends on assessing the preferences of the human that sees it. It can be roughly defined as "something a human is likely to click on (behaviour), but wouldn't really ultimately want to see (preference)".
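Writing the rough definition down as code makes the dependence on preferences explicit. This is only a sketch: the `Post` fields, the 0.5 thresholds and the example posts are illustrative assumptions, and the `would_endorse` value in particular is exactly the quantity that click data does not supply on its own.

```python
from dataclasses import dataclass

@dataclass
class Post:
    title: str
    click_rate: float     # observable behaviour: fraction of viewers who click
    would_endorse: float  # ASSUMED, not observed: fraction who really wanted to see it

def is_clickbait(post, click_threshold=0.5, endorse_threshold=0.5):
    """Clickbait: likely to be clicked (behaviour) but not ultimately wanted (preference)."""
    likely_clicked = post.click_rate >= click_threshold
    not_wanted = post.would_endorse < endorse_threshold
    return likely_clicked and not_wanted

# Same observed click rate, different assumed preferences, different verdicts.
a = Post("You won't believe what happened next", click_rate=0.8, would_endorse=0.1)
b = Post("Election results announced", click_rate=0.8, would_endorse=0.9)
print(is_clickbait(a), is_clickbait(b))  # True False
```

The two posts are indistinguishable on behaviour; the verdict is driven entirely by the assumed preference value.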

And, importantly, the No Free Lunch theorem applies to humans too. So humans can't deduce preferences or rationality from behaviour, at least, not without making assumptions.

So how do we solve the problem? After all, humans do often deduce the preferences and rationality of other humans, and other humans, including the one being assessed, will often agree with them. How do we do it?

Well, drumroll, we do it by... making assumptions. And since evolution is so very lazy, the assumptions that humans make - about each other's rationality/preference, about their own rationality/preference - are all very similar. Not identical, of course, but compared with a random agent making random assumptions to interpret the behaviour of another random agent, humans are essentially all the same.

This means that, to a large extent, it is perfectly valid for programmers to use their own assumptions when defining clickbait, or in other situations of assessing the values of others. Indeed, until we solve the issue in general, this may be the only way of doing this; it's certainly the only easy way.

The lesson

So, are there any practical consequences of this? Well, the important thing is that programmers realise they are using their own assumptions, and take these into consideration when programming. Even things that they feel might just be "debugging", such as removing obvious failure modes, could be them injecting their assumptions into the system. This has two major consequences:

  1. These assumptions don't form a nice neat category that "carves reality at its joints". Concepts such as "dog" are somewhat ambiguous, but concepts like "human preferences" will be even more so, because they are a series of evolutionary kludges, rather than a single natural thing. Therefore we expect that extrapolating programmer assumptions, or moving to a new distribution, will result in bad behaviour that will have to be patched anew with more assumptions.
  2. There are cases when their assumptions and those of the users may diverge; looking out for these situations is important. This is easier if programmers realise they are making assumptions, rather than approximating objectively true categories.
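One hedged way to look out for the divergence in the second point is to compare the programmers' labels with what users themselves report, group by group. The sketch below assumes such user reports exist and can be bucketed into groups; the function name, the record format and the data are all made up for illustration, and the user reports are themselves interpreted through assumptions.

```python
def disagreement_by_group(records):
    """Fraction of posts per group where programmer and user labels differ.

    records: iterable of (group, programmer_says_clickbait, user_says_clickbait).
    """
    totals, disagreements = {}, {}
    for group, programmer_label, user_label in records:
        totals[group] = totals.get(group, 0) + 1
        if programmer_label != user_label:
            disagreements[group] = disagreements.get(group, 0) + 1
    return {g: disagreements.get(g, 0) / totals[g] for g in totals}

# Made-up records, purely for illustration.
records = [
    ("group_a", True, True),
    ("group_a", False, False),
    ("group_b", True, False),
    ("group_b", True, True),
]
result = disagreement_by_group(records)
print(result)  # {'group_a': 0.0, 'group_b': 0.5}
```

A group with a high disagreement rate is a candidate for a place where the programmers' assumptions and the users' assumptions have come apart.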
