Learning human preferences: optimistic and pessimistic scenarios

by Stuart Armstrong, 18th Aug 2020


In this post and the next, I try to clarify - for myself and for others - the precise practical implications of the "Occam's razor is insufficient to infer the preferences of irrational agents" paper.

Time and again, I've had trouble getting others to understand what that paper implies, and what it doesn't. It's neither irrelevant (like many no-free-lunch theorems), nor is it a radical skepticism/nothing is real/we can't really know anything paper.

I've been having productive conversations with Rebecca Gorman, whom I want to thank for her help (and who phrased things well in terms of latent variables)!

A simple biased agent

Consider the following simple model of an agent:

The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.

The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).
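As a minimal sketch of why the policy alone underdetermines the decomposition, the toy Python example below builds three different (planner, reward) pairs that all produce the same policy. The two-state world, the reward table `REWARD`, and every function name are hypothetical, chosen only for illustration; the three pairs mirror the kind of degenerate decompositions discussed in the Occam's razor paper (fully rational, fully anti-rational, and a planner that ignores a trivial reward).

```python
# Toy illustration (all names and numbers are hypothetical assumptions).
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]

# Hypothetical reward table: REWARD[state][action].
REWARD = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 0.0, "b": 1.0}}


def rational_planner(reward):
    """Pick the reward-maximising action in each state."""
    return {s: max(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


def anti_rational_planner(reward):
    """Pick the reward-minimising action in each state."""
    return {s: min(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


def negate(reward):
    """Return the reward table with every value multiplied by -1."""
    return {s: {a: -r for a, r in acts.items()} for s, acts in reward.items()}


def make_indifferent_planner(policy):
    """Build a planner that ignores the reward and always outputs `policy`."""
    return lambda reward: dict(policy)


# The behaviour we actually get to observe.
observed_policy = rational_planner(REWARD)

ZERO_REWARD = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

# Three very different stories about preferences and biases.
decompositions = {
    "rational, reward R": (rational_planner, REWARD),
    "anti-rational, reward -R": (anti_rational_planner, negate(REWARD)),
    "pure bias, reward 0": (make_indifferent_planner(observed_policy), ZERO_REWARD),
}

# Each story yields exactly the same policy.
policies = {name: planner(reward) for name, (planner, reward) in decompositions.items()}
```

In this toy setting every entry of `policies` equals `observed_policy`, so no amount of behavioural data can choose between the three stories; something beyond the policy is needed to prefer one decomposition over the others.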

Latent and "natural" variables

Let $L$ be a latent variable of the policy $\pi$ - or some variable that can be deduced from $\pi$ in some simple or natural way.

A consequence of the Occam's razor result is that any such $L$ will typically be a mixture of preferences, beliefs, and biases. For if such $L$ tended to be restricted to one of these three components, that would mean that separating them would be possible via latent or simple variables.

So, for example, if we conducted a principal component analysis on $\pi$, we would expect that the components would all be mixes of preferences/beliefs/biases.

The optimistic scenario

To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.

Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc.). Call this labelled data[2] $D$.

The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from the labelled data $D$. Optimistically, these correspond quite closely to what we mean by these categories, at least when combined with the information of $\pi_h$, the policy of human $h$. It is now possible for the algorithm to identify latent or natural variables that lie along the "preferences", "beliefs", and "biases" axes, thus identifying and isolating human preferences.

It seems there's a contradiction here - by definition, the labelled data $D$ does not contain much information, yet separating preferences may require a lot of information. The hope is that $D$ acts as a doorway to other sources of information - such as human psychology papers, Wikipedia, human fiction, and so on. Call this other data $D'$.

The Occam's razor result still applies to the other data $D'$: one of the simplest explanations for $D'$ is to assume that $h$ is always rational and that $D'$ consists of "speech acts" (think of a dishonest politician's speech - you would not want to take the literal content of the speech as correct information). The result still applies even to $\pi_H$, where we take the policies of every human $h$ in the set $H$ of all humans.

However, it is hoped that the labelled data $D$ will allow the algorithm to effectively separate preferences from biases and beliefs. The hope is that $D$ acts as a key to unlock the vast amount of information in the other data $D'$ - that once the algorithm has a basic idea what a preference is, then all the human literature on the meaning of preference becomes usable, not just as speech acts but as actual sources of information, as the algorithm realises the meaning of $D'$ the way we want it to, and realises what is lies/metaphors/exaggerations.

This is what we would hope would happen. Guided by our own intuitions - which have no problem distinguishing preferences in other humans and in ourselves, at least roughly - we may feel that this is likely.

The pessimistic scenario

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.

Consider for example the anchoring bias. I've argued that the anchoring bias is formally very close to being a taste preference.

In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to have different "type signatures".

So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other. It's not uncommon for parts of the brain to reuse other parts for different purposes; the purity moral preference, for example, recycles part of the emotion of disgust. Individual biases and preferences probably similarly use a lot of the same machinery in the brain, making it hard to tell the differences between them.

Thus providing a few examples of preferences/beliefs/biases, the labelled data $D$, is not enough to disentangle them. Here $D$ fails to unlock the meaning of the other data $D'$ - when reading psychology papers, the algorithm sees a lot of behaviour ("this human wrote this paper; I could have predicted that"), but not information relevant to the division between preferences/beliefs/biases.

Pessimism, information, and circular reasoning

It's worth digging into that last point a bit more, since it is key to many people's intuitions in this area. On this website, we find a quote:

"Civil strife is as much a greater evil than a concerted war effort as war itself is worse than peace." (Herodotus)

Taken literally, this would mean civil strife << war << peace. But no-one sensible would take it literally; first of all, we'd want to know if the quote was genuine, we'd want to figure out a bit about Herodotus's background, we'd want to see whether his experience is relevant, what has changed in warfare and human preferences over the centuries, and so on.

So we'd be putting the information into context, and, to do so, we'd be using our own theory of mind, our own knowledge of what a preference is, what beliefs and biases humans typically have...

There's a chicken and egg problem: it's not clear that extra information is much use to the algorithm, without a basic understanding of what preferences/beliefs/biases are. So without a good grasp to get started, the algorithm may not be able to use the extra information - even all the world's information - to get a further understanding. And human outputs - such as psychology literature - are written to be understood unambiguously (-ish) by humans. Thus interpreting them in the human fashion may rely on implicit assumptions that the algorithm doesn't have access to.

It's important to realise that this is not a failure of intelligence on the part of the algorithm. AIXI, the idealised uncomputable superintelligence, will fail at image classification tasks if we give it incorrectly labelled data or don't give it enough examples to resolve ambiguous cases.
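The point that the limitation sits in the data, not in the learner, can be sketched with a toy classifier. The example below is only an illustration under stated assumptions: a one-dimensional nearest-neighbour rule stands in for "any learner", and the "cat"/"dog" labels, the true boundary at 3, and all function and variable names are hypothetical.

```python
# Toy illustration (hypothetical task, data, and names).

def true_label(x):
    """Hypothetical ground truth: values below 3 are cats, the rest dogs."""
    return "cat" if x < 3 else "dog"


def nearest_neighbour(train, x):
    """Return the label of the training point closest to x (ties: first seen)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


# Correctly labelled examples, consistent with true_label.
correct = [(1, "cat"), (2, "cat"), (8, "dog"), (9, "dog")]

# The same points with every label flipped.
mislabelled = [(x, "dog" if y == "cat" else "cat") for x, y in correct]

# Too few examples: nothing pins down where the boundary lies.
sparse = [(1, "cat"), (9, "dog")]

# The learner faithfully reproduces whatever labels it was given.
with_correct_labels = nearest_neighbour(correct, 1.5)
with_flipped_labels = nearest_neighbour(mislabelled, 1.5)

# An ambiguous case: 4 is truly a dog, but it is closer to the lone cat example.
ambiguous_guess = nearest_neighbour(sparse, 4)
ambiguous_truth = true_label(4)
```

Here `with_flipped_labels` disagrees with `with_correct_labels`, and `ambiguous_guess` disagrees with `ambiguous_truth`. A cleverer learner would not obviously do better in either case, because the information needed to get the right answer is simply absent from the data it was given.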

Failure mode of pessimistic scenario

So the failure mode, in the pessimistic scenario, is that the algorithm generates the categories preferences*, beliefs*, and biases*, but that these don't correspond well to actual preferences, beliefs, or biases - at least not as we get beyond the training examples provided (it doesn't help that humans themselves have trouble distinguishing these in many situations!).

So, what the algorithm thinks is a preference may well be a mixture of all three categories. We might correct it by pointing out its mistakes and adding some more examples, but this might only carry it a bit further: whenever it gets to an area where we haven't provided labels, it starts to make large categorisation errors or stumbles upon adversarial examples.

This may feel counter-intuitive, because, for us, extracting preferences feels easy. I'll address that point in the next section, but I'll first note that algorithms finding tasks hard that we find easy is not unusual.

To reiterate: making the algorithm smarter would not solve the problem; the issue (in the pessimistic scenario) is that the three categories are neither well-defined nor well-separated.

Pessimism: humans interpreting other humans

We know that humans can interpret the preferences, beliefs, and biases of other humans, at least approximately. If we can do it so easily, how could it be hard for a smart algorithm to do so?

Moravec's paradox might imply that it would be difficult for an algorithm to do so, but that just means we need a smart enough algorithm.

But the question might be badly posed, in which case infinite smartness would not be enough. For example, imagine that humans looked like this, with the "Human Agency Interpreter" (basically the theory of mind) doing the job of interpreting other humans. The green arrows are there to remind us how much of this is done via empathy: by projecting our own preferences/beliefs onto the human we are considering.

This setup also has an optimistic and a pessimistic scenario. They involve how feasible it is for the algorithm to isolate the "Human Agency Interpreter". In the optimistic scenario, we can use a few examples, point to the Wikipedia page on theory of mind, and the algorithm will extract a reasonable facsimile of the human agency interpreter, and then use that to get a reasonable decomposition of the human algorithm into beliefs/preferences/biases.

In the pessimistic scenario, the Human Agency Interpreter is also twisted up with everything else in the human brain, and our examples are not enough to disentangle it, and the same problem re-appears at this level: there is no principled way of figuring out the human theory of mind, without starting from the human theory of mind.


  1. It may seem odd that there is an arrow going from observations to preferences, but a) human preferences do seem to vary in time and circumstances, and b) there is no clear distinction between observation-dependent and observation-independent preferences. For example, you could have a preference for eating when you're hungry; is this an eating preference that is hunger-dependent, or an eating-when-hungry preference that is independent of any observations? Because of these subtleties, I've preferred to draw the arrow unambiguously going into the preferences node, from the observations node, so that there is no confusion.

  2. This data may end up being provided implicitly, by programmers correcting "obvious mistakes" in the algorithm.
