User Veedrac recently commented:

You have shown that simplicity cannot distinguish $(p, R)$ from $(-p, -R)$, but you have not shown that simplicity cannot distinguish a physical person optimizing competently for a good outcome from a physical person optimizing nega-competently for a bad outcome.

This goes to the heart of an important confusion:

"Agent $A$ has preferences $R$ " is not a fact about the world. It is a stance about $A$ , or an interpretation of $A$ . A stance or an interpretation that we choose to take, for some purpose or reason.

Relevant for us humans is:

We instinctively take a particular preference stance towards other humans, and humans tend to take the same stance towards one another. This makes the stance feel "natural" and intrinsic to the world, when it is not.

The intentional stance

Daniel Dennett defined the intentional stance as follows:

Here is how it works: first you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs. A little practical reasoning from the chosen set of beliefs and desires will in most instances yield a decision about what the agent ought to do; that is what you predict the agent will do.

In the physical stance, we interpret something as being made of atoms and following the laws of physics. In the intentional stance, we see it as being an agent and following some goal. The first allows for good prediction of the paths of planets; the second, for the outcome of playing AlphaZero in a game of Go.

The preference stance (or the (ir)rationality stance [1]) is a more general stance, where you see the object as having preferences, but not necessarily being rational about optimising them.

The preference/(ir)rationality stance

What is the intentional stance for?

In a sense, the intentional stance is exactly the same as the preference stance. Dennett takes an object, treats it as an agent, and splits it into preference and rationality. OK, he assumes that the agent is "rational", but he allows us to "figure out what beliefs the agent ought to have". That, in practice, allows us to model a lot of irrationality if we want to. And I'm fully convinced that Dennett takes biases and other lapses of rationality into account when dealing with other humans.

So, in a sense, Dennett is already taking a preference stance towards the object. And he is doing so for the express purpose of better predicting the behaviour of that object.

What is the preference stance for?

Unlike the intentional stance, the preference stance is not taken for the purpose of better predicting humans. It is instead taken for the purpose of figuring out what the human preferences are - so that we could maximise or satisfy them. The Occam's razor paper demonstrates that, from the point of view of Kolmogorov complexity, taking a good preference stance (i.e. finding plausible preferences to maximise) is not at all the same thing as taking a good (predictive) intentional stance.

But it often feels as if it is; we seem to predict people better when we assume, for example, that they have specific biases or want specific things. Why is this, and how does it seem to get around the result?

Rationality stance vs empathy machine

There are two preference stances that it is easy for humans to take. The first is to assume that an object is a rational agent with a certain preference. Then we can try and predict which action or which outcome would satisfy that preference, and then expect that action/outcome. We do this often when modelling people in economics, or similar mass models of multiple people at once.

The second is to use the empathy machinery that evolution has developed for us, and model the object as being human. Applying this to the weather and the natural world, we anthropomorphised and created gods. Applying it to other humans (and to ourselves) gives us quite decent predictive power.

I suspect this is what underlies Veedrac's intuition. For if we apply our empathy machine to fellow humans, we get something that is far closer to a "goodness optimiser", albeit a biased one, than to a "badness nega-optimiser".

But this doesn't say that the first is more likely, or more true, about our fellow humans. It says that the easiest stance for us to take is to treat other humans in this way. And this is not helpful, unless we manage to get our empathy machine into an AI. That is part of the challenge.

And this brings us back to why the empathy machine seems to make better predictions about humans. Our own internal goals, the goals that we think we have on reflection, and how we expect people (including us) to behave given those goals... all of those coevolved. It seems that it was easier for evolution to use our internal goals (see here for what I mean by these) and our understanding of our own rationality to do predictions, rather than to run our goals and our predictions as two entirely separate processes.

That's why, when you use empathy to figure out someone's goals and rationality, this also allows you to better predict them. But this is a fact about you (and me), not about the world. Just as "Thor is angry" is actually much more complex than electromagnetism, our prediction of other people via our empathy machine is simpler for us to do - but is actually more complex for an agent that doesn't already have this empathy machinery to draw on.

So assuming everyone is rational is a simpler explanation of human behaviour than our empathy machinery - at least, for generic non-humans.

Or, to quote myself:

A superintelligent AI could have all the world’s video feeds, all of Wikipedia, all social science research, perfect predictions of human behaviour, be able to perfectly manipulate humans... And still conclude that humans are fully rational.

It would not be wrong.
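To make the point concrete, here is a minimal Python sketch, using the $(p, R)$ notation from Veedrac's comment, of how one observed behaviour can be split into a planner and a reward in several incompatible ways. Everything in it is a toy assumption made up for illustration: the states, actions and reward values, the `biased` planner, and the helper names `flip` and `indicator_reward`. It is not taken from the Occam's razor paper.

```python
from typing import Callable, Dict

State = str
Action = str
Reward = Dict[State, Dict[Action, float]]  # reward[state][action]
Policy = Dict[State, Action]               # the behaviour we observe
Planner = Callable[[Reward], Policy]       # turns a reward into behaviour


def rational(reward: Reward) -> Policy:
    """Pick the highest-reward action in every state."""
    return {s: max(acts, key=acts.get) for s, acts in reward.items()}


def anti_rational(reward: Reward) -> Policy:
    """Pick the lowest-reward action in every state."""
    return {s: min(acts, key=acts.get) for s, acts in reward.items()}


def biased(reward: Reward) -> Policy:
    """Hypothetical flawed planner: optimal everywhere except when 'tired'."""
    best, worst = rational(reward), anti_rational(reward)
    return {s: worst[s] if s == "tired" else best[s] for s in reward}


def negate(reward: Reward) -> Reward:
    """Flip the sign of every reward value (R -> -R)."""
    return {s: {a: -r for a, r in acts.items()} for s, acts in reward.items()}


def flip(planner: Planner) -> Planner:
    """Build -p from p: the planner that acts as p would on the negated reward."""
    return lambda reward: planner(negate(reward))


def indicator_reward(policy: Policy, template: Reward) -> Reward:
    """Reward 1 for whatever the policy does, 0 for every other action."""
    return {s: {a: float(a == policy[s]) for a in acts}
            for s, acts in template.items()}


# Toy "empathy stance" preferences (illustrative values only).
R: Reward = {
    "hungry": {"eat": 1.0, "fast": 0.0},
    "tired": {"sleep": 1.0, "scroll": 0.2},
}

# The behaviour under the empathy-style reading: a biased agent pursuing R.
observed = biased(R)

# Three different stances that all reproduce exactly the same behaviour.
decompositions = {
    "(p, R): biased planner, reward R": biased(R),
    "(-p, -R): flipped planner, reward -R": flip(biased)(negate(R)),
    "fully rational, reward = what they did": rational(indicator_reward(observed, R)),
}

all_agree = all(policy == observed for policy in decompositions.values())

for name, policy in decompositions.items():
    print(name, "->", policy)
print("all stances predict the same behaviour:", all_agree)
```

In this toy setting the three decompositions cannot be told apart by behaviour alone. The last one is the "humans are fully rational" reading: it predicts exactly as well as the empathy-style reading, but it assigns the agent quite different preferences.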

[1] I'll interchangeably call it a preference or an (ir)rationality stance, since given preferences, the (ir)rationality can be deduced from behaviour, and vice versa.
