Anthropomorphisation vs value learning: type 1 vs type 2 errors

by Stuart Armstrong, 22nd Sep 2020



The Occam's razor paper showed that one cannot deduce an agent H's reward function (R_H, using the notation from that paper) or their level of rationality (p_H) by observing their behaviour, or even by knowing their policy (π_H). Subsequently, in a LessWrong post, it was demonstrated that even knowing the agent's full algorithm (call this A_H) would not be enough to deduce either R_H or p_H individually.
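As a rough illustration of why the policy alone cannot pin down the reward and the level of rationality separately, here is a minimal sketch. Everything in it (the two-state world, the reward table, the three planner functions) is a hypothetical toy made up for this example, not code from the paper. It shows three different (planner, reward) pairs that all produce exactly the same policy.

```python
# Toy world: two states, two actions. All names and numbers are
# hypothetical, chosen only for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

REWARD_TABLE = {
    ("s0", "left"): 1.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 2.0,
}

def reward(state, action):
    return REWARD_TABLE[(state, action)]

def negated_reward(state, action):
    return -reward(state, action)

def zero_reward(state, action):
    return 0.0

def rational_planner(reward_fn):
    """Pick the reward-maximising action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward_fn(s, a)) for s in STATES}

def anti_rational_planner(reward_fn):
    """Pick the reward-minimising action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward_fn(s, a)) for s in STATES}

def indifferent_planner(reward_fn):
    """Ignore the reward entirely and always output one fixed policy."""
    return {"s0": "left", "s1": "right"}

# Three incompatible stories about the agent...
policy_a = rational_planner(reward)               # rational, wants reward
policy_b = anti_rational_planner(negated_reward)  # anti-rational, wants the opposite
policy_c = indifferent_planner(zero_reward)       # no preferences at all

# ...that are indistinguishable from the policy alone.
print(policy_a == policy_b == policy_c)
```

Observing the policy, however completely, gives no way of choosing between these three decompositions; some extra assumption has to come from the observer.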

In an online video, I argued that the reason humans can do this when assessing other humans is that we have an empathy module/theory of mind, E_H, that allows us to model the rationality and motives of other humans. These modules are, crucially, quite similar from human to human, and when we turn them on ourselves, the results are similar to what happens when others assess us. So, roughly speaking, there is an approximate 'what humans want', at least in typical environments[1], that most humans can agree on.

I struggled to convince people that, without this module, we would fail to deduce the motives of other humans. It is hard to imagine what we would be like if we were fundamentally different.

But there is an opposite error that people know very well: anthropomorphisation. In this situation, humans attribute motives to the behaviour of the wind, the weather, the stars, the stock market, cute animals, uncute animals...

So the same module that allows us to, somewhat correctly, deduce the motivations of other humans, also sets us up to fail for many other potential agents. If we started 'weakening' E_H, then we would reduce the number of anthropomorphisation errors we made, but we'd start making more errors about actual humans.
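The trade-off can be sketched with a deliberately crude model. It assumes, purely for illustration, that the module attributes motives to anything whose "agent-likeness" score is above some threshold, and that 'weakening' the module means raising that threshold. The scores, the labels and the `count_errors` helper are all hypothetical.

```python
# (agent-likeness score, is_human). Hypothetical numbers, for illustration only.
OBSERVATIONS = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.70, False),  # e.g. a cute animal
    (0.50, False),  # e.g. the stock market
    (0.30, False),  # e.g. the weather
    (0.10, False),  # e.g. the stars
]

def count_errors(threshold):
    """Return (type 1 errors, type 2 errors) at a given threshold.

    Type 1: motives attributed to a non-human (anthropomorphisation).
    Type 2: motives not attributed to an actual human.
    """
    type_1 = sum(1 for score, is_human in OBSERVATIONS
                 if score >= threshold and not is_human)
    type_2 = sum(1 for score, is_human in OBSERVATIONS
                 if score < threshold and is_human)
    return type_1, type_2

# Raising the threshold stands in for 'weakening' the module.
for threshold in (0.2, 0.45, 0.65, 0.9):
    type_1, type_2 = count_errors(threshold)
    print(f"threshold={threshold}: type 1 = {type_1}, type 2 = {type_2}")
```

In this toy, no threshold removes both kinds of error at once: each step that removes an anthropomorphisation error adds a misread human.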

So our E_H can radically fail at assessing the motivations of non-humans, and also sometimes fails at assessing the motivations of humans. Therefore I'm relatively confident in arguing that E_H is not some "a priori" object, coming from pure logic, but is contingent and dependent on human evolution. If we met an alien race, we would likely assess their motives in ways they would find incorrect, and they'd assess our motives in ways we would find incorrect, no matter how much information either of us had.

  1. See these posts for how we can and do extend this beyond typical environments.



7 comments

I agree with the idea that we empathetically simulate people ("simulation theory") ... and I think we have innate social emotions if and only if we use that module when thinking about someone. So I brought that up here as a possible path to "finding human goals" in a world-model, and even talked about how dehumanization and anthropomorphization are opposite errors, in agreement with you. :-D

I think "weakening E_H" isn't quite the right perspective, or at least not the whole story, at least when it comes to humans. We have metacognitive powers, and there are emotional / affective implications to using E_H vs not using E_H, and therefore using E_H is at least partly a decision rather than a simple pattern-matching process. If you are torturing someone, you'll find that it's painful to use E_H, so you quickly learn to stop using it. If you are playing make-believe with a teddy bear, you find that it's pleasurable to use E_H on the teddy bear, so you do use it.

So dehumanization is modeling a person without using E_H. When you view someone through a dehumanization perspective, you lose your social emotions. It no longer feels viscerally good or bad that they are suffering or happy.

But I do not think that when you're dehumanizing someone, you lose your ability to talk coherently about their motivations. Maybe we're a little handicapped in accurately modeling them, but still basically competent, and can get better with practice. Like, if a prison guard is dehumanizing the criminals, they can still recognize that a criminal is "trying" to escape. 

I guess the core issue is whether motivation is supposed to have a mathematical definition or a normal-human-definition. A normal-human-definition of motivation is perfectly possible without any special modules, just like the definition of every other concept like "doorknob" is possible without special modules. You have a general world-modeling capability that lumps sensory patterns into categories/concepts. Then you're in ten different scenarios where people use the word "doorknob". You look for what those scenarios have in common, and it's that a particular concept was active in your mind, and then you attach that concept to the word "doorknob". Those ten examples were probably all central examples of "doorknob". There are also weird edge cases, where if you ask me "is that a doorknob?" I would say "I don't know, what exactly do you mean by that question?".

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation. We just need to be exposed to ten central examples, and we'll find the concept(s) that are most activated by those examples, and call that concept "motivation". Here's one of the ten examples: "If a person, in a psychologically-normal state of mind, says 'I want to do X', then they are probably motivated to do X."

Of course, just like doorknobs, there are loads of edge cases where if you ask a normal person "what is Alice's motivation here?" they'll say "I don't know, what exactly do you mean by that question?".

A mathematical definition of human motivation would, I imagine, have to be unambiguous and complete, with no edge cases. From a certain perspective, why would you ever think that such a thing even exists? But if we're talking about MDPs and utility functions, or if we're trying to create a specification for an AGI designer to design to, this is a natural thing to hope for and talk about.

I think if you gave an alien the same ten labelled central examples of "human motivation", plus lots of unlabeled information about humans and videos of humans, they could well form a similar concept around it (in the normal-human-definition sense, not the mathematical-definition sense), or at least their concept would overlap ours as long as we stay very far away from the edge cases. That's assuming the aliens' world-modeling apparatus is at least vaguely similar to ours, which I think is plausible, since we live in the same universe. But it is not assuming that the aliens' motivational systems and biases are anything like ours.

Sorry if I'm misunderstanding anything :-)

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on them, even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions.)

It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping.

By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases.

Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases.

I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)")

I don't see how the paper rules out the possibility of building an unlabeled predictive model of humans, and then getting a bunch of examples labeled "This is human motivation", and building a fuzzy concept around those examples. The more labeled examples there are, the more tolerant you are of different inductive biases in the learning algorithm. In the limit of astronomically many labeled examples, you don't need a learning algorithm at all, it's just a lookup table.

This procedure has nothing to do with fitting human behavior into a model of a boundedly-rational agent pursuing a utility function. It's just an effort to consider all the various things humans do with their brains and bodies, and build a loose category in that space using supervised learning. Why not?
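The procedure described in this comment can be sketched roughly as follows. This is only a toy under loud assumptions: the two-number feature vectors stand in for whatever internal representation an unlabeled predictive model of humans would provide, and the `label` function and its nearest-example fallback are one arbitrary choice of inductive bias among many.

```python
from math import dist

# Hypothetical labeled examples: feature vector -> everyday label.
# The vectors are placeholders for states of some predictive model.
LABELED = {
    (0.9, 0.1): "wants to take a walk",
    (0.8, 0.2): "wants to take a walk",
    (0.1, 0.9): "wants to braid their hair",
    (0.2, 0.8): "wants to braid their hair",
}

def label(situation, labeled=LABELED):
    """Label a situation using the labeled examples.

    If the exact situation has been labeled, this is a pure lookup table.
    Otherwise fall back on an inductive bias: copy the label of the
    nearest labeled example (Euclidean distance).
    """
    if situation in labeled:
        return labeled[situation]
    nearest = min(labeled, key=lambda known: dist(known, situation))
    return labeled[nearest]

print(label((0.9, 0.1)))  # seen before: answered by lookup
print(label((0.7, 0.3)))  # unseen: answered by the nearest example
```

The more of the space the labeled examples cover, the less the fallback rule matters, which is the sense in which the limit is just a lookup table. Situations far from every example are the edge cases where the answer depends almost entirely on the fallback.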

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.