Anthropomorphisation vs value learning: type 1 vs type 2 errors

Stuart_Armstrong

Anthropomorphisation vs value learning: type 1 vs type 2 errors — AI Alignment Forum


22nd Sep 2020


The Occam's razor paper showed that one cannot deduce an agent's reward function ($R_{H}$, using the notation from that paper) or their level of rationality ($p_{H}$) by observing their behaviour, or even by knowing their policy ($\pi_{H}$). Subsequently, a LessWrong post demonstrated that even knowing the agent's full algorithm (call this $a_{H}$) would not be enough to deduce either $R_{H}$ or $p_{H}$ individually.
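A toy sketch may make the degeneracy concrete. It assumes a one-step setting with two states and two actions, and treats $p_{H}$ as a "planner" that maps a reward function to a policy. The names `rational_planner`, `anti_rational_planner`, and `make_indifferent_planner` are hypothetical and chosen only for this illustration; they are not taken from the paper. Three very different (planner, reward) pairs produce exactly the same policy, so the policy alone cannot tell them apart.

```python
from typing import Callable, Dict, Tuple

# Toy one-step world: all names and numbers here are illustrative assumptions.
STATES = ["home", "office"]
ACTIONS = ["rest", "work"]

Reward = Dict[Tuple[str, str], float]   # (state, action) -> reward
Policy = Dict[str, str]                 # state -> action
Planner = Callable[[Reward], Policy]    # stands in for p_H


def rational_planner(reward: Reward) -> Policy:
    """Pick the reward-maximising action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_rational_planner(reward: Reward) -> Policy:
    """Pick the reward-minimising action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def make_indifferent_planner(fixed_policy: Policy) -> Planner:
    """Build a planner that ignores the reward and always returns fixed_policy."""
    def planner(reward: Reward) -> Policy:
        return dict(fixed_policy)
    return planner


# A candidate reward R, its negation, and the all-zero reward.
R: Reward = {
    ("home", "rest"): 1.0, ("home", "work"): 0.0,
    ("office", "rest"): 0.0, ("office", "work"): 1.0,
}
neg_R: Reward = {key: -value for key, value in R.items()}
zero_R: Reward = {key: 0.0 for key in R}

observed_policy = rational_planner(R)

# Three (planner, reward) decompositions of the same observed behaviour.
decompositions = {
    "rational, wants R": (rational_planner, R),
    "anti-rational, wants -R": (anti_rational_planner, neg_R),
    "indifferent, wants nothing": (make_indifferent_planner(observed_policy), zero_R),
}

policies = {name: planner(reward) for name, (planner, reward) in decompositions.items()}

for name, policy in policies.items():
    print(f"{name}: {policy}")
```

All three lines print the same policy, even though the three pairs disagree completely about what the agent wants and how rational it is.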

In an online video, I argued that the reason humans can do this when assessing other humans is that we have an empathy module/theory of mind $E_{H}$ that allows us to model the rationality and motives of other humans. These $E_{H}$ are, crucially, quite similar from human to human, and when we turn them on ourselves, the results are similar to what happens when others assess us. So, roughly speaking, there is an approximate 'what humans want', at least in typical environments^[1], that most humans can agree on.

I struggled to convince people that, without this module, we would fail to deduce the motives of other humans. It is hard to imagine what we would be like if we were fundamentally different.

But there is an opposite error that people know very well: anthropomorphisation. In this situation, humans attribute motives to the behaviour of the wind, the weather, the stars, the stock market, cute animals, uncute animals...

So the same module that allows us to, somewhat correctly, deduce the motivations of other humans, also sets us up to fail for many other potential agents. If we started 'weakening' $E_{H}$ , then we would reduce the number of anthropomorphisation errors we made, but we'd start making more errors about actual humans.
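As a rough illustration of this trade-off, here is a toy sketch under strong simplifying assumptions: each entity emits a single made-up "agency cue" score, and $E_{H}$ is reduced to a threshold that attributes motives whenever the score is high enough. The scores, the threshold values, and the function name `count_errors` are all hypothetical. Raising the threshold plays the role of 'weakening' $E_{H}$.

```python
# Made-up "agency cue" scores; real empathy modules are not a single number.
HUMANS = [0.9, 0.8, 0.7, 0.55, 0.4]
NON_AGENTS = [0.6, 0.5, 0.3, 0.2, 0.1]  # e.g. wind, weather, stock market


def count_errors(threshold: float) -> tuple:
    """Return (false_positives, false_negatives) for a given threshold.

    A false positive attributes motives to a non-agent (anthropomorphisation).
    A false negative fails to attribute motives to an actual human.
    """
    false_positives = sum(score >= threshold for score in NON_AGENTS)
    false_negatives = sum(score < threshold for score in HUMANS)
    return false_positives, false_negatives


# A higher threshold corresponds to a 'weaker' E_H.
results = {threshold: count_errors(threshold) for threshold in (0.25, 0.45, 0.65)}

for threshold, (fp, fn) in results.items():
    print(f"threshold={threshold}: anthropomorphisation errors={fp}, missed humans={fn}")
```

With these made-up numbers, the anthropomorphisation errors fall from 3 to 2 to 0 as the threshold rises, while the errors about actual humans rise from 0 to 1 to 2. No threshold removes both kinds of error, because the two score ranges overlap.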

So our $E_{H}$ can radically fail at assessing the motivations of non-humans, and also sometimes fails at assessing the motivations of humans. Therefore I'm relatively confident in arguing that $E_{H}$ is not some "a priori" object, coming from pure logic, but is contingent and dependent on human evolution. If we met an alien race, we would likely assess their motives in ways they would find incorrect, and they'd assess our motives in ways we would find incorrect, no matter how much information either of us had.

[1] See these posts for how we can and do extend this beyond typical environments.


Steven Byrnes (6y)

I agree with the idea that we empathetically simulate people ("simulation theory") ... and I think we have innate social emotions if and only if we use that module when thinking about someone. So I brought that up here as a possible path to "finding human goals" in a world-model, and even talked about how dehumanization and anthropomorphization are opposite errors, in agreement with you. :-D

I think "weakening $E_{H}$" isn't quite the right perspective, or at least not the whole story, at least when it comes to humans. We have metacognitive powers, and there are emotional / affective implications to using $E_{H}$ vs not using $E_{H}$, and therefore using $E_{H}$ is at least partly a decision rather than a simple pattern-matching process. If you are torturing someone, you'll find that it's painful to use $E_{H}$, so you quickly learn to stop using it. If you are playing make-believe with a teddy bear, you find that it's pleasurable to use $E_{H}$ on the teddy bear, so you do use it.

So dehumanization is modeling a person without using $E_{H}$. When you view someone through a dehumanization perspective, you lose your social emotions. It no longer feels viscerally good or bad that they are suffering or happy.

But I do not think that when you're dehumanizing someone, you lose your ability to talk coherently about their motivations. Maybe we're a little handicapped in accurately modeling them, but still basically competent, and can get better with practice. Like, if a prison guard is dehumanizing the criminals, they can still recognize that a criminal is "trying" to escape.

I guess the core issue is whether motivation is supposed to have a mathematical definition or a normal-human-definition. A normal-human-definition of motivation is perfectly possible without any special modules, just like the definition of every other concept like "doorknob" is possible without special modules. You have a general world-modeling capability that lumps sensory patterns into categories/concepts. Then you're in ten different scenarios where people use the word "doorknob". You look for what those scenarios have in common, and it's that a particular concept was active in your mind, and then you attach that concept to the word "doorknob". Those ten examples were probably all central examples of "doorknob". There are also weird edge cases, where if you ask me "is that a doorknob?" I would say "I don't know, what exactly do you mean by that question?".

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation. We just need to be exposed to ten central examples, and we'll find the concept(s) that are most activated by those examples, and call that concept "motivation". Here's one of the ten examples: "If a person, in a psychologically-normal state of mind, says 'I want to do X', then they are probably motivated to do X."

Of course, just like doorknobs, there are loads of edge cases where if you ask a normal person "what is Alice's motivation here?" they'll say "I don't know, what exactly do you mean by that question?".

A mathematical definition of human motivation would, I imagine, have to be unambiguous and complete, with no edge cases. From a certain perspective, why would you ever think that such a thing even exists? But if we're talking about MDPs and utility functions, or if we're trying to create a specification for an AGI designer to design to, this is a natural thing to hope for and talk about.

I think if you gave an alien the same ten labelled central examples of "human motivation", plus lots of unlabeled information about humans and videos of humans, they could well form a similar concept around it (in the normal-human-definition sense, not the mathematical-definition sense), or at least their concept would overlap ours as long as we stay very far away from the edge cases. That's assuming the aliens' world-modeling apparatus is at least vaguely similar to ours, which I think is plausible, since we live in the same universe. But it is not assuming that the aliens' motivational systems and biases are anything like ours.

Sorry if I'm misunderstanding anything :-)

Stuart_Armstrong (6y)

> We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it, even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions.)

Steven Byrnes (6y)

It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping.

By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases.

Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases.

I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "~~boundedly-rational~~ agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not ~~boundedly-rational~~ agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)")

I don't see how the paper rules out the possibility of building an unlabeled predictive model of humans, and then getting a bunch of examples labeled "This is human motivation", and building a fuzzy concept around those examples. The more labeled examples there are, the more tolerant you are of different inductive biases in the learning algorithm. In the limit of astronomically many labeled examples, you don't need a learning algorithm at all, it's just a lookup table.

This procedure has nothing to do with fitting human behavior into a model of a ~~boundedly-rational~~ agent pursuing a utility function. It's just an effort to consider all the various things humans do with their brains and bodies, and build a loose category in that space using supervised learning. Why not?

Stuart_Armstrong (6y)

> that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Steven Byrnes (6y)

Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?

Stuart_Armstrong (6y)

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Steven Byrnes (6y)

Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.
