In response to my post contrasting value learning with anthropomorphisation, steve2152 brought up the fact that dehumanisation can be seen as the opposite of anthropomorphisation.
I agree with this insight, but only when dehumanisation causes errors of interpretation. I was using empathy in the sense of "insight into the other agent", rather than "sympathy with the other agent".
In practice, dehumanisation does tend to cause errors. We see outgroups as more homogeneous, coherent, and organised than they actually are. Despite the suave psychopaths depicted in movies, psychopaths tend to be less effective at achieving their goals (as evidenced by the large number of psychopaths in prison). Torturers are less effective at extracting true information than classical interrogators.
Now, it's not a universal law by any means, but it does seem that dehumanisation can often lead to errors, and from that perspective can be seen as a failure of value learning.
The meaning of errors
"Objection! Hold on just a minute!" screams the convenient strawman I have just constructed.
"You've claimed that 'agent's goals' are interpretations by the outside observer; that you can model a human as perfectly rational, without being wrong. You've claimed that this is 'structured white box knowledge', which can't be deduced from the agent's policy or its algorithm."
"Given that, how can you claim that anyone 'fails' at interpreting the goals of others, or that they make 'errors'?"
This is a very valid point, strawman, but I've also pointed out that human theory of mind/empathy is very similar from human to human, and tends to agree with how we interpret our own goals. Because of this, there is a rough "universal human theory of mind", i.e. a universal way of going from human policy to human preferences.
When I'm talking about errors, I'm talking about deviations from this ideal.
Because human theories of mind do not agree perfectly, there will always be an irreducible level of uncertainty in this ideal, but there is agreement on the broad strokes of it.