EDIT: This is a post about an alien mind being unlearnable in practice. As a reminder, theory of mind is unlearnable in theory, as stated here - there is more information in "preferences + (ir)rationality" than there is in "behaviour", "policy", or even "complete internal brain structure". This information gap must be covered by assumptions (or "labelled data", in CS terms) of one form or another - assumptions that cannot be deduced from observation. It is unclear whether we need only a few trivial assumptions or a lot of detailed and subtle ones. Hence posts like this one, looking at the practicality angle.
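To make the information gap concrete, here is a minimal sketch, assuming a toy one-step decision with two hypothetical actions. The planner functions and reward tables are illustrative inventions, not anything from the original argument. The sketch shows that a fully rational agent with one reward and a fully anti-rational agent with the negated reward produce exactly the same behaviour, so behaviour alone cannot tell them apart.

```python
from typing import Dict, List

# Hypothetical action set for a one-step decision.
ACTIONS: List[str] = ["praise", "mock"]

def rational_planner(reward: Dict[str, float]) -> str:
    """Pick the action with the highest reward."""
    return max(ACTIONS, key=lambda a: reward[a])

def anti_rational_planner(reward: Dict[str, float]) -> str:
    """Pick the action with the lowest reward."""
    return min(ACTIONS, key=lambda a: reward[a])

# Hypothetical reward: the agent genuinely prefers praising.
likes_praise = {"praise": 1.0, "mock": 0.0}
# Negated reward: the agent genuinely prefers mocking.
likes_mocking = {a: -r for a, r in likes_praise.items()}

# Two very different (preferences, rationality) pairs...
policy_a = rational_planner(likes_praise)
policy_b = anti_rational_planner(likes_mocking)

# ...that yield the same observable behaviour.
print(policy_a, policy_b)
```

Any observer who sees only the chosen action needs an extra assumption about the planner before the reward can be read off.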
I suggested that an alien "theory of mind" might be unlearnable; Rohin Shah challenged this conclusion, asking whether a theory of mind was truly unlearnable, even for a very intelligent Alice. Let's dig into this concept for a bit.
There is, of course, a weak and a strong version of the unlearnability hypothesis. The strong version is that Alice, even with infinite time and total rationality, couldn't learn an alien theory of mind. The weaker version is that a smart and motivated Alice, with a lot of resources and data, couldn't learn an alien theory of mind in reasonable time.
You can nuance both of those by wondering how much of the theory of mind is unlearnable. It doesn't really matter if a few less important bits are unlearnable. So the real question is, how hard is it to learn enough alien theory of mind, with enough data and effort? We might also ask whether the learning process is interactive or non-interactive: does Alice merely observe the alien natives, or is there a conversation going where the aliens try to correct her interpretations?
Unfortunately, we don't have a convenient alien civilization on hand to test this (and, even if we did, we might be unsure whether we'd really understood their theory of mind, or just thought that we did). So instead, let's imagine an alien Alice - aAlice - who is trying to learn the human theory of mind, and see how she might go astray.
It won't take long for aAlice to realise that there is a difference between what humans say publicly, and what we say privately. Also, there is a difference between what we say under the impact of strong emotion, and what we say when calm and relaxed.
She concludes, naturally (as this is close to how her species behaves), that our authentic statements are those given in public, when we are under the sway of strong emotions. She will find quite a lot of evidence for her position. For example, some people will calmly write about the "authenticity" of strong emotion; aAlice interprets this as: "See? Even in their irrational mode, they sometimes let slip a bit of genuine information."
She can point to other reasons for the correctness of her interpretation. For example, humans often publicly praise powerful people, while mocking them behind their back. These humans also go out of their way to be servile to the powerful humans. aAlice concludes, from the "revealed preference" perspective, that our public praise is the correct interpretation, as that is what is compatible with our behaviour. The private mocking must be some hypocritical "speech act", maybe used for social bonding.
Of course, there is a lot of variety in human public-emotional speech, and a lot of wild contradictions. If you point this out to aAlice, she would respond "yes, I know; aren't humans a fascinating species? I have several theories that I'm developing, to explain their complex preferences." She might also point out that private-calm speech is also varied and contradictory; according to her theories - meticulously developed through observation and experimentation - the variations and contradictions in private-calm speech are much more of a problem than those in public-emotional speech.
Could we convince aAlice that she's wrong; that private-calm speech is much closer to our true preferences than public-emotional speech is? The true picture is much more nuanced than that, of course, but if we can't communicate the basic facts, we can forget about transmitting the nuances.
How would we transmit that information? Our first instinct would be to calmly explain this to her, preferably without too many different people around listening in and chiming in. This approach she would reject immediately, of course, as she already has concluded that private-calm speech is inauthentic.
The above paragraph means that aAlice would have a very hard time concluding she was wrong, in the non-interactive situation. Most of our deep musings about our true preferences are in the private-calm setting, so would be ignored by aAlice. Can our standard public-emotional pronouncements, filtered by aAlice's complex interpretations, ever convince her to take our private-calm statements more seriously? That seems unlikely.
But, back to the interactive setting. We might realise that our explanations to aAlice are not working. This realisation might take some time, as aAlice might calmly and privately agree with us when we explain where she is wrong (she "knows" that private-calm statements carry no weight, so she just follows the social conventions of calmly agreeing to statements like "rationality requires careful thought").
Out of consideration to us, she would be careful to state her true conclusions and beliefs only in public-emotional ways. Thus it might take us a long while to figure out aAlice's true beliefs about us. We'd also need to do a lot of interpretation of aAlice's goals: from our perspective, aAlice being benevolent while taking our public-emotional statements as true might be indistinguishable from her being whimsical while taking our private-calm statements as true.
But let's assume that we have somehow understood aAlice, in the same way that she has failed to understand us. Can we correct her misapprehension? Our next attempt might be to communicate our corrections in a public-emotional way. But this would be problematic. First of all, in the public-emotional sphere, there will be other humans stating their opinions and contradicting ours. aAlice has no reason to pay more attention to our pronouncements.
Indeed, she has reason to pay less attention to our pronouncements, because we will have privately and calmly concluded that we needed to express private-calm sentiments to aAlice in public-emotional ways. This will make for very odd and inauthentic public-emotional pronouncements. And this is where nuance will sting us. We know, as does aAlice, that the public-emotional vs private-calm dichotomy is not fully correct, just a rough approximation. aAlice is therefore likely to add nuance to her interpretation, and set aside these odd and inauthentic public-emotional pronouncements, ignoring them entirely.
This is not helped by the fact that we have a relatively poor grasp of our own theory of mind (see Moravec's paradox, amongst others). Many aspects of our minds and culture only become obvious to us when we encounter beings with different minds and cultures. So a lot of what we will be trying to communicate to aAlice, at least initially, will be incorrect or underspecified. This will give her another reason to reject our attempts at correction, and to build a new elaboration in her human theory of mind, where she adds a term saying "public-emotional expressions of private-calm sentiments are as inauthentic as private-calm expressions themselves."
So our explanations have increased aAlice's misunderstanding, and made it actually harder for us to correct her. This is one of the reasons that anthropologists use methods like participant observation (becoming integrated in the culture they are studying) rather than simply asking members of that culture questions. If we don't have an understanding of the culture (an understanding derived mostly from using our own theory of mind during the participation process), then we can't know what the people are likely to be honest about, and in what context. Indeed, we might not even understand what the words mean to them, let alone whether they're being honest with them.
So, is the alien theory of mind problem unsolvable? I'm not sure. Like any method of extracting preferences from behaviour, it relies on assumptions, assumptions that cannot be derived from observations. The optimistic perspective is that we only need a few key assumptions, and then a lot of observation and anthropology will suffice to fill in the holes. But the aAlice example above is a cautionary tale; we may need much stronger assumptions than we expect, before two alien species can interpret each other correctly.
And, bringing that all back to AI, we may need stronger assumptions than we expect, before an AI can deduce our preferences from observation.
Notice that aAlice's new elaboration (that public-emotional expressions of private-calm sentiments are as inauthentic as private-calm expressions themselves) is actually true: the level of authenticity of our private-calm expressions is roughly the same as that of the public-emotional ones we have constructed specifically for aAlice. So where theory of mind is concerned, adding true statements can sometimes make misinterpretations worse.
Fwiw, in this story I find myself surprised at aAlice's confidence in her theory. If I were telling a story about an unlearnable theory of mind, I'd be leaning on huge uncertainty that prevents aAlice from doing anything.
It's an interesting question as to whether aAlice is actually overconfident. Her predictions about human behaviour may be spot on, at this point - much better than human predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.
I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor".
Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) AI has a hardcoded template of "Boltzmann rational agent", (2) AI tries to match that template to supervisor as best as it can, (3) AI tries to fulfill the inferred goals of the supervisor? Then this post would be saying that we should be open to the possibility that the "best fit" of this template would be very wrong, even if we allow CIRL-like interaction. But I would say that the real problem in this scenario is that the hardcoded template stinks, and we need a better hardcoded template, or else we shouldn't be using this approach in the first place, at least not by itself. I guess that's "obvious" to me, but it's nice to have this concrete example of how it can go wrong, so thanks for that :-)
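As a rough illustration of how such a template can fit well and still mislead, here is a minimal sketch, assuming a single-state Boltzmann model with a fixed `beta` and made-up observation counts. The function names, the two actions, and all the numbers are hypothetical. The same template is fitted once to public-emotional observations and once to private-calm ones; each fit reproduces its data, yet the two inferred preference orderings are opposite, so choosing which context to trust is an assumption the template itself does not supply.

```python
import math
from collections import Counter
from typing import Dict, List

def boltzmann_policy(reward: Dict[str, float], beta: float) -> Dict[str, float]:
    """P(a) proportional to exp(beta * R(a))."""
    weights = {a: math.exp(beta * r) for a, r in reward.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def fit_reward(observed: List[str], beta: float) -> Dict[str, float]:
    """Maximum-likelihood reward for a one-state Boltzmann agent with fixed beta.

    The fit is R(a) = log(freq(a)) / beta, shifted so the best action has reward 0.
    Actions never observed are simply left out of the result.
    """
    counts = Counter(observed)
    n = len(observed)
    raw = {a: math.log(c / n) / beta for a, c in counts.items()}
    top = max(raw.values())
    return {a: r - top for a, r in raw.items()}

# Hypothetical observations of how people talk about a powerful person.
public_emotional = ["praise"] * 9 + ["mock"] * 1
private_calm = ["praise"] * 2 + ["mock"] * 8

fit_public = fit_reward(public_emotional, beta=1.0)
fit_private = fit_reward(private_calm, beta=1.0)

# Each fit reproduces the behaviour it was trained on...
refit_public = boltzmann_policy(fit_public, beta=1.0)

# ...but the inferred "true preference" flips with the chosen context.
best_public = max(fit_public, key=fit_public.get)
best_private = max(fit_private, key=fit_private.get)
print(best_public, best_private)
```

Nothing in either fit signals that it is the wrong one; that information has to come from outside the template, for instance from labelled examples.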