How an alien theory of mind might be unlearnable

I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor".

Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) the AI has a hardcoded template of a "Boltzmann-rational agent", (2) the AI tries to match that template to the supervisor as best it can, (3) the AI tries to fulfill the inferred goals of the supervisor? Then this post would be saying that we should be open to the possibility that the "best fit" of this template would be very wrong, even if we allow CIRL-like interaction. But I would say that the real problem in this scenario is that the hardcoded template stinks, and we need a better hardcoded template, or else we shouldn't be using this approach in the first place, at least not by itself. I guess that's "obvious" to me, but it's nice to have this concrete example of how it can go wrong, so thanks for that :-)
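To make steps (1) and (2) concrete, here is a minimal sketch, assuming a toy supervisor with only two actions. Everything named in it is made up for illustration and is not from the post: the `help`/`refuse` actions, the choice counts, the fixed `BETA`, and the helper functions. It fits a Boltzmann-rational template, P(a) proportional to exp(beta * R(a)), by maximum likelihood to each of two observation contexts separately.

```python
import math

BETA = 1.0  # assumed, fixed rationality parameter (hypothetical value)


def boltzmann_policy(rewards, beta):
    """Return P(a) proportional to exp(beta * R(a))."""
    m = max(rewards.values())  # subtract the max for numerical stability
    weights = {a: math.exp(beta * (r - m)) for a, r in rewards.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


def log_likelihood(rewards, beta, counts):
    """Log-likelihood of observed choice counts under the template."""
    policy = boltzmann_policy(rewards, beta)
    return sum(n * math.log(policy[a]) for a, n in counts.items())


def fit_rewards(counts, beta, steps=2000, lr=0.05):
    """Fit per-action rewards by gradient ascent on the log-likelihood."""
    rewards = {a: 0.0 for a in counts}
    total = sum(counts.values())
    for _ in range(steps):
        policy = boltzmann_policy(rewards, beta)
        for a in rewards:
            # d(LL)/dR(a) = beta * (count(a) - total * P(a)); scaled by 1/total
            rewards[a] += lr * beta * (counts[a] - total * policy[a]) / total
    return rewards


# Made-up observations of the same supervisor in two contexts.
public_counts = {"help": 20, "refuse": 80}   # "public-emotional" behaviour
private_counts = {"help": 80, "refuse": 20}  # "private-calm" behaviour

public_fit = fit_rewards(public_counts, BETA)
private_fit = fit_rewards(private_counts, BETA)

for name, fit, counts in [
    ("public", public_fit, public_counts),
    ("private", private_fit, private_counts),
]:
    gap = fit["help"] - fit["refuse"]
    ll = log_likelihood(fit, BETA, counts)
    print(f"{name}: R(help) - R(refuse) = {gap:.3f}, log-likelihood = {ll:.3f}")
```

Under these assumptions the two fits recover opposite reward orderings with the same log-likelihood, so the template alone gives no reason to prefer one context over the other. Something outside the fit, such as labeled examples of good and bad, has to break the tie.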

Notice that this elaboration is actually true: the level of authenticity of our private-calm expressions is roughly the same as that of the public-emotional ones we have constructed specifically for aAlice. So where theory of mind is concerned, adding true statements can sometimes make misinterpretations worse.

Rohin Shah (4y)

Fwiw, in this story I find myself surprised at aAlice's confidence in her theory. If I were telling a story about an unlearnable theory of mind, I'd be leaning on huge uncertainty that prevents aAlice from doing anything.

Stuart_Armstrong (4y)

It's an interesting question as to whether aAlice is actually overconfident. Her predictions about human behaviour may be spot on at this point, much better than human predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.

Steven Byrnes (4y)


Alice learns the ways of aliens

aAlice learns the ways of humans

Can we convince her she's wrong?

Unsolvable?