Thanks to Rebecca Gorman for help with this post.
I recently argued that full understanding of value extrapolation was necessary and almost sufficient for solving the AI alignment problem.
In that post, I introduced situations beyond the human moral training distribution, where we aren't sure how to interpret their moral value. I gave a convoluted example of an AI CEO engineering the destruction of its company and the ambiguously-moral creation of personal assistants, all in order to boost value for its shareholders. In the past, I've also given examples of willing slave races and Croatian, communist, Yugoslav nationalists in the 1980s. We could also consider what happens as children develop moral instincts in a world they realise is more complicated, or when ordinary humans encounter counter-intuitive thought-experiments for the first time. We also have some examples from history, when situations changed and new questions appeared.
I won't belabour the point any further. Let's call these situations "morally underdefined".
Most of the examples I gave above are rather mild: yes, we're unsure what the right answer is, but it's probably not a huge disaster if the AI gets it wrong. But morally underdefined situations can be much worse than that. The easiest examples are ones that trade off a huge potential good against a huge potential bad, so that deciding the wrong way could go extremely badly; we need to carefully sort out the magnitude of the positives and the negatives before making any decision.
The repugnant conclusion is a classic example of that; we wouldn't want to get to the "huge population with lives barely worth living, filled with muzak and potatoes" and only discover afterwards that there was an argument we had missed against total utilitarianism.
Another good example is if we developed whole brain emulations (ems), but they were imperfect. This would resolve the problem of death and might resolve most of the problem of suffering. But what price would we be willing to pay? What if our personalities were radically changed? What if our personalities were subtly manipulated? What if we lost our memories over a year of subjective time - we could back these up in classical computer storage and access them as images and videos, but our internal memories would be lost? What if we became entities that were radically different in seemingly positive ways, but we hadn't had time to think through the real consequences? How much suffering would we still allow - and of what sort?
Or what about democracy among ems? Suppose we had a new system that allowed some reproduction, as long as no more than 50% of the "parents'" values were in the new entities. We'd probably want to think through what that meant before accepting.
Similarly, what would we be willing to risk to avoid possible negatives - would we accept a 1% increase in the risk of human extinction, in order to avoid human-brain-cell-teddies, willing slave races, the repugnant conclusion, and not-quite-human-emulations?
So the problem of morally underdefined situations is not a small issue for AI safety; the problem can be almost arbitrarily huge.
A more precise term than "model splintering". ↩︎
My favourite example might be the behaviour of Abraham Lincoln in the early days of the US Civil War. The US constitution seemed to rule out secession; but did it empower the president to actively prevent secession? The answer is a clear "it had never come up before and people hadn't decided", and there were various ways to extend precedents. Lincoln chose one route that was somewhat compatible with these precedents (claiming war powers to prevent secession). His predecessor had chosen another one (claiming secession was illegal but that the federal government couldn't prevent it). ↩︎