Thanks to feedback on my recent post, I've been reconsidering some of my thinking about how to address meta-ethical uncertainty as it affects alignment. I've run into something unexpected, and I'm curious to see what others think. To quote the key paragraph from my current draft:

A further issue with assuming moral facts do not exist is that it's unclear how to resolve value conflicts in the absence of assumptions about value norms. Certainly any policy can be used to resolve value conflicts for any reason, but to the extent an AGI we consider aligned is aligning its behavior with human interests, its choice of policy would still count as a normative assumption about values, in practice if not in principle. Thus we are forced to assume moral facts exist in some form if alignment is to be solvable, because otherwise we would have no way of deciding how to resolve value conflicts and could not construct an AGI we would consider aligned.

This is not to say we must assume moral realism, cognitivism, or any other particular position about the nature of moral facts, merely that we must accept that moral facts exist: if they don't, alignment is impossible in principle, because an AGI will have no way to choose what to do when human values conflict. We could perhaps build aligned AGI in practice without doing this by just picking norms based on what someone preferred, but this seems to violate the spirit of the intention behind "alignment." Alignment is meant to be for all moral agents, not just one particular human (though, to be fair, addressing just one human is hard enough!), and not just that particular human's current preferences, which would not necessarily be endorsed upon reflective equilibrium. (But the idea that preferences under reflective equilibrium are better is itself a normative assumption of the sort we would be hard pressed to make if we did not assume the existence of moral facts!)

Also note that all of this is made harder by not having a precise definition of alignment I can lean on. Yes, I wrote one, but it depends in part on results I'm considering formally in this paper, so I can't use it here.

I'm still thinking this through and updating on the feedback I received on my first draft, but this seems like a significant enough departure from my earlier line of thinking to warrant soliciting additional feedback on issues I may be ignoring. Thanks!


