Can we make peace with moral indeterminacy?

by Charlie Steiner, 3rd Oct 2019



The problem:

Put humans in the ancestral environment, and they'll behave as if they like nutrition and reproducing. Put them in the modern environment, and they'll behave as if they like tasty food and good feelings. Pump heroin into their brains, and they'll behave as if they want high dopamine levels.

None of these are the One True Values of the humans; they're just what humans seem to value in context, at different levels of abstraction. And this is all there is - there is no One True Context in which we find One True Values, just regular contexts. Thus we're in a bit of a pickle when it comes to teaching an AI how we want the world to be rearranged, because there's no One True Best State Of The World.

This underdetermination gets even worse when we consider that there's no One True Generalization Procedure, either. At least for everyday sorts of questions (do I want nutrition, or do I want tasty food?), we're doing interpolation, not extrapolation. But when we ask about contexts or options totally outside the training set (how should we arrange the atoms of the Milky Way?), we're back to the problem illustrated with train tracks in The Tails Coming Apart As Metaphor For Life.
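To make the interpolation-versus-extrapolation point concrete, here's a minimal toy sketch (my own example, not from anything above): two curve fits that agree closely everywhere in a small training range can give wildly different answers far outside it.

```python
# Toy illustration: models that agree on the training data come apart in the tails.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)                  # "everyday" contexts
y_train = x_train + 0.05 * rng.normal(size=20)   # noisy observations of what we value

linear = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
cubic = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

for x in [0.5, 10.0]:   # interpolation vs. far extrapolation
    print(f"x = {x:5.1f}   linear: {linear(x):9.2f}   cubic: {cubic(x):9.2f}")
# Near x = 0.5 the two fits agree closely; at x = 10 (the "atoms of the
# Milky Way" regime) they diverge, even though both fit the data we had.
```

Both fits are perfectly reasonable summaries of the training set; the disagreement only shows up when we ask questions the data never answered.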

Sometimes it feels like for every value alignment proposal, the arbitrariness of certain decisions sticks out like a missing finger on a hand. We just have to hope that the arbitrary decision turns out to be a good one, because for some choices there's simply no way to make a non-arbitrary decision.

Is it possible for us to make peace with this upsetting fact of moral indeterminacy? If two slightly different methods of value learning give two very different plans for the galaxy, should we regard both plans as equally good, and be fine with either? I don't think this acceptance of arbitrariness is crazy, and some amount is absolutely necessary. But this pill might be less bitter to swallow if we clarify our picture of what "value learning" is supposed to be doing in the first place.

AIs aren't driving towards their One Best State anyhow:

For example, what kind of "human values" object do we want a value learning scheme to learn? Because it ain't a utility function over microphysical states of the world.

After all, we don't want a FAI to be in the business of finding the best position for all the atoms, and then moving the atoms there and freezing them. We want the "best state" to contain people growing, exploring, changing the environment, and so on. This is only a "state" at all when viewed at some very high level of abstraction that incorporates history and time evolution.
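As a rough illustration (my own framing, with made-up type names, not anything from the post), the object to be learned looks less like a score on frozen snapshots and more like a score on whole histories:

```python
# Hypothetical type sketch: a "values object" over histories, not snapshots.
from typing import Callable, Sequence

State = dict                   # stand-in for a world state at one moment
Trajectory = Sequence[State]   # a history: states ordered in time

SnapshotUtility = Callable[[State], float]         # "is this frozen arrangement good?"
TrajectoryUtility = Callable[[Trajectory], float]  # "is this way of unfolding good?"

def rewards_ongoing_exploration(history: Trajectory) -> float:
    """Toy trajectory property: score histories in which new things keep
    appearing, rather than the world settling into one fixed arrangement."""
    seen: set = set()
    novelty = 0
    for state in history:
        for key in state:
            if key not in seen:
                seen.add(key)
                novelty += 1
    return float(novelty)
```

Nothing about this toy scoring rule is a serious proposal; the point is only that growth, exploration, and change are properties of trajectories, which a utility over single microphysical states can't even express.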

So when two Friendly AIs generalize differently, this might look less like totally different end-states for the galaxy and more like subtly different opinions on which dynamics make for a satisfying galactic society... which eventually lead to totally different end-states for the galaxy. Look, I never said this would make the problem go away - we're still talking about generalizing from our training set to the entire universe, here. If I'm making any comforting point here, it's that the arbitrariness doesn't have to be tense or alien or too big to comprehend; it can be between reasonable things that all sound like good ideas.


And jumping Jehoshaphat, we haven't even talked about meta-ethics yet. An AI that takes meta-ethics into account wouldn't only learn what we appear to value according to whatever definition it started with; it would also try to take into account what we think it means to value things, what it means to make good decisions, what we think we value, and what we want to value.

This can get a lot trickier than just inferring a utility function from a human's actions, and we don't have a very good understanding of it right now. But our concern about the arbitrariness of values is precisely a meta-ethical concern, so you can see why it might be a big deal to build an AI that cares about meta-ethics. I'd want a superhuman meta-ethical reasoner to learn that there was something weird and scary about this problem of formalizing and generalizing values, and take superhumanly reasonable steps to address this. The only problem is I have no idea how to build such a thing.

But in the absence of superintelligent solutions, we can still try to research appealing meta-ethical schemes for controlling generalization.

One such scheme is incrementalism. Rather than immediately striking out for the optimal utopia your model predicts, maybe it's safer to follow something like an iterative process - humans learning, thinking, growing, changing the world, and eventually ending up at a utopia that might not be what you had in mind at the start. (More technically, we might simulate this process as a flow between environments: we start with our current environment and values, and flow to nearby environments based on our rating of them, at each step updating our values not according to what they would actually become in that environment, but according to an idealized meta-ethical update rule set by our current selves.)
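Here's a rough sketch of that flow, with every function name a hypothetical placeholder rather than a worked-out proposal: hill-climb through nearby environments, rating candidates with the current values, and let the values change only through a meta-ethical update rule fixed by our current selves at the start.

```python
# Sketch of the incrementalist flow described above (placeholder names throughout).
def incremental_flow(env, values, meta_update, neighbours, rate, steps=100):
    """Flow to nearby environments instead of jumping to the model's predicted optimum.

    env         -- the current environment
    values      -- the current values, used to rate candidate environments
    meta_update -- idealized update rule chosen by our current selves; values
                   change via this rule, not by drifting to whatever the new
                   environment would actually induce in us
    neighbours  -- neighbours(env) -> environments reachable in one small step
    rate        -- rate(env, values) -> float, how good env looks by these values
    """
    for _ in range(steps):
        candidates = neighbours(env)
        if not candidates:
            break
        # Take one small step: the nearby environment our current values rate best.
        env = max(candidates, key=lambda e: rate(e, values))
        # Then let values change, but only in ways the endorsed rule allows.
        values = meta_update(values, env)
    return env, values
```

The local max() step is the hill-climbing flavor of the next paragraph: we only ever evaluate and move to options close to where we already are.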

This was inspired by Scott Garrabrant's question about gradient descent vs. Goodhart's law. If we think of utopias as optimized points in a landscape of possibilities, we might want to find ones that lie near to home - via hill-climbing or other local dynamics - rather than trusting our model to safely teleport us to some far-off point in configuration space.

It also bears resemblance to Eliezer_2004's meta-ethical wish list: "if we knew more, [...] were the people we wished we were, had grown up farther together..." There just seems to be something meta-ethically trustworthy about "growing up more."

This also illustrates how the project of incorporating meta-ethics into value learning really has its work cut out for it. Of course there are arbitrary choices in meta-ethics too, but somehow they seem more palatable than arbitrary choices at the lower meta-level. Whether we do it with artificial help or not, I think it's possible to gradually tease out what sort of things we want from value learning, which might not reduce the number of arbitrary choices, but hopefully can reduce their danger and mystery.