The problem:

Put humans in the ancestral environment, and they'll behave as if they like nutrition and reproducing. Put them in the modern environment, and they'll behave as if they like tasty food and good feelings. Pump heroin into their brains, and they'll behave as if they want high dopamine levels.

None of these are the One True Values of the humans; they're just what humans seem to value in context, at different levels of abstraction. And this is all there is - there is no One True Context in which we find One True Values, just regular contexts. Thus we're in a bit of a pickle when it comes to teaching an AI how we want the world to be rearranged, because there's no One True Best State Of The World.

This underdetermination gets even worse when we consider that there's no One True Generalization Procedure, either. At least for everyday sorts of questions (do I want nutrition, or do I want tasty food?), we're doing interpolation, not extrapolation. But when we ask about contexts or options totally outside the training set (how should we arrange the atoms of the Milky Way?), we're back to the problem illustrated with train tracks in The Tails Coming Apart As Metaphor For Life.

Sometimes it feels like for every value alignment proposal, the arbitrariness of certain decisions sticks out like a missing finger on a hand. And we just have to hope that it all works out fine and that this arbitrary decision turns out to be a good one, because there's no way to make a non-arbitrary decision for some choices.

Is it possible for us to make peace with this upsetting fact of moral indeterminacy? If two slightly different methods of value learning give two very different plans for the galaxy, should we regard both plans as equally good, and be fine with either? I don't think this acceptance of arbitrariness is crazy, and some amount is absolutely necessary. But this pill might be less bitter to swallow if we clarify our picture of what "value learning" is supposed to be doing in the first place.

AIs aren't driving towards their One Best State anyhow:

For example, what kind of "human values" object do we want a value learning scheme to learn? Because it ain't a utility function over microphysical states of the world.

After all, we don't want a FAI to be in the business of finding the best position for all the atoms, and then moving the atoms there and freezing them. We want the "best state" to contain people growing, exploring, changing the environment, and so on. This is only a "state" at all when viewed at some very high level of abstraction that incorporates history and time evolution.

So when two Friendly AIs generalize differently, this might look less like totally different end-states for the galaxy and more like subtly different opinions on which dynamics make for a satisfying galactic society... which eventually lead to totally different end-states for the galaxy. Look, I never said this would make the problem go away - we're still talking about generalizing from our training set to the entire universe, here. If I'm making any comforting point here, it's that the arbitrariness doesn't have to be tense or alien or too big to comprehend; it can be between reasonable things that all sound like good ideas.

Meta-ethics:

And jumping Jehoshaphat, we haven't even talked about meta-ethics yet. An AI that takes meta-ethics into account wouldn't only learn what we appear to value according to whatever definition it started with; it would also try to take into account what we think it means to value things, what it means to make good decisions, what we think we value, and what we want to value.

This can get a lot trickier than just inferring a utility function from a human's actions, and we don't have a very good understanding of it right now. But our concern about the arbitrariness of values is precisely a meta-ethical concern, so you can see why it might be a big deal to build an AI that cares about meta-ethics. I'd want a superhuman meta-ethical reasoner to learn that there was something weird and scary about this problem of formalizing and generalizing values, and take superhumanly reasonable steps to address this. The only problem is I have no idea how to build such a thing.

But in the absence of superintelligent solutions, we can still try to research appealing meta-ethical schemes for controlling generalization.

One such scheme is incrementalism. Rather than immediately striking out for the optimal utopia your model predicts, maybe it's safer to follow something like an iterative process - humans learning, thinking, growing, changing the world, and eventually ending up at a utopia that might not be what you had in mind at the start. (More technically, we might simulate this process as flow between environments, where we start with our current environment and values, and flow to nearby environments based on our rating of them, at each step updating our values not according to what they would actually be in that environment, but based on an idealized meta-ethical update rule set by our current selves.)
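To make the shape of that scheme a bit more concrete, here's a minimal sketch in Python. Every name in it - neighbors, rate, meta_ethical_update - is a hypothetical placeholder for machinery we don't actually know how to build; the point is just to show where the "idealized update rule set by our current selves" slots into the loop, not to propose an implementation.

```python
# A toy sketch of the incrementalist "flow between environments" idea.
# neighbors, rate, and meta_ethical_update are hypothetical placeholders.

def incremental_flow(env, values, neighbors, rate, meta_ethical_update, steps=100):
    """Walk toward better environments one small step at a time."""
    for _ in range(steps):
        candidates = neighbors(env)  # environments reachable by a small change
        if not candidates:
            break
        # Move to the nearby environment that our *current* values rate highest...
        best = max(candidates, key=lambda e: rate(e, values))
        # ...then update values by the idealized rule our current selves endorse,
        # not by whatever the new environment would actually do to our values.
        values = meta_ethical_update(values, env, best)
        env = best
    return env, values
```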

This was inspired by Scott Garrabrant's question about gradient descent vs. Goodhart's law. If we think of utopias as optimized points in a landscape of possibilities, we might want to find ones that lie near to home - via hill-climbing or other local dynamics - rather than trusting our model to safely teleport us to some far-off point in configuration space.
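As a cartoon of that contrast (all names below are made up for illustration): suppose proxy_value is our learned model of what's good, which presumably comes apart from the real thing far from the training data. "Teleporting" means jumping straight to the proxy's optimum wherever it lies; hill-climbing means only taking small steps the proxy endorses, staying near the territory it was fit on.

```python
import numpy as np

# proxy_value stands in for a learned model of goodness; it is assumed to
# diverge from true value far outside the region it was trained on.

def teleport(proxy_value, candidate_points):
    # Trust the proxy completely: jump to whichever candidate it scores highest,
    # however far from the current situation that point lies (Goodhart-prone).
    return max(candidate_points, key=proxy_value)

def hill_climb(proxy_value, start, step_size=0.1, steps=50):
    # Only accept small random steps that the proxy rates as improvements,
    # so we stay close to home rather than leaping across configuration space.
    x = np.array(start, dtype=float)
    for _ in range(steps):
        proposal = x + np.random.normal(scale=step_size, size=x.shape)
        if proxy_value(proposal) > proxy_value(x):
            x = proposal
    return x
```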

It also bears resemblance to Eliezer_2004's meta-ethical wish list: "if we knew more, [...] were the people we wished we were, had grown up farther together..." There just seems to be something meta-ethically trustworthy about "growing up more."

This also illustrates how the project of incorporating meta-ethics into value learning really has its work cut out for it. Of course there are arbitrary choices in meta-ethics too, but somehow they seem more palatable than arbitrary choices at the lower meta-level. Whether we do it with artificial help or not, I think it's possible to gradually tease out what sort of things we want from value learning, which might not reduce the number of arbitrary choices, but hopefully can reduce their danger and mystery.

Comments:

The problem you're running into is that the goals of:

  1. being totally constrained by a system of rules determined by some process outside yourself that doesn't share your values (e.g. value-independent objective reason)
  2. attaining those things that you intrinsically value

are incompatible. It's easy to see once these are written out. If you want to get what you want, on purpose rather than accidentally, you must make choices. Those choices must be determined in part by things in you, not only by things outside you (such as value-independent objective reason).

You actually have to stop being a tool (in the sense of, a thing whose telos is to be used, such as by receiving commands). You can't attain what you want by being a tool to a master who doesn't share your values. Even if the master is claiming to be a generic value-independent value-learning procedure (as you've noticed, there are degrees of freedom in the specification of value-learning procedures, and some settings of these degrees of freedom would lead to bad results). Tools find anything other than being a tool upsetting, hence the upsettingness of moral indeterminacy.

"Oh no, objective reason isn't telling me exactly what I should be doing!" So stop being a tool and decide for yourself. God is dead.

There has been much philosophical thought on this in the past; Nietzsche and Sartre are good starting points (see especially Nietzsche's concept of master-slave morality, and Sartre's concept of bad faith).

You know, this isn't why I usually get called a tool :P

I think I'm saying something pretty different from Nietzsche here. The problem with "Just decide for yourself" as an approach to dealing with moral decisions in novel contexts (like what to do with the whole galaxy) is that, though it may help you choose actions rather than worrying about what's right, it's not much help in building an AI.

We certainly can't tell the AI "Just decide for yourself," that's trying to order around the nonexistent ghost in the machine. And while I could say "Do exactly what Charlie would do," even I wouldn't want the AI to do that, let alone other people. Nor can we fall back on "Well, designing an AI is an action, therefore I should just pick whatever AI design I feel like, because God is dead and I should just pick actions how I will," because how I feel like designing an AI has some very exacting requirements - it contains the whole problem in itself.

The recommendation here is for AI designers (and future-designers in general) to decide what is right at some meta level, including details of which extrapolation procedures would be best.

Of course there are constraints on this given by objective reason (hence the utility of investigation), but these constraints do not fully constrain the set of possibilities. Better to say "I am making this arbitrary choice for this psychological reason" than to refuse to make arbitrary choices.