Future directions for ambitious value learning

Questions for people working or thinking of working in this area:

Is there a way to have an AI "understand" that the values it is learning are not terminal values, or even instrumental values, but "interim" values? That is, they are things that humans want, subject to the fact that the humans are still trying to figure out what their real values are, so the AI shouldn't be too attached to those values. Maybe it's possible to stretch the "utility function + mistakes" model to cover this, but it seems like it would be much better if there were a more natural / elegant way to model these "interim" values.

Relatedly, is there a way to apply value learning to the problem of metaphilosophy? In other words, can an AI, by observing humans try to solve philosophical problems, learn how to solve philosophical problems and exceed human level performance?

If the answer to the above question is "no" or "it's too hard", it may seem sufficient that an AI can just learn not to interfere with or manipulate a human's philosophical and moral deliberations. This may be much easier, but if we're headed towards a multi-polar world of AIs that are aligned to different users/owners, we also need our AIs to protect us against manipulation from other-aligned AIs. Such an AI would seemingly need to distinguish between attempts at manipulation and helpful (or at least good-faith) discussion (otherwise, how would we talk with anyone else in the world without risking AI manipulation?). But being able to make such distinctions seems a small step away from the ability to be actively helpful, so this problem doesn't seem much easier than learning how to do philosophical reasoning. Still, it may be useful to consider this as a separate problem just in case it is much easier.

[-]Rohin Shah7y30

Uncertainty over utility functions + a prior that there are systematic mistakes might be enough to handle this, but I agree that this problem seems hard and not yet tackled in the literature. I personally lean towards "expected explicit utility maximizers are the wrong framework to use".
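To make "uncertainty over utility functions" concrete, here is a minimal sketch of a Bayesian update over a small set of candidate utility functions. Everything in it is an illustrative assumption rather than anything from the post: the candidate names, the tea/coffee choices, and the Boltzmann (softmax) noise model with rationality parameter `beta`. Note that Boltzmann noise only captures random errors; a prior over systematic mistakes would need a richer model of the human than this.

```python
import math

def boltzmann_likelihood(utility, chosen, options, beta):
    """P(chosen | utility) for a noisily rational chooser (assumed noise model)."""
    weights = [math.exp(beta * utility[o]) for o in options]
    return math.exp(beta * utility[chosen]) / sum(weights)

def posterior_over_utilities(candidates, prior, observations, beta=1.0):
    """Bayes update over candidate utility functions.

    candidates:   {name: {option: utility}}
    prior:        {name: probability}
    observations: list of (chosen_option, available_options)
    """
    unnormalized = {}
    for name, utility in candidates.items():
        p = prior[name]
        for chosen, options in observations:
            p *= boltzmann_likelihood(utility, chosen, options, beta)
        unnormalized[name] = p
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

# Hypothetical example: two candidate utility functions, four observed choices.
candidates = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}
observations = [("tea", ["tea", "coffee"])] * 3 + [("coffee", ["tea", "coffee"])]

posterior = posterior_over_utilities(candidates, prior, observations, beta=2.0)
print(posterior)
```

The single "coffee" choice lowers confidence in `likes_tea` without eliminating it, which is the sense in which the learner is not "too attached" to any one hypothesis.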

[-]Kaj_Sotala7y50

One approach that I didn't see explicitly listed here, though it is related to e.g. "The structure of the planning algorithm", is to first construct a psychological and philosophical model of what exactly human values are and how they are represented in the brain, before trying to translate them into a utility function.

One (but not the only possible) premise for this approach is that the utility function formalism is not particularly suited for things like changing values or dealing with ontology shifts; while a utility function may be a reasonable formalism for describing the choices that an agent would make at any given time, the underlying mechanism that generates those choices is not particularly well-characterized by a utility function. A toy problem that I have used before is the question of how to update your utility function if it was previously based on an ontology defined in N dimensions, but suddenly the ontology gets updated to include N+1 dimensions:

... we can now consider what problems would follow if we started off with a very human-like AI that had the same concepts as we did, but then expanded its conceptual space to allow for entirely new kinds of concepts. This could happen if it self-modified to have new kinds of sensory or thought modalities that it could associate its existing concepts with, thus developing new kinds of quality dimensions.

An analogy helps demonstrate this problem: suppose that you're operating in a two-dimensional space, where a rectangle has been drawn to mark a certain area as "forbidden" or "allowed". Say that you're an inhabitant of Flatland. But then you suddenly become aware that actually, the world is three-dimensional, and has a height dimension as well! That raises the question of, how should the "forbidden" or "allowed" area be understood in this new three-dimensional world? Do the walls of the rectangle extend infinitely in the height dimension, or perhaps just some certain distance in it? If just a certain distance, does the rectangle have a "roof" or "floor", or can you just enter (or leave) the rectangle from the top or the bottom? There doesn't seem to be any clear way to tell.

As a historical curiosity, this dilemma actually kind of really happened when airplanes were invented: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? Courts and legislation eventually settled on the latter answer.

In a sense, we can say that law is a kind of a utility function representing a subset of human values at some given time; when the ontology that those values are based on shifts, the laws get updated as well. A question to ask is: what is the reasoning process by which humans update their values in such a situation? And given that a mature AI's ontology is bound to be different than ours, how do we want the AI to update its values / utility function in an analogous situation?

Framing the question this way suggests that constructing a utility function is the wrong place to start; rather we want to start with understanding the psychological foundation of human values first, and then figure out how we should derive utility functions from those. That way we can also know how to update the utility function when necessary.
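The Flatland toy problem can be sketched in a few lines. This is only an illustration under assumed names and numbers (the rectangle bounds, the two extension rules, and the `ceiling` value are all hypothetical): two different 3D rules agree with the original 2D rule everywhere on the ground, yet disagree once height matters, so the 2D rule alone cannot tell us which one to adopt.

```python
def forbidden_2d(x, y):
    """Original rule, defined only over the 2D ontology (assumed rectangle)."""
    return 0 <= x <= 2 and 0 <= y <= 1

def extend_infinite_walls(x, y, z):
    """Candidate 3D extension: the walls go up forever, so z is ignored."""
    return forbidden_2d(x, y)

def extend_capped(x, y, z, ceiling=10.0):
    """Candidate 3D extension: the forbidden region stops at a ceiling."""
    return forbidden_2d(x, y) and 0 <= z <= ceiling

# On the ground (z = 0) both extensions reproduce the 2D rule exactly.
ground_points = [(1, 0.5), (3, 0.5), (0, 0), (2.5, 2)]
agree_on_ground = all(
    extend_infinite_walls(x, y, 0) == extend_capped(x, y, 0) == forbidden_2d(x, y)
    for x, y in ground_points
)

# High above the rectangle, the two extensions give different answers.
disagree_in_air = extend_infinite_walls(1, 0.5, 100) != extend_capped(1, 0.5, 100)

print(agree_on_ground, disagree_in_air)
```

Choosing between the two extensions requires information that the old rule does not contain, such as the purpose the rule was serving, which is roughly what the airplane example illustrates.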

Furthermore, as this post notes, humans routinely make various assumptions about the relation of behavior and preferences, and a proper understanding of the psychology and neuroscience of decision-making seems necessary for evaluating those assumptions.

Some papers that take this kind of an approach are Sotala 2016, Sarma & Hay 2017, Sarma, Safron & Hay 2018.

[-]Rohin Shah7y20

Thanks for the detailed comment! I definitely intended to include all of this within "The structure of the planning algorithm", but I wasn't aware of the papers you cited. I'll add a pointer to this comment to the post.

[-]DanielFilan7y30

One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily.

Why does it seem to you that humans do this easily? If I saw two people running businesses and was told that one person was optimising for profit and the other was anti-optimising for negative profit, not only would I not anticipate being able to tell which was which, I would be pretty suspicious of the claim that there was any relevant difference between the two.

[-]Rohin Shah7y30

In that scenario I would predict that the thing I was told was wrong, i.e. it is simply not true that one of them is anti-optimizing for negative profit. I have strong priors that people are optimizing for things they want.

Perhaps it's just a prior that people are relatively good at optimizing for things they want. But the impossibility theorem seems to indicate that there are lots of different planners you could hypothesize, and somehow humans just seize upon one. (Though we're often wrong, e.g. the typical mind fallacy.)

TL;DR: we do surprisingly well at inferring goals, given this impossibility result, and I'm not sure why. Maybe it's a prior we're born with.
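The profit example from this thread can be written out as a small sketch. The planner functions, action names, and reward numbers below are hypothetical; the point is only that a maximizer paired with a reward and a minimizer paired with the negated reward select the same action, so observed behavior alone cannot separate the two (planner, reward) hypotheses.

```python
def maximizing_planner(reward, actions):
    """'Rational' planner: picks the action with the highest reward."""
    return max(actions, key=lambda a: reward[a])

def minimizing_planner(reward, actions):
    """'Anti-rational' planner: picks the action with the lowest reward."""
    return min(actions, key=lambda a: reward[a])

# Hypothetical business decisions and their profit.
profit = {"raise_prices": 3.0, "cut_costs": 5.0, "do_nothing": 0.0}
negative_profit = {action: -value for action, value in profit.items()}
actions = list(profit)

choice_rational = maximizing_planner(profit, actions)
choice_anti = minimizing_planner(negative_profit, actions)

print(choice_rational, choice_anti)
```

Any preference for the first hypothesis over the second therefore has to come from a prior over planners, not from the behavioral data.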

AI ALIGNMENT FORUM