Future directions for ambitious value learning

Rohin Shah

To recap the sequence so far:

Ambitious value learning aims to infer a utility function that is safe to maximize, by looking at human behavior.
However, since you only observe human behavior, you must be able to infer and account for the mistakes that humans make in order to exceed human performance. (If we don’t exceed human performance, it’s likely that we’ll use unsafe techniques that do exceed human performance, due to economic incentives.)
You might hope to infer both the mistake model (aka systematic human biases) and the utility function, and then throw away the mistake model and optimize the utility function. This cannot be done without additional assumptions.
One potential assumption you could use would be to codify a specific mistake model. However, humans are sufficiently complicated that any such model would be wrong, leading to model misspecification. Model misspecification causes many problems in general, and is particularly thorny for value learning.

Despite these arguments, we could still hope to infer a broad utility function that is safe to optimize, either by sidestepping the formalism used so far, or by introducing additional assumptions. Often, it is clear that these methods would not find the true human utility function (assuming that such a thing exists), but they are worth pursuing anyway because they could find a utility function that is good enough.

This post provides pointers to approaches that are currently being pursued. Since these are active areas of research, I don’t want to comment on how feasible they may or may not be -- it’s hard to accurately assess the importance and quality of an idea that is being developed just from what is currently written down about that idea.

Assumptions about the mistake model. We could narrow down the mistake model by making assumptions about it, which could let us avoid the impossibility result. This decision means that we’re accepting the risk of misspecification -- but perhaps as long as the mistake model is not too misspecified, the outcome will still be good.

Learning the Preferences of Ignorant, Inconsistent Agents shows how to infer utility functions when you have an exact mistake model, such as “the human is a hyperbolic time discounter”. (Learning the Preferences of Bounded Agents and the online book Modeling Agents with Probabilistic Programs cover similar ground.)
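
To make the "hyperbolic time discounter" mistake model concrete, here is a minimal sketch of why it counts as a systematic bias: a hyperbolic discounter can reverse its preference between the same two options as they draw nearer, while an exponential discounter cannot. This is only an illustration, not the inference method from the paper; the function names, the discount forms 1 / (1 + k * delay) and gamma ** delay, and all the numbers are assumptions chosen for the example.

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Present value under hyperbolic discounting (k is an assumed constant)."""
    return reward / (1.0 + k * delay)


def exponential_value(reward, delay, gamma=0.9):
    """Present value under exponential discounting (gamma is an assumed constant)."""
    return reward * gamma ** delay


def preferred(options, wait, discount):
    """Return the option with the highest discounted value.

    options maps a name to (reward, delay); wait is extra time that must
    pass before either option becomes available.
    """
    return max(
        options,
        key=lambda name: discount(options[name][0], options[name][1] + wait),
    )


# Hypothetical choice: a small reward now, or a larger one two steps later.
OPTIONS = {"small_soon": (10.0, 0.0), "large_late": (15.0, 2.0)}

far_away = preferred(OPTIONS, wait=10.0, discount=hyperbolic_value)
up_close = preferred(OPTIONS, wait=0.0, discount=hyperbolic_value)
```

Under these assumed numbers the hyperbolic agent plans to wait for the larger reward when both options are far away, then takes the smaller one when it becomes immediately available. A value learner that assumed time-consistent behavior would have to explain that switch as a change in preferences rather than as a bias.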

Inferring Reward Functions from Demonstrators with Unknown Biases takes this a step further by simultaneously learning the mistake model and the utility function, while making weaker assumptions on the mistake model than “the human is noisily optimal”. Of course, it does still make assumptions, or it would fall prey to the impossibility result (in particular, it would be likely to infer the negative of the “true” utility function).

The structure of the planning algorithm. Avoiding the impossibility result requires us to distinguish between (planner, reward) pairs that lead to the same policy. One approach is to look at the internal structure of the planner (this corresponds to looking inside the brains of individual humans). I like this post as an introduction, but many of Stuart Armstrong's other posts are tackling some aspect of this problem. There is also work that aims to build a psychological model of what constitutes human values, and use that to infer values, described in more detail (with citations) in this comment.

Assumptions about the relation of behavior to preferences. One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily. Perhaps this is because we have built-in priors that relate observations of behavior to preferences, which we could impart to our AI systems. For example, we could encode the assumption that regret is bad, or that lying about values is similar to lying about facts.
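
The rational versus anti-rational ambiguity can be seen in a toy setting. The sketch below is a minimal illustration under assumed names and an assumed two-state reward table, not the formal construction from the impossibility theorem: a planner that maximizes a reward R and a planner that minimizes the negated reward -R produce exactly the same policy, so behavior alone cannot tell the two (planner, reward) pairs apart.

```python
def rational_planner(reward):
    """Pick the highest-reward action in every state."""
    return {state: max(acts, key=acts.get) for state, acts in reward.items()}


def anti_rational_planner(reward):
    """Pick the lowest-reward action in every state."""
    return {state: min(acts, key=acts.get) for state, acts in reward.items()}


def negate(reward):
    """Return the reward table with every value's sign flipped."""
    return {
        state: {action: -value for action, value in acts.items()}
        for state, acts in reward.items()
    }


# Hypothetical reward table: state -> action -> reward.
R = {
    "hungry": {"eat": 1.0, "wait": 0.0},
    "tired": {"sleep": 2.0, "work": -1.0},
}

policy_a = rational_planner(R)               # maximizes R
policy_b = anti_rational_planner(negate(R))  # minimizes -R
```

Here policy_a and policy_b are identical dictionaries. Any prior that lets us prefer the first explanation has to come from somewhere other than the observed policy.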

From the perspective of the sequence so far, both things we say and things we do count as “human behavior”. But perhaps we could add in an assumption that inferences from speech and inferences from actions should mostly agree, and have rules about what to do if they don’t agree. There is a lot of work that uses natural language to guide some other learning process, but I don’t know of any work that tries to resolve conflicts between speech and actions (or multimodal input more generally). It’s something that I’m optimistic about. Acknowledging Human Preference Types to Support Value Learning explores this problem in more detail, suggesting some aggregation rules, but doesn't test any of these rules on real problems.

Other schemes for learning utility functions. One could imagine particular ways that value learning could go which would result in learning a good utility function. These cases typically can be recast as making some assumption about the mistake model.

For example, this comment proposes that the AI first asks humans how they would like their life to be while they figure out their utility function, and then uses that information to compute a distribution of "preferred" lives from which it learns the full utility function. The rest of the thread is a good example of applying the “mistake model” way of thinking to a proposal that does not obviously fit in its framework. There has been much more thinking spread across many posts and comment threads in a similar vein that I haven’t collected, but you might be able to find some of it by looking at discussions between Paul Christiano and Wei Dai.

Resolving human values, completely and adequately presents another framework that aims for an adequate utility function instead of a perfect one.

Besides the approaches above, which still seek to infer a single utility function, there are a few other related approaches:

Tolerating a mildly misspecified utility function. The ideas of satisficing and mild optimization are trying to make us more robust to a misspecified utility function, by reducing how much we optimize the utility function. The key example of this is quantilizers, which select an action randomly from the top N% of actions from some distribution, sorted by expected utility.
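
As a rough illustration of the quantilizer idea, here is a minimal sampling-based sketch. It approximates "the top N% of actions from some distribution" by drawing a finite batch of actions from a base distribution, ranking them by expected utility, and choosing uniformly from the top fraction. The function name, its parameters, and the sampling approximation are all assumptions made for this example rather than a reference implementation.

```python
import random


def quantilize(sample_base_action, expected_utility, top_fraction,
               n_samples=1000, rng=None):
    """Return a random action from the top fraction of sampled base actions.

    sample_base_action(rng) draws one action from the base distribution.
    expected_utility(action) scores an action under the (possibly
    misspecified) utility function.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    rng = rng or random.Random()
    candidates = [sample_base_action(rng) for _ in range(n_samples)]
    candidates.sort(key=expected_utility, reverse=True)
    cutoff = max(1, int(len(candidates) * top_fraction))
    return rng.choice(candidates[:cutoff])


def uniform_base(rng):
    """Hypothetical base distribution: uniform over the integers 0..99."""
    return rng.randrange(100)


def identity_utility(action):
    """Hypothetical utility: larger action numbers are better."""
    return action


action = quantilize(uniform_base, identity_utility, top_fraction=0.1,
                    rng=random.Random(0))
```

Setting top_fraction to 1.0 recovers plain sampling from the base distribution, and shrinking it towards 0 approaches picking the best sampled action. The parameter therefore controls how hard the utility function is optimized.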

Uncertainty over utility functions. Much work in value learning involves uncertainty over utility functions. This does not fix the issues presented so far -- we can consider what would happen if the AI updated on all possible information about the utility function. At that point, the AI would take the expectation of the resulting distribution, and maximize that function. This means that we once again end up with the AI optimizing a single function, and all of the same problems arise.
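
The collapse described above can be sketched in a few lines. Assuming a fixed, fully updated distribution over a finite set of candidate utility functions (the names and numbers below are made up for illustration), maximizing expected utility is the same as maximizing one particular function, the probability-weighted mean.

```python
def mean_utility(posterior):
    """Collapse a distribution over utility functions into a single function.

    posterior is a list of (probability, utility) pairs, where each utility
    maps every action to a number.
    """
    actions = posterior[0][1].keys()
    return {
        action: sum(prob * utility[action] for prob, utility in posterior)
        for action in actions
    }


def best_action(utility):
    """Return the action that maximizes a single utility function."""
    return max(utility, key=utility.get)


# Hypothetical final beliefs over two candidate utility functions.
POSTERIOR = [
    (0.7, {"a": 1.0, "b": 0.0, "c": 0.6}),
    (0.3, {"a": 0.0, "b": 1.0, "c": 0.6}),
]

collapsed = mean_utility(POSTERIOR)
choice = best_action(collapsed)
```

Once no further information can arrive, the AI in this sketch behaves exactly like a maximizer of the single function collapsed, so any misspecification in that function is optimized just as hard as before.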

To be clear, most researchers do not think that uncertainty is a solution to these problems -- uncertainty can be helpful for other reasons, which I talk about later in the sequence. I mention this area of work because it operates in the same framework of an AI optimizing a utility function, and because I suspect many people will automatically associate uncertainty with any kind of value learning, since CHAI has typically worked on both. However, work on uncertainty is typically not targeting the problem of learning a utility function that is safe to maximize.

Questions for people working or thinking of working in this area:

Is there a way to have an AI "understand" that the values it is learning are not terminal values, or even instrumental values, but "interim" values? That is, they are things that humans want, subject to the fact that the humans are still trying to figure out what their real values are, so the AI shouldn't be too attached to those values. Maybe it's possible to stretch the "utility function + mistakes" model to cover this, but it seems like it would be much better if there were a more natural / elegant way to model these "interim" values.

Relatedly, is there a way to apply value learning to the problem of metaphilosophy? In other words, can an AI, by observing humans try to solve philosophical problems, learn how to solve philosophical problems and exceed human level performance?

If the answer to the above question is "no" or "it's too hard", it may seem sufficient that an AI can just learn not to interfere with or manipulate a human's philosophical and moral deliberations. This may be much easier, but if we're headed towards a multi-polar world of AIs that are aligned to different users/owners, we also need our AIs to protect us against manipulation from other-aligned AIs. Such an AI would seemingly need to distinguish between attempts at manipulation and helpful (or at least good-faith) discussion (otherwise, how would we talk with anyone else in the world without risking AI manipulation?). But being able to make such distinctions seems a small step away from the ability to be actively helpful, so this problem doesn't seem much easier than learning how to do philosophical reasoning. Still, it may be useful to consider this as a separate problem just in case it is much easier.

Uncertainty over utility functions + a prior that there are systematic mistakes might be enough to handle this, but I agree that this problem seems hard and not yet tackled in the literature. I personally lean towards "expected explicit utility maximizers are the wrong framework to use".

One approach which I didn't see obviously listed here, though it is related to e.g. "The structure of the planning algorithm", is to first construct a psychological and philosophical model of what exactly human values are and how they are represented in the brain, before trying to translate them into a utility function.

One (but not the only possible) premise for this approach is that the utility function formalism is not particularly suited for things like changing values or dealing with ontology shifts; while a utility function may be a reasonable formalism for describing the choices that an agent would make at any given time, the underlying mechanism that generates those choices is not particularly well-characterized by a utility function. A toy problem that I have used before is the question of how to update your utility function if it was previously based on an ontology defined in N dimensions, but suddenly the ontology gets updated to include N+1 dimensions:

... we can now consider what problems would follow if we started off with a very human-like AI that had the same concepts as we did, but then expanded its conceptual space to allow for entirely new kinds of concepts. This could happen if it self-modified to have new kinds of sensory or thought modalities that it could associate its existing concepts with, thus developing new kinds of quality dimensions.

An analogy helps demonstrate this problem: suppose that you're operating in a two-dimensional space, where a rectangle has been drawn to mark a certain area as "forbidden" or "allowed". Say that you're an inhabitant of Flatland. But then you suddenly become aware that actually, the world is three-dimensional, and has a height dimension as well! That raises the question of, how should the "forbidden" or "allowed" area be understood in this new three-dimensional world? Do the walls of the rectangle extend infinitely in the height dimension, or perhaps just some certain distance in it? If just a certain distance, does the rectangle have a "roof" or "floor", or can you just enter (or leave) the rectangle from the top or the bottom? There doesn't seem to be any clear way to tell.

As a historical curiosity, this dilemma actually kind of really happened when airplanes were invented: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? Courts and legislation eventually settled on the latter answer.

In a sense, we can say that law is a kind of a utility function representing a subset of human values at some given time; when the ontology that those values are based on shifts, the laws get updated as well. A question to ask is: what is the reasoning process by which humans update their values in such a situation? And given that a mature AI's ontology is bound to be different than ours, how do we want the AI to update its values / utility function in an analogous situation?

Framing the question this way suggests that constructing a utility function is the wrong place to start; rather we want to start with understanding the psychological foundation of human values first, and then figure out how we should derive utility functions from those. That way we can also know how to update the utility function when necessary.

Furthermore, as this post notes, humans routinely make various assumptions about the relation of behavior and preferences, and a proper understanding of the psychology and neuroscience of decision-making seems necessary for evaluating those assumptions.

Some papers that take this kind of an approach are Sotala 2016, Sarma & Hay 2017, Sarma, Safron & Hay 2018.

Thanks for the detailed comment! I definitely intended to include all of this within "The structure of the planning algorithm", but I wasn't aware of the papers you cited. I'll add a pointer to this comment to the post.

One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily.

Why does it seem to you that humans do this easily? If I saw two people running businesses and was told that one person was optimising for profit and the other was anti-optimising for negative profit, not only would I not anticipate being able to tell which was which, I would be pretty suspicious of the claim that there was any relevant difference between the two.

In that scenario I would predict that the thing I was told was wrong, i.e. it is simply not true that one of them is anti-optimizing for negative profit. I have strong priors that people are optimizing for things they want.

Perhaps it's just a prior that people are relatively good at optimizing for things they want. But the impossibility theorem seems to indicate that there are lots of different planners you could hypothesize, and somehow humans just seize upon one. (Though we're often wrong, e.g. typical mind fallacy.)

TL;DR: we do surprisingly well at inferring goals, given this impossibility result, and I'm not sure why. Maybe it's a prior we're born with.