To recap the sequence so far:
- Ambitious value learning aims to infer a utility function that is safe to maximize, by looking at human behavior.
- However, since you only observe human behavior, you must be able to infer and account for the mistakes that humans make in order to exceed human performance. (If we don’t exceed human performance, it’s likely that we’ll use unsafe techniques that do exceed human performance, due to economic incentives.)
- You might hope to infer both the mistake model (aka systematic human biases) and the utility function, and then throw away the mistake model and optimize the utility function. This cannot be done without additional assumptions.
- One potential assumption you could use would be to codify a specific mistake model. However, humans are sufficiently complicated that any such model would be wrong, leading to model misspecification. Model misspecification causes many problems in general, and is particularly thorny for value learning.
Despite these arguments, we could still hope to infer a broad utility function that is safe to optimize, either by sidestepping the formalism used so far, or by introducing additional assumptions. Often, it is clear that these methods would not find the true human utility function (assuming that such a thing exists), but they are worth pursuing anyway because they could find a utility function that is good enough.
This post provides pointers to approaches that are currently being pursued. Since these are active areas of research, I don’t want to comment on how feasible they may or may not be -- it’s hard to accurately assess the importance and quality of an idea that is being developed just from what is currently written down about that idea.
Assumptions about the mistake model. We could narrow down the mistake model by making assumptions about it, which could let us avoid the impossibility result. This means accepting the risk of misspecification -- but perhaps as long as the mistake model is not too misspecified, the outcome will still be good.
Learning the Preferences of Ignorant, Inconsistent Agents shows how to infer utility functions when you have an exact mistake model, such as “the human is a hyperbolic time discounter”. (Learning the Preferences of Bounded Agents and the online book Modeling Agents with Probabilistic Programs cover similar ground.)
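To give a feel for what an exact mistake model like “the human is a hyperbolic time discounter” looks like, here is a minimal sketch. It is not the method from the papers above, only an illustration of the standard hyperbolic discounting formula; the option names, rewards, and the parameters `k` and `gamma` are made up for the example.

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Present value under hyperbolic discounting: reward / (1 + k * delay)."""
    return reward / (1.0 + k * delay)


def exponential_value(reward, delay, gamma=0.9):
    """Present value under exponential discounting: reward * gamma ** delay."""
    return reward * gamma ** delay


def preferred(options, now, discount):
    """Name of the option with the highest discounted value as seen at time `now`."""
    return max(options, key=lambda o: discount(o[1], o[2] - now))[0]


# Hypothetical choice: (name, reward, time at which the reward arrives).
OPTIONS = [("small_soon", 10.0, 10), ("large_late", 15.0, 12)]

for now in (0, 10):
    print(now,
          preferred(OPTIONS, now, hyperbolic_value),
          preferred(OPTIONS, now, exponential_value))
```

Under these assumed numbers, the hyperbolic agent prefers the larger, later reward when both are far away, but switches to the smaller, sooner reward once it is imminent, while the exponential agent never switches. An inference procedure that knows the human discounts hyperbolically can attribute the switch to the bias rather than to a change in what the human values.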
Inferring Reward Functions from Demonstrators with Unknown Biases takes this a step further by simultaneously learning the mistake model and the utility function, while making weaker assumptions on the mistake model than “the human is noisily optimal”. Of course, it does still make assumptions, or it would fall prey to the impossibility result (in particular, it would be likely to infer the negative of the “true” utility function).
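The failure mode in that parenthetical can be made concrete with a toy example. The sketch below is an illustration under simplifying assumptions (a one-step decision problem, with hypothetical states, actions, and reward values), not the construction used in the impossibility result itself: a fully rational planner paired with a reward and a fully anti-rational planner paired with the negated reward produce exactly the same policy, so behavior alone cannot tell the two pairs apart.

```python
def rational_planner(reward, states, actions):
    """Pick the action with the highest reward in each state."""
    return {s: max(actions, key=lambda a: reward[(s, a)]) for s in states}


def anti_rational_planner(reward, states, actions):
    """Pick the action with the lowest reward in each state."""
    return {s: min(actions, key=lambda a: reward[(s, a)]) for s in states}


# Hypothetical one-step decision problem.
STATES = ["home", "work"]
ACTIONS = ["rest", "exercise"]
TRUE_REWARD = {
    ("home", "rest"): 1.0,
    ("home", "exercise"): 2.0,
    ("work", "rest"): 0.5,
    ("work", "exercise"): -1.0,
}
NEGATED_REWARD = {key: -value for key, value in TRUE_REWARD.items()}

policy_a = rational_planner(TRUE_REWARD, STATES, ACTIONS)
policy_b = anti_rational_planner(NEGATED_REWARD, STATES, ACTIONS)
print(policy_a == policy_b)
```

Any assumption that rules out one of these two (planner, reward) pairs is doing real work, which is why methods in this category cannot be entirely assumption-free.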
The structure of the planning algorithm. Avoiding the impossibility result requires us to distinguish between (planner, reward) pairs that lead to the same policy. One approach is to look at the internal structure of the planner (this corresponds to looking inside the brains of individual humans). I like this post as an introduction, but many of Stuart Armstrong's other posts tackle some aspect of this problem. There is also work that aims to build a psychological model of what constitutes human values, and to use that model to infer values, described in more detail (with citations) in this comment.
Assumptions about the relation of behavior to preferences. One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily. Perhaps this is because we have built-in priors that relate observations of behavior to preferences, which we could impart to our AI systems. For example, we could encode the assumption that regret is bad, or that lying about values is similar to lying about facts.
From the perspective of the sequence so far, both things we say and things we do count as “human behavior”. But perhaps we could add in an assumption that inferences from speech and inferences from actions should mostly agree, and have rules about what to do if they don’t agree. There is a lot of work that uses natural language to guide some other learning process, but I don’t know of any work that tries to resolve conflicts between speech and actions (or multimodal input more generally). It’s a direction that I’m optimistic about. Acknowledging Human Preference Types to Support Value Learning explores this problem in more detail, suggesting some aggregation rules, but doesn't test any of these rules on real problems.
Other schemes for learning utility functions. One could imagine particular ways that value learning could go which would result in learning a good utility function. These cases typically can be recast as making some assumption about the mistake model.
For example, this comment proposes that the AI first asks humans how they would like their life to be while they figure out their utility function, and then uses that information to compute a distribution of "preferred" lives from which it learns the full utility function. The rest of the thread is a good example of applying the “mistake model” way of thinking to a proposal that does not obviously fit in its framework. There has been much more thinking spread across many posts and comment threads in a similar vein that I haven’t collected, but you might be able to find some of it by looking at discussions between Paul Christiano and Wei Dai.
Resolving human values, completely and adequately presents another framework that aims for an adequate utility function instead of a perfect one.
Besides the approaches above, which still seek to infer a single utility function, there are a few other related approaches:
Tolerating a mildly misspecified utility function. The ideas of satisficing and mild optimization aim to make us more robust to a misspecified utility function by reducing how strongly we optimize it. The key example of this is quantilizers, which select an action randomly from the top N% of actions drawn from some base distribution, sorted by expected utility.
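As a rough sketch of the quantilizer idea, assume the base distribution is represented by a finite list of sampled actions and that we have some function giving each action's expected utility. The names `quantilize`, `base_samples`, and `q` are hypothetical, chosen for this illustration.

```python
import random


def quantilize(base_samples, expected_utility, q=0.1, rng=random):
    """Pick uniformly at random among the top q fraction of the sampled actions.

    base_samples: actions drawn from the base distribution (assumed given).
    expected_utility: function mapping an action to its expected utility.
    q: fraction of top-ranked samples to keep, with 0 < q <= 1.
    """
    ranked = sorted(base_samples, key=expected_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # always keep at least one action
    return rng.choice(ranked[:cutoff])


# Toy usage: actions are the integers 0..99 and utility is the action itself.
actions = list(range(100))
chosen = quantilize(actions, lambda a: a, q=0.1, rng=random.Random(0))
print(chosen)
```

With `q=1.0` this just samples from the base distribution, and as `q` shrinks it approaches picking the single highest-utility sample, so `q` controls how much optimization pressure is applied to the (possibly misspecified) utility function.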
Uncertainty over utility functions. Much work in value learning involves uncertainty over utility functions. This does not fix the issues presented so far -- we can consider what would happen if the AI updated on all possible information about the utility function. At that point, the AI would take the expectation of the resulting distribution, and maximize that function. This means that we once again end up with the AI optimizing a single function, and all of the same problems arise.
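The step from "a distribution over utility functions" to "a single function" is just linearity of expectation. As a sketch, write $P(U \mid D)$ for the AI's distribution over utility functions after updating on all available data $D$, and $o$ for the outcome of taking action $a$ (this notation is introduced here for illustration and does not appear elsewhere in the sequence):

```latex
\begin{align*}
a^{*}
  &= \arg\max_{a} \; \mathbb{E}_{U \sim P(U \mid D)}\!\left[ \mathbb{E}_{o \sim P(o \mid a)}\left[ U(o) \right] \right] \\
  &= \arg\max_{a} \; \mathbb{E}_{o \sim P(o \mid a)}\!\left[ \bar{U}(o) \right],
  \qquad \text{where } \bar{U}(o) = \mathbb{E}_{U \sim P(U \mid D)}\left[ U(o) \right].
\end{align*}
```

Once there is nothing left to learn, $P(U \mid D)$ is fixed, so $\bar{U}$ is a fixed function and the AI behaves exactly like a maximizer of that one utility function.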
To be clear, most researchers do not think that uncertainty is a solution to these problems -- uncertainty can be helpful for other reasons, which I won’t get into here. I mention this area of work because it operates in the same framework of an AI optimizing a utility function, and because I suspect many people will automatically associate uncertainty with any kind of value learning, since CHAI has typically worked on both. However, uncertainty is typically not targeting the problem of learning a utility function that is safe to maximize.