Personal Blog

A putative new idea for AI control; index here.

There are various methods, such as Cooperative Inverse Reinforcement Learning (CIRL), that aim to have an AI deduce human preferences in some fashion.

The problem is that humans are not rational - citation certainly not needed. But, worse than that, they are not rational in ways that seriously complicate the task of fitting a reward or utility function to them. I presented one problem this entails in a previous post, which discussed the problems that emerge when an AI can influence a human's preferences through the way it presents the issues.

But there are other irrationalities which challenge a value learner. Here are some that can't easily be modelled as "true preferences + noise".

#. Humans procrastinate even though they'd generally prefer to proceed with their tasks.
#. Most people who fail a suicide attempt never try again.
#. People can be prejudiced while not desiring to be so (and, often, while not realising they are so).
#. Many rationalists wish to be consistent, and fail at this.
#. The young don't want to have the old-age preferences they should expect to have.
#. People generally take on the preferences of the social group they belong to.
#. Addicts may have stated or revealed preferences to continue to be addicts. However, if they'd never gotten addicted in the first place, they may have stated or revealed preferences to never do so.
#. The whole tension between stated and revealed preferences in the first place.
#. People desiring to have certain beliefs, or at least to not lose them (eg religious beliefs).
#. etc.
#. A lot more "etc"'s.

The preference inference problem

The problem with those preferences is not that they can't be resolved (even though philosophers continue to disagree on many of them). It's that it's really hard to resolve them in a principled way, from observation of base level human behaviour.

Look for example at the tension between revealed and stated preferences. CIRL models the human as knowing the true reward function and wanting to maximise it. Given that this is not the case, the AI could legitimately conclude that the revealed preferences are the true ones, OR that the stated preferences are true. There's nothing in the design, as far as I can tell, which would privilege one or the other. It all depends on the AI's model of human rationality -- a model which cannot be accurate because the underlying assumption, that humans properly know the true reward function, is incorrect.
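To make the ambiguity concrete, here is a minimal sketch, not CIRL itself. It assumes a Boltzmann (softmax) choice model with a separate reliability parameter for actions and for speech; the names `boltzmann`, `likelihood`, `HYP_REVEALED` and `HYP_STATED`, and all the numbers, are hypothetical. Two hypotheses that disagree about the true reward assign the same likelihood to a human who says they want to work but procrastinates, so the observations alone cannot choose between them.

```python
import math

def boltzmann(rewards, beta):
    """Softmax choice probabilities; beta is the assumed reliability.

    A negative beta models a channel that systematically points away
    from the reward (an illustrative assumption, not part of CIRL).
    """
    weights = {option: math.exp(beta * r) for option, r in rewards.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}

def likelihood(hypothesis, observations):
    """Probability of the observations under a (reward, reliability) pair."""
    p = 1.0
    for channel, option in observations:
        beta = hypothesis["beta"][channel]
        p *= boltzmann(hypothesis["reward"], beta)[option]
    return p

# The human says they want to work, but procrastinates.
observations = [("act", "procrastinate"), ("say", "work")]

# Hypothesis 1: revealed preferences are true, speech is unreliable.
HYP_REVEALED = {
    "reward": {"work": 0.0, "procrastinate": 1.0},
    "beta": {"act": 2.0, "say": -2.0},
}
# Hypothesis 2: stated preferences are true, action is unreliable.
HYP_STATED = {
    "reward": {"work": 1.0, "procrastinate": 0.0},
    "beta": {"act": -2.0, "say": 2.0},
}

print(likelihood(HYP_REVEALED, observations))
print(likelihood(HYP_STATED, observations))
```

The two printed likelihoods agree up to floating-point rounding: the choice between the hypotheses is made entirely by whichever rationality model the AI brings to the data.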

This and many other preference inconsistencies (such as how we allow our preferences to change over time) cannot easily be modelled by an AI -- or rather, we can't expect that the AI will choose to model them in the correct way.

Including the meta-preferences directly

One way of addressing the issue is to make the AI explicitly model human meta-preferences, and have a desire to follow them.

A rough model might be to have the human's revealed preferences as a basis. On top of that, humans have two categories of (first-order) meta-preferences:

#. Unendorsed factors
#. Decision desiderata

A factor is something that can cause a human to reach a particular decision. An unendorsed factor is one that the human does not desire to have (eg prejudices, bias because of worry, and so on).

Decision desiderata are properties that the human wants their decision process to have, but that it currently lacks. Short-term time-consistency is an example.

Note that the division between these two categories is not sharp -- time-inconsistency can be seen as using the unendorsed factor of time in the decisions.

Note also that by assuming that we start with a reward function, we're setting up a lot of unendorsed factors from the start: following a reward function implies certain assumptions of independence and transitivity that are typically absent in human decision making. To fit this, the AI will need to assume ridiculous alternate theories to make human decisions reward-tracking, and humans will unendorse a lot of these factors ("no, the position of Venus does not affect my cigarette affect!").
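The transitivity assumption can be illustrated with a small sketch. It assumes the observations are strict pairwise choices and checks whether any reward function could rank every chosen option above the rejected one, which holds exactly when the preference graph has no cycle. The function name `rationalisable` and the example options are hypothetical.

```python
def rationalisable(pairwise):
    """Return True if some reward function fits all strict pairwise choices.

    pairwise: iterable of (preferred, rejected) pairs.
    A fitting reward function exists iff the preference graph is acyclic;
    we test that with Kahn's topological-sort algorithm.
    """
    graph = {}
    for better, worse in pairwise:
        graph.setdefault(better, set()).add(worse)
        graph.setdefault(worse, set())

    indegree = {node: 0 for node in graph}
    for node in graph:
        for successor in graph[node]:
            indegree[successor] += 1

    ready = [node for node, degree in indegree.items() if degree == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for successor in graph[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)

    # Any node left unvisited sits on a cycle.
    return visited == len(graph)

consistent = [("tea", "coffee"), ("coffee", "water")]
cyclic = consistent + [("water", "tea")]
print(rationalisable(consistent), rationalisable(cyclic))
```

When the check fails, a reward-fitting AI has to explain the cycle with some extra factor, and that extra factor is a candidate for the human to unendorse.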

Are there sensible meta-meta-preferences? Yes there are. Consider, for instance, the idea that your decision should be independent of unchosen alternatives. That's to avoid situations like this one:

After finishing dinner, Sidney Morgenbesser decides to order dessert. The waitress tells him he has two choices: apple pie and blueberry pie. Sidney orders the apple pie. After a few minutes the waitress returns and says that they also have cherry pie at which point Morgenbesser says "In that case I'll have the blueberry pie."

Wanting to avoid those kinds of decisions is a meta-preference that many humans would endorse. But now consider what happens when you're at the end of a dinner, and, to avoid seeming greedy, you choose the second-largest piece of cake. A perfectly reasonable choice, but in conflict with the meta-preference. In this situation, a meta-meta-preference endorsing the base-level preference over the meta-preference is perfectly fine.
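As a rough sketch of why the meta-preference alone is not enough, the check below flags any choice rule whose pick among the original options changes when a new option is added. The helper `violates_independence` and the two choice rules are hypothetical stand-ins for the pie and cake examples, with cake slices represented by their sizes.

```python
def violates_independence(choose, menu, extra):
    """True if adding `extra` switches the choice between options
    that were already on the menu."""
    before = choose(menu)
    after = choose(menu | {extra})
    return after != extra and after != before

def morgenbesser(menu):
    """Picks blueberry only once cherry is also on offer."""
    if "cherry" in menu and "blueberry" in menu:
        return "blueberry"
    return "apple"

def second_largest(menu):
    """Picks the second-largest slice to avoid seeming greedy."""
    sizes = sorted(menu, reverse=True)
    return sizes[1] if len(sizes) > 1 else sizes[0]

print(violates_independence(morgenbesser, {"apple", "blueberry"}, "cherry"))
print(violates_independence(second_largest, {3, 5}, 7))
```

Both rules are flagged. The base-level check cannot tell the silly case from the reasonable one, which is the work left for the meta-meta-preference.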

Out of respect for Eliezer's argument, we'll assume that there are meta-meta-meta-preferences, but no more. We could go on to arbitrarily many levels, but there may be issues of convergence then.

An example

How could we do this in practice? Note that the problem of AI manipulating the formulation of preferences remains, unless our model is perfect. One way to avoid manipulations is to have the AI determine preferences from data it cannot influence -- maybe the human's behaviour in the last year. Then the AI will attempt to determine what the human's reward function was, rather than what it is.
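One simple reading of "data it cannot influence" is a time cutoff: only behaviour recorded before the AI could act counts as evidence. The sketch below assumes that reading; `Observation`, `uninfluenced` and `ai_start` are hypothetical names, and the one-year window follows the suggestion above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Observation:
    timestamp: datetime
    description: str

def uninfluenced(observations, ai_start, window=timedelta(days=365)):
    """Keep observations from the window that ends when the AI starts.

    Anything at or after `ai_start` is dropped, since the AI could have
    shaped it; anything older than the window is dropped as stale.
    """
    earliest = ai_start - window
    return [o for o in observations if earliest <= o.timestamp < ai_start]

ai_start = datetime(2017, 1, 1)
log = [
    Observation(datetime(2015, 6, 1), "too old"),
    Observation(datetime(2016, 6, 1), "inside the window"),
    Observation(datetime(2017, 2, 1), "after the AI started"),
]
print([o.description for o in uninfluenced(log, ai_start)])
```

A cutoff like this only removes direct influence on the data; it does nothing about manipulation through the questions the AI later asks.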

The reward function will be modelled as:

  • $R = R_r + B$, for $R_r$ the human's revealed preferences and $B$ a bias term.

In this simple model, the bias term $B$ is entirely determined by the human's higher order preferences. Something only counts as a bias if the human endorses it as a bias, at some level.

Now, we might want to weight preferences versus meta-preferences in some way -- a desire to avoid procrastination is good, but not if it reduces imagination too much. But it's not clear how such a weighting would work, so here I'm assuming that higher order preferences take precedence over all lower order preferences. Indeed, there could be a third order preference for imagination that overrules part of the second order preference for avoiding procrastination.
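A minimal sketch of this strict precedence, under some loud assumptions: revealed preferences are split into additive contributions from named factors, each meta-preference gives a verdict on one factor, and the bias term simply subtracts the contributions of factors whose final verdict is "unendorsed". The names `MetaPreference`, `resolve` and `corrected_reward` and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetaPreference:
    order: int      # 2 = meta-preference, 3 = meta-meta-preference, ...
    factor: str     # the decision factor being judged
    endorsed: bool  # False means the factor is flagged as a bias

def resolve(meta_prefs):
    """For each factor, the verdict of the highest order wins.

    Sorting is stable, so among verdicts of equal order the one listed
    last wins (an arbitrary tie-break chosen for this sketch).
    """
    verdicts = {}
    for mp in sorted(meta_prefs, key=lambda m: m.order):
        verdicts[mp.factor] = mp.endorsed  # higher orders overwrite lower
    return verdicts

def corrected_reward(revealed, contributions, meta_prefs):
    """Revealed reward plus a bias term that removes unendorsed factors.

    A factor with no verdict at any level is not treated as a bias.
    """
    verdicts = resolve(meta_prefs)
    bias = -sum(c for f, c in contributions.items() if verdicts.get(f) is False)
    return revealed + bias

prefs = [
    MetaPreference(2, "procrastination", False),
    MetaPreference(2, "daydreaming", False),
    MetaPreference(3, "daydreaming", True),  # imagination overrules
]
contributions = {"procrastination": 6, "daydreaming": 3}
print(corrected_reward(10, contributions, prefs))
```

Here the third-order preference for imagination rescues the daydreaming component, so only the procrastination component is removed from the revealed reward.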

The human is assumed to have a noisy knowledge of their meta-preferences. And this is where it gets tricky; see this post for some of the problems. Human behaviour (especially spoken behaviour) reveals some of our meta-preferences. Maybe the AI could work with what meta-preferences we would have explicitly endorsed during that year, if asked? This brings in the spectre of AI manipulation, because it can influence the endorsed meta-preference by the choice of questions.

As stated in this post, the better the AI's model of human irrationality -- the closer it can get to knowing correctly what a meta-preference is, what endorsement is, what human knowledge is -- the better. By allowing the AI to explicitly consider meta-preferences, we've improved its model; by improving the definition of meta-preferences, we can improve it still further.

The ultimate aim

What would be the goal of this? Ultimately, to find an AI capable of doing preference philosophy work better than humans -- but still doing human-appropriate work. We'd want an AI that, without knowledge of these considerations, would identify the "independence of unchosen alternatives" desideratum, come up with the second-largest dessert counterexample, and deduce correctly that humans would endorse that behaviour, accommodating it by adding more complexity to the model.

Ultimately, the AI might be able to solve the issue of changing human preferences, by formalising our meta-etc preferences over these changes.

Note that we're not aiming to "solve" human preferences in some abstract sense; we're aiming to find a preference system that's adequate for human survival and flourishing. Because of this, and because of the risk we might have lost an important element in the AI's formalism, we'll need the AI's preference extrapolation to be quite conservative, hewing close to the meta-preferences we feel strongest about.


Yup, including better models of human irrationality seems like a promising direction for CIRL. I've been writing up a short note on the subject with more explicit examples. If you want to work on this without duplicating effort, let me know and I'll share the rough draft with you.

Ok, send me the draft.