I’ve shown that we cannot deduce the preferences of a potentially irrational agent. Even simplicity priors don’t help. We need to make extra ‘normative’ assumptions in order to be able to say anything about these preferences.

I then presented a more intuitive example, in which Alice was playing poker and had two possible beliefs about Bob’s hand and two possible preferences: wanting money, or wanting Bob (which, in that situation, translated into wanting to lose to Bob).

That example illustrated the impossibility result, within the narrow confines of that situation – if Alice calls, she could be a money-maximiser expecting to win, or a love-maximiser expecting to lose.
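The ambiguity can be made concrete with a minimal Python sketch. It assumes a hypothetical two-action game (call or fold) in which calling leads to whatever outcome Alice expects and folding is neutral; the payoff numbers and all names below are illustrative assumptions, not taken from the original example.

```python
from itertools import product

BELIEFS = ("expects_to_win", "expects_to_lose")
PREFERENCES = ("money", "love")
ACTIONS = ("call", "fold")


def utility(action, belief, preference):
    """Hypothetical payoffs: folding is neutral, calling yields the expected outcome."""
    if action == "fold":
        return 0
    wins = belief == "expects_to_win"
    if preference == "money":
        return 1 if wins else -1
    # A love-maximiser wants to lose to Bob.
    return -1 if wins else 1


def best_action(belief, preference):
    """Action a utility-maximiser with this belief and preference would pick."""
    return max(ACTIONS, key=lambda a: utility(a, belief, preference))


def explanations_for(observed_action):
    """All (belief, preference) pairs consistent with the observed action."""
    return [
        (b, p)
        for b, p in product(BELIEFS, PREFERENCES)
        if best_action(b, p) == observed_action
    ]


print(explanations_for("call"))
```

Under these assumptions, observing a call leaves two explanations standing: a money-maximiser expecting to win and a love-maximiser expecting to lose.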

As has been pointed out, this uncertainty doesn’t really persist if we move beyond the initial situation. If Alice was motivated by love or money, we would expect to be able to tell which one, by seeing what she does in other situations – how does she respond to Bob’s flirtations, what does she confess to her closest friends, how does she act if she catches a peek of Bob’s cards, etc…

So if we look at her more general behaviour, it seems that we have two possible versions of Alice: first, money-Alice, who clearly wants money, and second, Bob-Alice, who clearly wants Bob. The actions of these two agents match up in the specific case I described, but not in general. Doesn’t this undermine my claim that we can’t tell the preferences of an agent from their actions?

What’s actually happening here is that we’re already making a lot of extra assumptions when we’re interpreting money-Alice’s or Bob-Alice’s actions. We model other humans in very specific and narrow ways, and other humans do the same – and their models are very similar to ours (consider how often humans agree that another human is angry, or that being drunk impairs rationality). The agreement isn’t perfect, but is much better than random.

If we set those assumptions aside, then we can see what the theorem implies. There is a possible agent whose preference is for love, but that nevertheless acts identically to money-Alice (and, in reverse, a possible money-loving agent that acts identically to Bob-Alice). These two agents are perfectly plausible – they just aren’t ‘human’ according to our models of what being human means.
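One way to see how such agents can exist is a minimal sketch under the assumption that an agent can be split into a planner and a reward function (a framing this post does not spell out). The states, actions and reward table below are hypothetical. A fully rational planner given a reward, and a fully anti-rational planner given the negated reward, produce exactly the same policy.

```python
STATES = ("s0", "s1")
ACTIONS = ("a", "b")

# Hypothetical reward table, chosen only for illustration.
REWARD_TABLE = {
    ("s0", "a"): 1.0,
    ("s0", "b"): 0.0,
    ("s1", "a"): -2.0,
    ("s1", "b"): 3.0,
}


def reward(state, action):
    return REWARD_TABLE[(state, action)]


def negated_reward(state, action):
    return -reward(state, action)


def rational_planner(r):
    """Pick the reward-maximising action in every state."""
    return {s: max(ACTIONS, key=lambda a: r(s, a)) for s in STATES}


def anti_rational_planner(r):
    """Pick the reward-minimising action in every state."""
    return {s: min(ACTIONS, key=lambda a: r(s, a)) for s in STATES}


policy_1 = rational_planner(reward)
policy_2 = anti_rational_planner(negated_reward)
print(policy_1 == policy_2)
```

Since the two policies coincide, behaviour alone cannot say which reward the agent holds; the choice between the two decompositions has to come from an extra assumption.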

It’s because of this that I’m somewhat optimistic we can solve the value learning problem, and why I often say the problem is “impossible in theory, but doable in practice”. Humans make a whole host of assumptions that allow them to interpret the preferences of other humans (and of themselves). And these assumptions are quite similar from human to human. So we don’t need to solve the value learning problem in some principled way, nor figure out the necessary assumptions abstractly. Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process (and then resolve all the contradictions within human values, but that seems doable if messy).


Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process

Okay, but how do you do that if you don't already have a value learning algorithm? Why is it easier to learn the algorithms/parameters humans use in inferring each other's values, than to just learn their values?

Because once we have these parameters, we can learn the values of any given human. In contrast, if we learn the values of a given human, we don't get to learn the values of any other one.

I'd argue further: these parameters form part of a definition of human values. We can't just "learn human values", as these don't exist in the world. Whereas "learn what humans model each other's values (and rationality) to be" is something that makes sense in the world.

Because once we have these parameters, we can learn the values of any given human.

This doesn't make the problem easier, you have to start somewhere. I agree this could reduce the total computational work required but it doesn't seem any easier conceptually.

Whereas “learn what humans model each other’s values (and rationality) to be” is something that makes sense in the world.

This has the same problem as value learning. If I think you have X values but you actually have Y values (and I would think you have Y values upon further reflection etc) then a solution that learns the model that causes me to think you have X values is insufficient; to get that the right answer is Y it also has to know that I am a bounded agent, and how I am bounded.

What do you mean by "you actually have Y values"? What are you defining values to be?

I don't know, but a pseudo-definition that works sometimes is "upon having a lot of time to reflect, information, etc, I would conclude that you have Y values"; of course I can't use this definition when I am doing the reflection, though! "Values" is at the moment a pre-formal concept (utility theory doesn't directly apply to humans), so it has some representation in people's brains that is hard to extract/formalize.

In any case, I reject any AI design that concludes that it ought to act as if you have X values just because my current models imply that you have X values, since there are many ways I could be wrong about such a judgment, like having bad information and incoherent philosophy.

We're getting close to something important here, so I'll try and sort things out carefully.

In my current approach, I'm doing two things:

  1. Finding some components of preferences or proto-preferences within the human brain.

  2. Synthesising them together in a way that also respects (proto-)meta-preferences.

The first step is needed because of the No Free Lunch in preference learning result. We need to have some definition of preferences that isn't behavioural. And the stated-values-after-reflection approach has some specific problems that I listed here.

Then I took an initial stab at how one could synthesise the preferences in this post.

If I'm reading you correctly, your main fear is that by focusing on the proto-preferences of the moment, we might end up in a terrible place, foreclosing moral improvements. I share that fear! That's why the process synthesises values in accordance with both meta-preferences and "far" preferences ("I want everyone to live happy worthwhile lives" is a perfectly valid proto-preference).

Where we might differ the most is that I'm very reluctant to throw away any proto-preference, even if our meta-preferences would typically overrule it. I would prefer to keep it around, with a very low weight. Once we get in the habit of ditching proto-preferences, there's no telling where that process might end up.
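As a rough illustration of "keep it around, with a very low weight", here is a minimal Python sketch. The function name, the scaling factor, and the example proto-preference labelled "momentary_impulse" are all hypothetical; this is not a proposed synthesis procedure, only a picture of down-weighting versus deleting.

```python
def synthesise_weights(proto_preferences, overruled, low_weight=0.01):
    """Return normalised weights, shrinking overruled proto-preferences
    by `low_weight` instead of removing them."""
    raw = {
        name: (weight * low_weight if name in overruled else weight)
        for name, weight in proto_preferences.items()
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


weights = synthesise_weights(
    {"everyone_lives_happy_worthwhile_lives": 1.0, "momentary_impulse": 1.0},
    overruled={"momentary_impulse"},
)
print(weights)
```

The overruled proto-preference still appears in the result with a small positive weight, so it is never silently dropped.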

The overall approach of finding proto-preferences and meta-preferences, resolving them somehow, then extrapolating from there, seems like a reasonable thing to do.

But, suppose you're going to do this. Then you're going to run into a problem: proto-preferences aren't identifiable.

I interpreted you as trying to fix this problem by looking at how humans infer each other's preferences rather than their (proto-)preferences themselves. You could try learning people's proto-preference-learning-algorithms instead of their proto-preferences.

But, this is not an easier problem. Proto-preference-learning-algorithms are just as unidentifiable as proto-preferences.

So I currently continue to strongly object to the following two sentences in the OP: "So we don’t need to solve the value learning problem in some principled way, nor figure out the necessary assumptions abstractly. Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process". According to my current view, the second sentence, if true, is misleading because extracting the normative assumptions humans make is no easier than extracting proto-preferences. Do you still endorse these sentences? If you do, what interpretation of them resolves my objection?

Then you're going to run into a problem: proto-preferences aren't identifiable.

I interpreted you as trying to fix this problem by looking at how humans infer each other's preferences...

The proto-preferences are a definition of the components that make up preferences. Methods of figuring them out - be they stated preferences, revealed preferences, fMRI machines, how other people infer each other's preferences... - are just methods. The advantage of having a definition is that it guides us explicitly as to when a specific method for figuring them out ceases to be applicable.

And I'd argue that proto-preferences are identifiable. We're talking about figuring out how humans model their own situations, and the better-worse judgements they assign in their internal models. This is not unidentifiable, and neuroscience already has some things to say on it. The previous Alice post showed how you could do it in a toy model (with my posts on semantics and symbol grounding being relevant to applying this approach to humans).

That second sentence of mine is somewhat poorly phrased, but I agree that "extracting the normative assumptions humans make is no easier than extracting proto-preferences" - I just don't see that second one as being insoluble.

I'm pretty confused by what you mean by proto-preferences. I thought by proto-preferences you meant something like "preferences in the moment, not subject to reflection etc." But you also said there's a definition. What's the definition? (The concept is pre-formal, I don't think you'll be able to provide a satisfactory definition).

You have written a paper about how preferences are not identifiable. Why, then, do you say that proto-preferences are identifiable, if they are just preferences in the moment? The impossibility results apply word-for-word to this case. If you have an algorithm for identifying them, what is it?

What, specifically, has neuroscience said about this that would let anyone even define what it means for a given brain to have a given set of proto-preferences?

(I don't know what you mean by "previous Alice post"; regardless, if you're claiming to have worked out an algorithm that infers people's proto-preferences pretty well given empirical data, I don't believe you. The posts on semantics and symbol grounding seem like gesturing in the direction of something that could someday form a solution, with multiple reformulations being necessary along the way; this is nowhere close to an actual solution.)

Oh, I don't claim to have a full definition yet, but I believe it's better than pre-formal. Here would be my current definition:

  • Humans are partially model-based agents. We often generate models (or at least partial models) of situations (real or hypothetical), and, within those models, label certain actions/outcomes/possibilities as better or worse than others (or sometimes just generically "good" or "bad"). This model, along with the label, is what I'd call a proto-preference (or pre-preference).

That's why neuroscience is relevant: for identifying the mental models humans use. The "previous Alice post" I mentioned is here, and was a toy version of this, in the case of an algorithm rather than a human. The reason these get around the No Free Lunch theorem is that they look inside the algorithm (so different algorithms with the same policy can be seen to have different preferences, which breaks NFL), and make the "normative assumption" that these modelled proto-preferences correspond (modulo preference synthesis) to the agent's actual preferences.

Note that that definition puts preferences and meta-preferences into the same type, the only difference being the sort of model being considered.

Ok, this seems usefully specific. A few concerns:

  1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined) along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take different information into account.

  2. It seems hard for the AI to do significantly better than I could do by, say, controlling the robot. For example, if my ontology about engineering is wrong (in a way that prevents me from inventing nanotech), then the AI is going to also be wrong about engineering in the same way, if it copies my map-building and valence-tagging algorithms, or just my maps and valence tags. (If it doesn't copy my maps, then how does it translate my values about my maps to its values about its maps?)

  3. Related, if the AI uses my models in ways that subject them to more weird edge cases than I would (e.g. by searching over more actions), then they're going to give bad answers pretty often.

  4. Also related, these models are embedded in reality; they don't have all that much meaning except relative to the process that builds and interprets them, which includes my senses, my pattern-recognizers, my reflexes, my tools, my social context, etc. Presumably the AI is going to replace my infrastructure with different infrastructure, but then why would we expect my models to keep working? I'm not sure what would happen if someone with my models woke up with very different sense inputs, actuators, and environment.

  5. Perhaps most concerningly, if you asked a few neuroscientists and cognitive scientists "can we do this / will we be able to do this in 10 years", I predict they would mostly say "no, our models and data gathering procedures aren't actually good enough to do this, and aren't improving super fast either". (Note that you haven't yet named specific neuroscience techniques for identifying humans' models, so the statement that neuroscience has things to say about this seems empty). So a bunch of original cognitive science/neuroscience research is going to have to get done here, in addition to much better data gathering and inference procedures for actually looking inside humans' algorithms.

  6. There's still an unidentifiability issue in that you need assumptions about which things are "my models" and "my valence tags". These things, at the moment, do not have rigorous definitions. For example, if I am modelling you (and therefore running a small copy of you in my brain), then probably my model of you also has models and valence tags, yet these aren't my models and valence tags (for the purposes of inferring my preferences). You'd also need to make decisions about the extent to which e.g. reflexes are embodying values. So there are a bunch of modelling choices required, which could be made with cognitive science models that are much, much better than those available right now.

That said, this does seem to be the value learning approach I am most optimistic about right now.

That said, this does seem to be the value learning approach I am most optimistic about right now.

Thanks! I'm not sure I fully get all your concerns, but I'll try and answer to the best of my understanding.

1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model "If someone is drowning, I should help them (if it's an easy thing to do)". Then "someone", "drowning", "I", and "help them" are vague labels for complex categories (as are most of the rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having strong preferences in the central examples of the categories, which become much weaker as we move to the edges (the meta-preferences will start to become very relevant in the edge cases). I expect that "I should help them" further decomposes into "they should be helped" and "I should get the credit for helping them".

Therefore, it seems to me that an AI should be able to establish that if someone is drowning, it should try to enable me to save them, and if it can't do that, then it should save them itself (using nanotechnology or anything else). It doesn't seem that it would be seeing the issue from my narrow perspective, because I don't see the issue just from my narrow perspective.

5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; that is mainly the "science fiction superpowers" end of the spectrum for the moment.

What I'm hoping, with this technique, is that if we end up using indirect normativity or stated preferences, then by keeping in mind this model of what proto-preferences are, we can better automate the limitations of these techniques (e.g. when we expect lying), rather than putting them in by hand.

6: Currently I don't see reflexes as embodying values at all. However, people's attitudes towards their own reflexes are valid meta-preferences.

You could imagine examining a human brain and seeing how it models other humans. This would let you get some normative assumptions out that could inform a value learning technique.

I would think of this as extracting an algorithm that could infer human preferences out of a human brain. You could run this algorithm for a long time, in which case it would eventually output Y values, even if you would currently judge the person as having X values.

I agree that the fact that humans are quite good at inferring preferences should give us optimism about value learning. In the framework of rationality with a mistake model, I interpret this post as trying to infer the mistake model from the way that humans infer preferences about other humans. I'm not sure whether this sidesteps the impossibility result, but it seems plausible that it does.

What would be the source of data for learning a mistake model? It seems like we have to make some assumption about how the data source leads to a mistake model, since probably the data source is going to be a subset of the full human policy, and the impossibility result already allows you to have access to the full human policy.

In the example in https://www.lesswrong.com/posts/rcXaY3FgoobMkH2jc/figuring-out-what-alice-wants-part-ii , I give examples of two algorithms with the same outputs but where we would attribute different preferences to them. This sidesteps the impossibility result, since it allows us to consider extra information, namely the internal structure of the algorithm, in a way relevant to value-computing.
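A minimal Python sketch of that idea, not the construction from the linked post: two hypothetical agents whose internal models differ (what they predict an action leads to, and how they label outcomes as better or worse) but whose outputs are identical. All class, field and outcome names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    predicted_outcome: dict  # action -> outcome, in the agent's internal model
    valence: dict            # outcome -> better/worse score in that model

    def act(self):
        """Choose the action whose predicted outcome is labelled best."""
        return max(
            self.predicted_outcome,
            key=lambda a: self.valence[self.predicted_outcome[a]],
        )


def top_valued_outcome(agent):
    """Read the preference off the internal model rather than off behaviour."""
    return max(agent.valence, key=agent.valence.get)


money_agent = Agent(
    predicted_outcome={"call": "beat_bob", "fold": "nothing"},
    valence={"beat_bob": 1, "lose_to_bob": -1, "nothing": 0},
)
love_agent = Agent(
    predicted_outcome={"call": "lose_to_bob", "fold": "nothing"},
    valence={"beat_bob": -1, "lose_to_bob": 1, "nothing": 0},
)

print(money_agent.act(), love_agent.act())
print(top_valued_outcome(money_agent), top_valued_outcome(love_agent))
```

Both agents call, so their behaviour cannot separate them, but inspecting the internal valence labels does. Treating those labels as the agent's actual preferences is itself the normative assumption.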