Research Agenda v0.9: Synthesising a human's preferences into a utility function

Wei Dai, 7y

So the first thing to do is to group the partial preferences together according to similarity (for example, preferences for concepts closely related in terms of webs of connotations should generally be grouped together), and generalise them in some regularised way. Generalise means, here, that they are transformed into full preferences, comparing all possible universes. [...] It seems that standard machine learning techniques should already be up to this task (with all the usual current problems).

I don't understand how this is even close to being possible today. For example I have some partial preferences that could generally be described as valuing the existence of positive conscious experiences, but I have no idea how to generalize this to full preferences, since I do not have a way to determine, given an arbitrary physical system, whether it contains a mind that is having a positive conscious experience. This seems like a very hard philosophical problem to solve, and I don't see how "standard machine learning techniques" could possibly be up to this task.

The way I would approach this problem is to say that humans seem to have a way of trying to generalize (e.g., figure out what we really mean by "positive conscious experience") by "doing philosophy" or "applying philosophical reasoning", and if we better understood what we're doing when we "do philosophy" then maybe we can program or teach an AI to do that. See Some Thoughts on Metaphilosophy where I wrote down some recent thoughts along these lines.

I'm curious to know what your thinking is here, in more detail.

Stuart_Armstrong, 7y

I'd say that this problem doesn't belong in section 2.3-2.4 (collecting and generalising preferences), but in section 1.2 (symbol grounding, and especially the web of connotations). That's where these questions should be solved, in my view.

So yeah, I agree that standard machine learning is not up to the task yet, at all.

(as a minor aside, I'm also a bit unsure how necessary it is to make partial preferences total before combining them; this may be unnecessary)

Vaniver, 7y

Overall, I was pretty impressed by this; there were several points where I thought "sure, that would be nice, but obstacle X," and then the next section brought up obstacle X.

I remain sort of unconvinced that utility functions are the right type signature for this sort of thing, but I do feel convinced that "we need some sort of formal synthesis process, and a possible end product of that is a utility function."

That is, most of the arguments I see for 'how a utility function could work' go through some twisted steps. Suppose I'm trying to build a robot, and I want it to be corrigible, and I have a corrigibility detector whose type is 'decision process' to 'score'. I need to wrap that detector with a 'world state' to 'decision process' function and a 'score' to 'utility' function, and then I can hand it off to a robot that does a 'decision process' to 'world state' prediction and optimizes utility. If the robot's predictive abilities are superhuman, it can trace out whatever weird dependencies I couldn't see; if they're imperfect, then each new transformation provides another opportunity for errors to creep in. And it may be the case that this is a core part of reflective stability (because if you map through world-histories you bring objective reality into things in a way that will be asymptotically stable with increasing intelligence) that doesn't have another replacement.
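The chain of type conversions described above can be made concrete with a toy sketch. Everything in it is a hypothetical stand-in chosen for illustration (the `WorldState` fields, the string actions, the factor of 10 in `score_to_utility`); none of it comes from the agenda itself. It only shows where each wrapper sits, and therefore where each extra transformation could introduce an error.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorldState:
    """Hypothetical world state: records only whether the agent halts when asked."""
    halts_when_asked: bool


# Hypothetical stand-in types for the signatures named in the comment.
DecisionProcess = Callable[[bool], str]  # "was shutdown requested?" -> action
Score = float
Utility = float


def corrigibility_detector(process: DecisionProcess) -> Score:
    """'decision process' -> 'score': toy detector that checks one behaviour."""
    return 1.0 if process(True) == "halt" else 0.0


def extract_process(state: WorldState) -> DecisionProcess:
    """Wrapper 1, 'world state' -> 'decision process'."""
    if state.halts_when_asked:
        return lambda requested: "halt" if requested else "continue"
    return lambda requested: "continue"


def score_to_utility(score: Score) -> Utility:
    """Wrapper 2, 'score' -> 'utility' (the scaling is an arbitrary assumption)."""
    return 10.0 * score


def utility(state: WorldState) -> Utility:
    """The composed utility function over world states."""
    return score_to_utility(corrigibility_detector(extract_process(state)))


def predict(process: DecisionProcess) -> WorldState:
    """The robot's 'decision process' -> 'world state' prediction (toy, exact)."""
    return WorldState(halts_when_asked=process(True) == "halt")


def choose(candidates: list[DecisionProcess]) -> DecisionProcess:
    """Pick the candidate whose predicted world state has the highest utility."""
    return max(candidates, key=lambda p: utility(predict(p)))
```

In this toy version `predict` and `extract_process` are exact inverses, so nothing is lost on the round trip; the worry in the comment is precisely about cases where they are not.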

I do find myself worrying that embedded agency will require dropping utility functions in a deep way that ends up connected to whether or not this agenda will work (or which parts of it will work), but remain optimistic that you'll find out something useful along the way and have that sort of obstacle in mind as you're working on it.

Charlie Steiner, 7y

This is really similar to some stuff I've been thinking about, so I'll be writing up a longer comment with more compare/contrast later.

But one thing really stood out to me - I think one can go farther in grappling with and taking advantage of "where $U_{H}$ lives." $U_{H}$ doesn't live inside the human, it lives in the AI's model of the human. Humans aren't idealized agents, they're clusters of atoms, which means they don't have preferences except after the sort of coarse-graining procedure you describe, and this coarse-graining procedure lives with a particular model of the human - it's not inherent in the atoms.

This means that once you've specified a value learning procedure and human model, there is no residual "actual preferences" the AI can check itself against. The challenge was never to access our "actual preferences," it was always to make a best effort to model humans as they want to be modeled. This is deeply counterintuitive ("What do you mean, the AI isn't going to learn what humans' actual preferences are?!"), but also liberating and motivating.

Stuart_Armstrong, 7y

One of the reasons I refer to synthesising (or constructing) the $U_{H}$, not learning it.

Charlie Steiner, 7y

Now that I think about it, it's a pretty big PR problem if I have to start every explanation of my value learning scheme with "humans don't have actual preferences so the AI is just going to try to learn something adequate." Maybe I should figure out a system of jargon such that I can say, in jargon, that the AI is learning people's actual preferences, and it will correspond to what laypeople actually want from value learning.

I'm not sure whether such jargon would make actual technical thinking harder, though.

Stuart_Armstrong, 7y

"humans don't have actual preferences so the AI is just going to try to learn something adequate."

Try something like: humans don't have actual consistent preferences, so the AI is going to try and find a good approximation that covers all the contradictions and uncertainties in human preferences.

Rohin Shah, 6y

For example, imagine that the AI, for example, extinguished all meaningful human interactions because these can sometimes be painful and the AI knows that we prefer to avoid pain. But it's clear to us that most people's partial preferences will not endorse total loneliness as good outcome; if it's clear to us, then it's a fortiori clear to a very intelligent AI; hence the AI will avoid that failure scenario.

I don't understand this. My understanding is that you are proposing that we build a custom preference inference and synthesis algorithm, that's separate from the AI. This produces a utility function that is then fed into the AI. But if this is the case, then you can't use the AI's intelligence to argue that the synthesis algorithm will work well, since they are separate.

Perhaps you do intend for the synthesis algorithm to be part of "the AI"? If so, can you say more about how that works? What assumptions about the AI do you need to be true?

Stuart_Armstrong, 6y

That bit was on the value of approximating the ideal. Having a smart AI and a version of $U_{H}$, even an approximate one, can lead to much better outcomes than the default - at least, that's the argument of that section.

(PS: I edited that first sentence to remove the double "for example").

Rohin Shah, 6y

Oh, I see, so the argument is that conditional on the idealized synthesis algorithm being a good definition of human preferences, the AI can approximate the synthesis algorithm, and whatever utility function it comes up with and optimizes should not have any human-identifiable problems. That makes sense. Followup questions:

How do you tell the AI system to optimize for "what the idealized synthesis algorithm would do"?
How can we be confident that the idealized synthesis algorithm actually captures what we care about?

Stuart_Armstrong, 6y

How do you tell the AI system to optimize for "what the idealized synthesis algorithm would do"?

Here's a grounding (see 1.2), here's a definition of what we want, here are valid ways of approximating it :-) Basically at that point, it's kind of like approximating full Bayesian updating for an exceedingly complex system.

How can we be confident that the idealized synthesis algorithm actually captures what we care about?

Much of the rest of the research agenda is arguing that this is the case. See especially section 2.8.

But for both these points, more research is still needed.

Raemon, 6y

Curated. I found this to be a useful, comprehensive writeup. It was reasonably accessible to me as a layman, while giving a good overview of much of Stuart's past writeups. While I haven't verified this, my sense is it gave me good hooks into technical things if I wanted to dive into them.

Stuart_Armstrong, 6y

Thanks!

romeostevensit, 7y

It seems like meta-preferences take into account the lack of self-knowledge of the utility function pretty well. It throws flags on maximizing and tries to move slower/collect more data when it recognizes it is in a tail of its current trade-off model, i.e. it has a 'good enough' self-model of its own update process.

Vaniver, 7y

Fixed a typo.

Evan R. Murphy, 4y

This is an impressive piece of work and I'm excited about your agenda.

And maybe, in that situation, if we are confident that $U_{H}$ is pretty safe, we'd want the AI to subtly manipulate the human's preferences towards it.

Can you elaborate on this? Why would we want to manipulate the human's preferences?

Stuart_Armstrong, 4y

Because our preferences are inconsistent, and if an AI says "your true preferences are $U_{H}$", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are $U_{H}^{'}$, which are different in subtle ways".

Evan R. Murphy, 4y*

So the subtle manipulation is to compensate for those rebellious impulses making $U_{H}$ unstable?

Why not just let the human have those moments and alter their $U_{H}$ if that's what they think they want? Over time, they may then learn that being capricious with their AI doesn't ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and have less trust in the AI.

A partial preference being a preference where the human considers only a small part of the variables describing the universe; see Section 1.1. ↩︎
Actually, this specific problem is not included directly in the research agenda, though see Section 4.3. ↩︎
Likely but not certain: we don't know how effective AIs might become at computing counterfactuals or modelling humans. ↩︎
It makes sense to allow partial preferences to contrast a small number of situations, rather than just two. So "when it comes to watching superhero movies, I'd prefer to watch them with Alan, but Beth will do, and definitely not with Carol". Since partial preferences with $n$ situations can be built out of a smaller number of partial preferences with two situations, allowing more situations is a useful practical move, but doesn't change the theory. ↩︎
"One-step" refers to hypotheticals that can be removed from the human's immediate experience ("Imagine that you and your family are in space...") but not very far removed (so no need for lengthy descriptions that could sway the human's opinions by hearing them). ↩︎
Equivalently to reducing the weight, we could increase uncertainty about the partial preference, given the unfamiliarity. There are many options for formalisms that lead to the same outcome. Though note that here, we are imposing a penalty (low weight/high uncertainty) for unfamiliarity, whereas the actual human might have incredibly strong internal certainty in their preferences. It's important to distinguish assumptions that the synthesis process makes, from assumptions that the human might make. ↩︎
Extreme situations are also situations where we have to be very careful to ensure the AI has the right model of all preference possibilities. The flaws of an incorrect model can be corrected by enough data, but when data is sparse and unreliable, then model assumptions - including the prior - tend to dominate the result. ↩︎
"Natural" does not, of course, mean any of "healthy", "traditional", or "non-polluting". However those using the term "natural" are often assuming all of those. ↩︎
The human's meta-preferences are also relevant to this. It might be that, whenever asked about this particular contradiction, the human would answer one way. Therefore $H$'s conditional meta-preferences may contain ways of resolving these contradictions, at least if the meta-preferences have high weight and the preferences have low weight.

Conditional meta-preferences can be tricky, though, as we don't want them to allow the synthesis to get around the one-step hypotheticals restriction. An "if a long theory sounds convincing to me, I want to believe it" meta-preference would in practice do away with these restrictions. That particular meta-preference might be cancelled out by the ability of many different theories to sound convincing. ↩︎
We can allow meta-preferences to determine a lot more of their own synthesis if we find an appropriate method that a) always reaches a synthesis, and b) doesn't artificially boost some preferences through a feedback effect. ↩︎
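As an illustration of the footnote above on partial preferences that contrast more than two situations, here is a minimal sketch of how a ranking such as the Alan/Beth/Carol example could be unpacked into two-situation preferences. It assumes the ranking is strict and transitive, which the footnote does not state, and the function name `pairwise_preferences` is a hypothetical label rather than anything from the agenda.

```python
from itertools import combinations


def pairwise_preferences(ranking):
    """Return every (preferred, less_preferred) pair implied by a ranking.

    `ranking` lists situations from most to least preferred. The ranking is
    assumed to be strict and transitive (an assumption of this sketch).
    """
    # combinations() keeps the input order, so the first element of each
    # pair always sits higher in the ranking than the second.
    return list(combinations(ranking, 2))


# The superhero-movie example: Alan preferred, Beth will do, Carol last.
prefs = pairwise_preferences(["Alan", "Beth", "Carol"])
```

Each resulting pair is an ordinary two-situation partial preference, so the synthesis machinery would not need a separate case for longer rankings.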

AI ALIGNMENT FORUM

0 The fundamental idea

0.1 Executive summary: synthesis process

0.2 Executive summary: agenda difficulty and value

0.3 Executive aside: the value of approximating the theory

0.4 An inspiring just-so story

1 The partial preferences of a human

1.1 Partial models, partial preferences

1.2 Symbol grounding

1.3 Which (real and hypothetical) partial models?

2 Synthesising the preference utility function

2.1 What sort of utility function?

2.2 Why a utility function?

2.3 Extending and normalising partial preferences

2.4 Synthesising the preference function: first step

2.5 Identity preferences

2.6 Synthesising the preference function: meta-preferences

2.7 Synthesising the preference function: meta-preference about synthesis

2.8 Avoiding disasters, and global meta-preferences

2.9 How much to delegate to the process

3 $U_{H}$ in practice

3.1 Synthesis of $U_{H}$ in practice

3.2 (Avoiding) uncertainty and manipulative learning

3.3 Principled patching of other methods

3.4 Simplified $U_{H}$ sufficient for many methods

3.5 Applying the intuitions behind $U_{H}$ to analysing other situations

4 Limits of the method

4.1 Utility at one point in time

4.2 Not a philosophical ideal

4.3 Individual utility versus common utility

4.4 Synthesising $U_{H}$ rather than discovering it (moral anti-realism)

4.5 Self-referential contradictory preferences

4.6 The question of identity and change

4.7 Other Issues not addressed
