Deconfusing Human Values Research Agenda v1

Planned summary for the Alignment Newsletter:

This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn't verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author's best guess is that human values are captured by valence, as modeled by minimization of prediction error.

Planned opinion:

This is similar to the argument in <@Why we need a *theory* of human values@>, and my opinion remains roughly the same: I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system. As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).

Gordon Seidoh Worley · 6y

Yep, agree with the summary.

I'll push back on your opinion a little bit here as if it were just a regular LW comment on the post.

I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system.

This is a reasonable hope, but I generally think hope is dangerous when it comes to existential risks, so I'm moved to pursue this line of research because I believe it to be neglected, I believe it's likely enough to be useful to building aligned AI to be worth pursuing, and I would rather us have explored it thoroughly and ended up not needing it than have not explored it and end up needing to have. I also don't think it takes much away from other AI safety research, since the skills needed to work on this problem are somewhat different from those needed to address other AI safety problems (or so I think), so I mostly think we can pursue it for a fairly low opportunity cost.

As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).

I expect we disagree about how robust Goodhart problems are: I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned, and that the level of alignment you are talking about only works because of a lack of optimization power. I suspect that at the level of measurement you're talking about where you can infer alignment from observed behavior there is too much room for error between the measure and the target such that deviance is basically guaranteed.

Thankfully I know others are working on ways to engineer us around Goodhart problems, and maybe these solutions will be robust enough to work over such large measurement gaps, but again I am perhaps more conservative here and want to make the gap between the measure and the target much smaller so that we can effectively get "under" Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.
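To make the "gap between the measure and the target" concrete, here is a minimal toy sketch of one variant of Goodhart's law (the regressional kind). Everything in it is an assumption for illustration, not a model taken from the post: true values are standard Gaussian, the measure is the true value plus independent Gaussian noise, and "optimization power" is just the number of candidates searched before picking the one with the best measured score. The function name `goodhart_gap` and its parameters are hypothetical.

```python
import random
import statistics


def goodhart_gap(n_candidates, noise_sd, trials=300, seed=0):
    """Average overestimate (measure minus true value) of the selected candidate.

    Toy assumptions: true values ~ N(0, 1), measure = true value + N(0, noise_sd).
    Each trial draws n_candidates options and keeps the one with the highest
    measure, which stands in for optimizing harder against the measure.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        best_measure = float("-inf")
        best_true = 0.0
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            measure = true_value + rng.gauss(0.0, noise_sd)
            if measure > best_measure:
                best_measure, best_true = measure, true_value
        gaps.append(best_measure - best_true)
    return statistics.mean(gaps)


if __name__ == "__main__":
    # Compare weak vs. strong search, and a noisy vs. a tight measure.
    for n, sd in [(10, 1.0), (1000, 1.0), (1000, 0.1)]:
        print(f"candidates={n:5d} noise_sd={sd}: gap={goodhart_gap(n, sd):.3f}")
```

Under these assumptions the selected option's measured score overstates its true value by more as the search gets wider, and by much less when the measurement noise is small. That only illustrates the shape of the worry; it says nothing about how large the gap is for real value learning, which is the actual point of disagreement.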

This is a reasonable hope, but I generally think hope is dangerous when it comes to existential risks

When I say "hope", I mean "it is reasonably likely that the research we do pans out and leads to a knowably-aligned AI system", not "we will look at the AI system's behavior, pull a risk estimate out of nowhere, and then proceed to deploy it anyway".

In this sense, literally all AI risk research is based on hope, since no existing AI risk research knowably will lead to us building an aligned AI system.

I'm moved to pursue this line of research because I believe it to be neglected, I believe it's likely enough to be useful to building aligned AI to be worth pursuing, and I would rather us have explored it thoroughly and ended up not needing it than have not explored it and end up needing to have.

This is all reasonable; most of it can be said about most AI risk research. The main distinguishing feature between different kinds of technical AI risk research is:

it's likely enough to be useful to building aligned AI to be worth pursuing

So that's the part you'd have to argue for to convince me (but also it would be reasonable not to bother).

I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned

Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?

Perhaps you think AI systems will be different in kind to your friends, in which case see next point.

I suspect that at the level of measurement you're talking about where you can infer alignment from observed behavior there is too much room for error between the measure and the target such that deviance is basically guaranteed.

Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.

I am perhaps more conservative here and want to make the gap between the measure and the target much smaller so that we can effectively get "under" Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.

It's not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g., it's much easier for me to model humans than to model evolution.

Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?

I expect it to be net negative. My model is something like this: humans are not very agentic (able to reliably achieve or optimize for a goal) in absolute terms, even though we may feel as though humans are especially agentic relative to other systems. Because humans bumble a lot, they don't tend to have a lot of impact; things work out well or poorly on average as a result of lots of moves that cancel each other out, leaving only a small gain or loss in valued outcomes in the end. A 10x smarter human would be more agentic, and if they are not exactly right about how to do good, they could more easily do harm that would normally be buffered by their ineffectiveness.

I build this intuition from, for example, the way dictators often screw things up even when they are well intentioned because they now have more power to achieve their goals and it amplifies their mistakes and misunderstandings in ways that cause more impact, more variance, and historically worse outcomes than less agentic methods of leadership.

Although this is not a perfect analogy, because 10x smarter is not just 10x more powerful/agentic but also 10x better able to think through consequences (which the dictators lack), I also think the orthogonality thesis is robust enough that it seems more likely to me that being 10x smarter will not bring an ability to think through consequences that perfectly offsets the risks of greater agency.

Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.

Exactly, because you can't infer alignment from observed behavior without normative assumptions. I'm saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.

It's not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g., it's much easier for me to model humans than to model evolution.

It's definitely harder. That's a reasonable consideration when we're trying to engineer a system that will be good enough while racing against the clock, and I think it's quite reasonable, for example, that we're going to try to tackle value alignment via extensions to narrow value learning approaches first because that's easier to build. But I also think those approaches will fail and so I'm looking ahead to where I see the limits of our knowledge for what we'll have to do conditioned on this bet I'm making that value learning approaches similar in kind to those we're trying now won't produce aligned AIs.

I expect it to be net negative.

Man, I do not share that intuition.

I'd be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren't well-intentioned or 2. they didn't have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).

I'm saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.

I know you're saying that, I just don't see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That's fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

It's definitely harder.

This is an assertion, not an argument.

Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you've learned by looking at humans), relative to building a model by looking at humans (and using no information you've learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.

I've made my case for that here.

I'd be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren't well-intentioned or 2. they didn't have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).

Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

Joseph Stalin's collectivization of farms
Tokugawa Iemitsu's closing off of Japan
Hugo Chávez's nationalization of many industries

I know you're saying that, I just don't see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That's fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you've learned by looking at humans), relative to building a model by looking at humans (and using no information you've learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.

No, it's not my goal that we not look at humans. I instead think we're currently too focused on trying to figure out everything from only the kinds of evidence we can easily collect today, and that we also don't have detailed enough models to know what other evidence is likely relevant. I think understanding whatever is going on with values is hard because there is relevant data further "down the stack", if you will, from observations of behavior. I think that because I look at issues like latent preferences, which by definition exist because we didn't have enough data to infer their existence, but which need not remain latent if we gather more data about how they are generated, such that we could discover them in advance by looking earlier in the process that generates them.

Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

What's your model for why those actions weren't undone?

To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?

It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease your intelligence and that of all your friends by a factor of X. For what range of X would you expect this intervention to be net positive?

(I'm aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)

Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they'd be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.

What's your model for why those actions weren't undone?

Not quite sure what you're asking here. In the first two cases they eventually were undone after people got fed up with the situation; the last is recent enough that I don't consider its not having already been undone as evidence people like it, only that they don't have the power to change it. My view is that these changes stayed in place because the dictators and their successors continued to believe the good outweighed the harm, either when this was clearly contrary to the ground truth but served some narrow purpose that was viewed as more important, or when the ground truth was too hard to discover at the time and we only believe it was net harmful through the lens of historical analysis.

To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?

It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease your intelligence and that of all your friends by a factor of X. For what range of X would you expect this intervention to be net positive?

I'm not claiming we're at some optimal level of intelligence for any particular purpose, only that more intelligence leads to greater agency which, in the absence of sufficient mechanisms to constrain actions to beneficial ones, results in greater risk of negative outcomes due to things like deviance and unilateral action. Thus I do in fact think we'd be safer from ourselves if we were dumber (screening off, for example, existential risks humanity faces due to outside threats like asteroids).

By comparison, chimpanzees may not live what look to us like very happy lives, and they are some factor dumber than us, but they also aren't at risk of making themselves extinct because one chimp really wanted a lot of bananas.

I'm not sure how much smarter we could all get without putting us at too much risk. I think there's an anthropic argument to be made that we are below whatever level of intelligence is dangerous to ourselves without greater safeguards, because we wouldn't exist in such universes due to having killed ourselves. But I feel like I have little evidence to make a judgement about how much smarter is safe, given that, for example, being, say, 95th-percentile smart didn't stop people from building things like atomic weapons or developing dangerous chemical applications. I would expect making my friends smarter to risk similarly bad outcomes. Making them dumber seems safer, especially when I'm in the frame of thinking about AGI.

Charlie Steiner · 6y (edited)

I almost agree, but still ended up disagreeing with a lot of your bullet points. Since reading your list was useful, I figured it would be worthwhile to just make a parallel list. ✓ for agreement, × for disagreement (• for neutral).

Problem overview

✓ I think we're confused about what we really mean when we talk about human values.

× But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.

_ × We can trust AI to discover that structure for us even though we couldn't verify the result, because the point isn't getting the right answer, it's having a trustworthy process.

_ × We can't just write down the correct structure any more than we can just write down the correct content. We're trying to translate a vague human concept into precise instructions for an AI.

✓ Agree with extensional definition of values, and relevance to decision-making.

• Research on the content of human values may be useful information about what humans consider to be human values. I think research on the structure of human values is in much the same boat - information, not the final say.

✓ Agree about Stuart's work being where you'd go to write down a precise set of preferences based on human preferences, and that the problems you mention are problems.

Solution overview

✓ Agree with assumptions.

• I think the basic model leaves out the fact that we're changing levels of description.

_ × Merely causing events (in the physical level of description) is not sufficient to say we're acting (in the agent level of description). We need some notion of "could have done something else," which is an abstraction about agents, not something fundamentally physical.

_ × Similar quibbles apply to the other parts - there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.

_ × The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a bit more nuanced way to place preferences and meta-preferences.

_ • The simple patch is to just say that there's some level of description where the decision-generation process lives, and preferences live at a higher level of abstraction than that. Therefore preferences are emergent phenomena from the level of description the decision-generation process is on.

_ _ × But I think if one applies this patch, then it's a big mistake to use loaded words like "values" to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definition from earlier.

× If we recognize that we're talking about different levels of description, then preferences are not either causally after or causally before decisions-on-the-basic-model-level-of-abstraction. They're regular patterns that we can use to model decisions at a slightly higher level of abstraction.

_ • How to describe self-aware agents at a low level of abstraction then? Well, time to put on our GEB hats. The low level of abstraction just has to include a computation of the model we would use on the higher level of abstraction.

✓ Despite all these disagreements, I think you've made a pretty good case that the human brain plausibly computes a single currency (valence) that it uses to rate both most decisions and most predictions.

_ × But I still don't agree that this makes valence human values. I mean values in the sense of "the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology." So I don't think we're left with a neuroscience problem, I still think what we want the AI to learn is on that higher level of abstraction where preferences live.