I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as "predict what a human will thumbs-up or thumbs-down". The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
The argument we are trying to explain has an additional step that you're missing. You think that we are pointing to the hidden complexity of wishes in order to establish, in one step, that it would therefore be hard to get an AI to output a correct wish, because wishes are complex and therefore difficult for an AI to predict. This is not what we are trying to say. We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI's preferences has a lot of hidden complexity. This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem. Even if, in fact, the ball-bearings would legitimately be part of the mechanism if you could build one! Making lots of progress on smoother, lower-friction ball-bearings is even so not the sort of thing that should cause you to become much more hopeful about the perpetual motion machine. It is on the wrong side of a theoretical divide between what is straightforward and what is not.
You will probably protest that we phrased our argument badly, relative to what you could reasonably have been expected to hear from your perspective. If so, this is not surprising, because explaining things is very hard, especially when everyone in the audience comes in with a different set of preconceptions and a different internal language about this nonstandardized topic. But mostly, explaining this thing is hard, and I tried taking lots of different angles to get the idea across.
In modern times, and earlier, it is of course very hard for ML folk to get their AI to make completely accurate predictions about human behavior. They have to work very hard and put a lot of sweat into getting more accurate predictions out! When we try to say that this is on the shallow end of a shallow-deep theoretical divide (corresponding to Hume's Razor) it often sounds to them like their hard work is being devalued and we could not possibly understand how hard it is to get an AI to make good predictions.
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising shocking thing and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how in principle a superintelligence would of course have no trouble predicting human text outputs. We did not expect GPT-4 to happen, in fact, intermediate trajectories are harder to predict than endpoints, so we did not carefully phrase all our explanations in a way that would make them hard to misinterpret after GPT-4 came around.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square up with the notion of "the hidden complexity of wishes" and we could have explained that part in advance. Alas, nobody actually predicted GPT-4 so we do not have that advance disclaimer down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
I never said that you or any other MIRI person thought it would be "hard to get a superintelligence to understand humans". Here's what I actually wrote:
Non-MIRI people sometimes strawman MIRI people as having said that AGI would literally lack an understanding of human values. I don't endorse this, and I'm not saying this.
[...]
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of "pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes". In other words, it's the problem of specifying a function that reflects the "human value function" with high fidelity.
I mostly don't think that the points you made in your comment respond to what I said. My best guess is that you're responding to a stock character who represents the people who have given similar arguments to you repeatedly in the past. In light of your personal situation, I'm actually quite sympathetic to you responding this way. I've seen my fair share of people misinterpreting you on social media too. It can be frustrating to hear the same bad arguments, often made from people with poor intentions, over and over again and continue to engage thoughtfully each time. I just don't think I'm making the same mistakes as those people. I tried to distinguish myself from them in the post.
I would find it slightly exhausting to reply to all of this comment, given that I think you misrepresented me in a big way right out of the gate, so I'm currently not sure if I want to put in the time to compile a detailed response.
That said, I think some of the things you said in this comment were nice, and helped to clarify your views on this subject. I admit that I may have misinterpreted some of the comments you made, and if you provide specific examples, I'm happy to retract or correct them. I'm thankful that you spent the time to engage. :)
Without digging in too much, I'll say that this exchange and the OP is pretty confusing to me. It sounds like MB is like "MIRI doesn't say it's hard to get an AI that has a value function" and then also says "GPT has the value function, so MIRI should update". This seems almost contradictory.
A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.
And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn't mean the AI cares about it. And it's still a lot of bits, even if you have the bits. So it's still true that the part about getting the AI to care has to go precisely right.
If there's a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it's like:
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn't crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.
[...]
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
I consider this a reasonably accurate summary of this discussion, especially the part I'm playing in it. Thanks for making it more clear to others.
I'm not going to comment on "who said what when", as I'm not particularly interested in the question myself, though I think the object level point here is important:
This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there.
The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you're assuming that the model is highly capable, and trained in a highly diverse environment, then you can assume that the world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains what the "simplest" (according to the inductive biases) goal is that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance.
The most rigorous version of this sort of analysis that exists is probably here, where the key question is how to find a prior (that is, a set of inductive biases) such that the desired goal has a lower complexity conditional on the world model compared to the undesired goal. Importantly, both of them will be pretty low relative to the world model, since the vast majority of the complexity is in the world model.
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that's not actually helpful! Because what you care about is not reducing the complexity of the desired goal, but reducing the relative complexity of the desired goal compared to undesired goals, since (modulo randomness due to path-dependence), what you actually get is the maximum a posteriori, the "simplest model that fits the data."
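The relative-complexity point can be made concrete with a toy calculation (my own illustration, not from the comment above): under a simplicity prior that weights a goal of description length K bits by 2^-K, the posterior odds between two goals depend only on the *difference* in their conditional complexities, so a better world model that shrinks both absolute complexities leaves the odds untouched.

```python
# Toy illustration (my own construction, with made-up numbers): under a
# 2^-K simplicity prior, what matters for the argmax is the *difference*
# in conditional complexities, not their absolute size.

def posterior_odds(k_desired_bits: float, k_undesired_bits: float) -> float:
    """Prior odds of the desired vs. undesired goal under a 2^-K prior."""
    return 2.0 ** (k_undesired_bits - k_desired_bits)

# With a weak world model, both goals are expensive to specify:
print(posterior_odds(k_desired_bits=1000, k_undesired_bits=990))  # 2^-10

# With a strong world model, both goals get much cheaper to point to,
# but the 10-bit gap, and hence the odds, is unchanged:
print(posterior_odds(k_desired_bits=40, k_undesired_bits=30))     # 2^-10
```

The two calls print the same odds, which is the sense in which "everything has lower complexity" is not by itself good news.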
Similarly, the key arguments for deceptive alignment rely on the set of objectives that are aligned with human values being harder to point to than the set of all long-term objectives. The key problem is that any long-term objective is compatible with good training performance due to deceptive alignment (the model will reason that it should play along for the purposes of getting its long-term objective later), such that the total probability of that set under the inductive biases swamps the probability of the aligned set. And this holds despite the fact that human values do in fact get easier to point to as your model gets better, because what isn't necessarily changing is the relative difficulty.
That being said, I think there is actually an interesting update to be had on the relative complexity of different goals from the success of LLMs, which is that a pure prediction objective might actually have a pretty low relative complexity. And that's precisely because prediction seems substantially easier to point to than human values, even though both get easier to point to as your world model gets better. But of course the key question is whether prediction is easier to point to compared to a deceptively aligned objective, which is unclear and I think could go either way.
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
(On my pessimistic days, I wonder if this kind of claim gets made because humans write suggestive phrases like "predictive loss function" in their papers, next to the mathematical formalisms.)
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this is: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Where "robustly" includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).
An example of something that would cause me to update: if we could make LLMs non-jailbreakable without relying on additional filters on input or output.
Taking my own stab at answers to some of your questions:
A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.
Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.
SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.
Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!
(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don't have moral weight. But entities at human-level intelligence and above definitely can, and possibly do by default.)
Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I'd be pretty happy to offer betting odds over if we could operationalize them.
Thanks for the reply. Let me clarify my position a bit.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain.
I didn't mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it's quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are "only predicting what comes next", as opposed to "choosing" or "executing" one completion, or "wanting" to complete the tasks they are given, or—more generally—"making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations."
Concerning "GPTs are predictors", the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon's theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox's theorems which do axiomatically support the Bayesian account of beliefs and belief updates... But this long-winded indirect axiomatic justification of "beliefs" does not sufficiently support some kind of inference like "GPTs are just predicting things, they don't really want to complete tasks." That's a very strong claim about the internal structure of LLMs.
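For concreteness, here is a minimal sketch (my own, with illustrative numbers) of what the pretraining objective actually scores: the cross-entropy loss on one next-token prediction is just the negative log-probability assigned to the observed token, a purely epistemic quantity with no term that says anything about what the model "wants".

```python
import math

# Minimal sketch (illustrative, not any real model's API): the pretraining
# objective scores each step by the negative log-probability the model
# assigned to the token that actually occurred.

def cross_entropy_loss(predicted_probs: dict, observed_token: str) -> float:
    """CE loss for a single next-token prediction, in nats."""
    return -math.log(predicted_probs[observed_token])

# Hypothetical predicted distribution over the next token:
probs = {"Sorry": 0.7, "Here": 0.2, "The": 0.1}

loss = cross_entropy_loss(probs, "Sorry")
print(round(loss, 4))  # 0.3567: low loss because the observed token was likely
```

Nothing in that scoring rule distinguishes "predicting a completion" from "wanting to produce it", which is the point: the inference from the objective to the internal structure is a further, much stronger claim.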
(Besides, the inductive biases probably have more to do with the parameter->function map, than the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
That does clarify, thanks.
Response in two parts: first, my own attempt at clarification over terms / claims; second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a "motivational structure", human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
The clarification:
At least to me, the phrase "GPTs are [just] predictors" is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by "prediction" in a very literal way.
Even if something within the model is aware (in some sense) of how its outputs will be used, it's up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
I don't interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we're talking about, what its prompt is, how it has been trained, its overall capability level, etc.
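The "only modality is a next-token distribution" point can be made concrete with a small sketch (a hypothetical toy interface; the token list and logits are made up): the model's entire contribution is one probability distribution, and everything downstream of it, greedy decoding, temperature sampling, filtering, is a choice made by the surrounding program.

```python
import math
import random

# Illustrative sketch (hypothetical toy model interface): the model emits
# logits over next tokens; how they are turned into text is decided by the
# programmer, not the model.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature=1.0, rng=random):
    """One of many possible downstream uses of the output distribution."""
    probs = softmax([l / temperature for l in logits])
    return rng.choices(tokens, weights=probs, k=1)[0]

# Made-up model output for one forward pass:
tokens = ["Sorry", "Here", "The"]
logits = [2.0, 0.5, 0.1]

# Greedy decoding: deterministically take the argmax.
greedy = tokens[max(range(len(logits)), key=lambda i: logits[i])]
print(greedy)  # Sorry

# Temperature sampling: a different downstream choice, same model output.
print(sample(tokens, logits, temperature=1.5))
```

Both decoding strategies consume the identical model output; the difference between them lives entirely outside the model.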
On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the "alien actress" / "agentic homunculus" story. I don't think either extreme is a good fit for current SoTA GPTs; e.g. if there's an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a "motivational system" or "preferences" (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren't particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
Maybe a less straw (or just alternative) position is that a "motivational system" and a "predictive system" are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
Now, turning to my own disagreement / skepticism:
Although I don't find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I'm also pretty skeptical of any concrete version of the "middle ground" story that I outlined above as a plausible description of what is going on inside of current GPTs.
Consider an RLHF'd GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
Assume the model (when sampled auto-regressively) will respond with either: "Sorry, I can't answer that..." or "Here you go: ...", depending on whether it judges that answering is in line with its preferences or not.
Because the answer is mostly determined by the first token ("Here" or "Sorry"), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.
Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
I can imagine such a system working in at least two ways in current GPTs: as a kind of global superposition, smeared diffusely across the whole network's weights, or as a more localized mechanism sandwiched in at specific layers.
(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it's a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
However, I'm skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a "motivational system", at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
The point is that even for a relatively simple task like this, a human's motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there's a simpler analogue of this that is happening, I think calling such an analogue a "motivational system" is overly-suggestive.
Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don't expect the complexity of the motivational system and methods for influencing them to scale in a way that is related to the model's underlying capabilities. e.g. you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
This is an excellent reply, thank you!
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
I think I broadly agree with your points. I think I'm more imagining "similarity to humans" to mean "is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context." This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
However, "interpreting GPT: the logit lens" and e.g. DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner-optimizer threat models) would predict that most sophisticated steering happens later in the network.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
Quoting myself in April:
"MIRI's argument for AI risk depended on AIs being bad at natural language" is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.
E.g., Nate Soares in 2016: https://intelligence.org/files/ValueLearningProblem.pdf
Or Eliezer Yudkowsky in 2008, critiquing his own circa-1997 view "sufficiently smart AI will understand morality, and therefore will be moral": https://www.lesswrong.com/s/SXurf2mWFw8LX2mkG/p/CcBe9aCKDgT5FSoty
(The response being, in short: "Understanding morality doesn't mean that you're motivated to follow it.")
It was claimed by @perrymetzger that https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes makes a load-bearing "AI is bad at NLP" assumption.
But the same example in https://intelligence.org/files/ComplexValues.pdf (2011) explicitly says that the challenge is to get the right content into a utility function, not into a world-model:
The example does build in the assumption "this outcome pump is bad at NLP", but this isn't a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.
It's true that Eliezer and I didn't predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.
But the specific update "AI is good at NLP, therefore alignment is easy" requires that there be an old belief like "a big part of why alignment looks hard is that we're so bad at NLP".
It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn't believe it!
Found another example: MIRI's first technical research agenda, in 2014, went out of its way to clarify that the problem isn't "AI is bad at NLP".
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.
I read this as saying "GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that's a far harder goal". But in the case of GPT-4, it seems to me like this distinction is not very clear-cut - it's useful to us because, in its architecture, there's a sense in which "predicting" and "fulfilling" are basically the same thing.
It also seems to me that this distinction is not very clear-cut in humans, either - that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to you doing something and then having your decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there's a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.
Your comment focuses on GPT-4 being "pretty good at extracting preferences from human data", when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".
I agree with you that it was obvious in advance that a superintelligence would understand human value.
However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT-4 seems to suggest that the biggest issue will be a situation where:
1) The AI has an option that would produce a lot of utility if you take one position on an exotic philosophical thought experiment and very little if you take the other side.
2) The existence of powerful AI means that the thought experiment is no longer exotic.
I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer's comment on this post does not explain the pieces which you specifically are missing. I'm going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn't already in your head, so I apologize in advance if I guess wrong.
(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don't think e.g. Eliezer or Nate or Rob would use directly, though I think they're generally gesturing at the same things.)
I think a core part of the confusion here involves conflation of several importantly-different things, so I'll start by setting up a toy model in which we can explicitly point to those different things and talk about how their differences matter. Note that this is a toy model; it's not necessarily intended to be very realistic.
Our toy model is an ML system, designed to run on a hypercomputer. It works by running full low-level physics simulations of the universe, for exponentially many initial conditions. When the system receives training data/sensor-readings/inputs, it matches the predicted-sensor-readings from its low-level simulations to the received data, does a Bayesian update, and then uses that to predict the next data/sensor-readings/inputs; the predicted next-readings are output to the user. In other words, it's doing basically-perfect Bayesian prediction on data based on low-level physics priors.
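For concreteness, here is a drastically scaled-down sketch of that toy model (all names and scales are mine, invented for illustration): instead of low-level physics over exponentially many initial conditions, we keep a handful of candidate "worlds", each a deterministic reading-generator, and do an exact Bayesian update on each incoming sensor reading before predicting the next one.

```python
import random

NOISE = 0.1  # chance a sensor reading is corrupted to a random value

def simulate(world, t):
    """Deterministic 'physics': the reading that `world` emits at time t."""
    return random.Random(world * 1_000_003 + t).randrange(10)

def update(posterior, t, observed):
    """Bayesian update of P(world | data) given the reading at time t."""
    def likelihood(world):
        return (1 - NOISE) if simulate(world, t) == observed else NOISE / 9
    unnorm = {w: p * likelihood(w) for w, p in posterior.items()}
    z = sum(unnorm.values())
    return {w: p / z for w, p in unnorm.items()}

def predict(posterior, t):
    """Posterior-weighted distribution over the reading at time t."""
    dist = {}
    for w, p in posterior.items():
        r = simulate(w, t)
        dist[r] = dist.get(r, 0.0) + p
    return dist

# Feed in 20 readings generated by the "true" world (world 3); the
# posterior concentrates on that world, so predictions become accurate.
posterior = {w: 1 / 5 for w in range(5)}
for t in range(20):
    posterior = update(posterior, t, simulate(3, t))
```

Note what's in the sketch and what isn't: everything here is prediction; there is no slot anywhere in this mechanism where "what the system wants" lives.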
Claim 1: this toy model can "extract preferences from human data" in behaviorally the same way that GPT does (though presumably the toy model would perform better). That is, you can input a bunch of text data, then prompt the thing with some moral/ethical situation, and it will continue the text in basically the same way a human would (at least within distribution). (If you think GPTs "understand human values" in a stronger sense than that, and that difference is load-bearing for the argument you want to make, then you should leave a response highlighting that particular divergence.)
Modulo some subtleties which I don't expect to be load-bearing for the current discussion, I expect MIRI-folk would say:
(Those two points are here as a checksum, to see whether your own models have diverged yet from the story told here.)
(Some tangential notes:
)
So, what are the hard parts and why doesn't the toy model address them?
First distinction: humans' answers to questions about morality are not the same as human values. More generally, any natural-language description of human values, or natural-language discussion of human values, is not the same as human values.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at human values (as opposed to "humans' natural-language yay/nays in response to natural language prompts", which are not human values and are not a safe proxy for human values, but are probably somewhat easier to point an intelligence at).
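A toy illustration of why the "optimize hard" step is where things break (the functional forms here are invented purely for illustration, not anyone's model of real training): take a proxy signal that agrees with the true value on the familiar distribution, plus idiosyncratic rater quirks that only show up for exotic options. Mild optimization over familiar options recovers the true optimum; hard optimization over a vast option space selects precisely for wherever the raters' error is largest.

```python
def true_value(x):
    """What we actually care about: best outcome is x == 10."""
    return -abs(x - 10)

def rating_error(x):
    """Idiosyncratic quirks in the raters' yay/nay: negligible on the
    familiar distribution, systematic for exotic options."""
    if abs(x - 10) <= 50:
        return 0.0
    return 1.5 * abs(x - 10) if x % 97 == 0 else 0.0

def proxy(x):
    """The training signal: thumbs-up/down, i.e. value plus rater quirks."""
    return true_value(x) + rating_error(x)

# Mild optimization over familiar options recovers the true optimum...
small_search = max(range(0, 20), key=proxy)   # -> 10

# ...but hard optimization over a vast option space lands on an option
# the proxy rates very highly and the true value rates terribly.
huge_search = max(range(-10_000, 10_000), key=proxy)
```

The gap between `proxy(huge_search)` (large and positive) and `true_value(huge_search)` (hugely negative) is the cartoon version of "optimize hard for the yay/nays and you die".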
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) human values (which basically matches humans' concept of human values, by assumption). Conceptually, the key question is something like "is the concept of human values within this mind the type of thing which a pointer in the mind can point at?". But our toy model has nothing like that. Even with full access to the internals of the toy model, it's just low-level physics; identifying "human values" embedded in the toy model is no easier than identifying "human values" embedded in the physics of our own world. So that's reason #1 why the toy model doesn't address the hard parts: the toy model doesn't "understand" human values in the sense of internally using ~the same concept of human values as humans use.
In some sense, the problem of "specifying human values" and "aiming an intelligence at something" are just different facets of this same core hard problem:
A key thing to note here: all of those "hard problem" bullets are inherently about the internals of a mind. Observing external behavior in general reveals little-to-nothing about progress on those hard problems. The difference between the toy model and the more structured mind is intended to highlight the issue: the toy model doesn't even contain the types of things which would be needed for the relevant kind of "pointing at human values", yet the toy model can behaviorally achieve ~the same things as GPT.
(And we'd expect something heavily optimized to predict human text to be pretty good at predicting human text regardless, which is why we get approximately-zero evidence from the observation that GPT accurately predicts human answers to natural-language queries about morality.)
Now, there is some relevant evidence from interpretability work. Insofar as human-like concepts tend to have GPT-internal representations which are "simple" in some way, and especially in a way which might make them easily-pointed-to internally in a way which carries semantics across the pointer, that is relevant. On my-model-of-a-MIRIish-view, it's still not very relevant, since we expect major phase shifts as AI gains capabilities, so any observation of today's systems is very weak evidence at best. But things like e.g. Turner's work retargeting a maze-solver by fiddling with its internals are at least the right type-of-thing to be relevant.
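The "right type-of-thing" point can be made concrete with a toy (entirely hypothetical, and much simpler than the actual maze-solver work): a planner whose goal is an explicit internal variable. Because the goal is pointer-like internal state with the right semantics, editing that one piece of internals retargets the whole system's behavior, an intervention that has no analogue in the low-level-physics toy model.

```python
from collections import deque

GRID = [
    "S..#.",
    ".#.#.",
    ".#...",
    ".#.#.",
    "...#G",
]

def solve(goal):
    """Breadth-first planner: shortest path from the start to `goal`."""
    start = (0, 0)
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < 5 and 0 <= nc < 5 and GRID[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(path + [(nr, nc)])
    return None

class MazeMind:
    def __init__(self):
        self.goal = (4, 4)   # internal goal "pointer", aimed at the 'G' cell
    def act(self):
        return solve(self.goal)

mind = MazeMind()
original_plan = mind.act()      # plans toward (4, 4)

mind.goal = (0, 4)              # retarget by editing internals only
retargeted_plan = mind.act()    # now plans toward the top-right cell
```

The interpretability-relevant question is whether anything like `mind.goal` exists inside a trained network, with internal semantics the rest of the cognition actually respects, not whether the network's outputs sound moral.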
I would guess that many people (possibly including you?) reading all that will say roughly:
Ok, but this whole "If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die" thing is presumably about very powerful intelligences, not about medium-term, human-ish level intelligences! So observing GPT should still update us about whether medium-term systems can be trusted to e.g. do alignment research.
Remember that, on a MIRIish model, meaningful alignment research is proving rather hard for human-level intelligence; one would therefore need at least human-level intelligence in order to solve it in a timely fashion. (Also, AI hitting human-level at tasks like AI research means takeoff is imminent, roughly speaking.) So the general pathway of "align weak systems -> use those systems to accelerate alignment research" just isn't particularly relevant on a MIRIish view. Alignment of weaker systems is relevant only insofar as it informs alignment of more powerful systems, which is what everything above was addressing.
I expect plenty of people to disagree with that point, but insofar as you expect people with MIRIsh views to think weak systems won't accelerate alignment research, you should not expect them to update on the difficulty of alignment due to evidence whose relevance routes through that pathway.
(Placeholder: I think this view of alignment/model internals seems wrongheaded in a way which invalidates the conclusion, but don't have time to leave a meaningful reply now. Maybe we should hash this out sometime at Lighthaven.)
This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions:
(In general, I agree that discussions about current arguments are way more important than discussions about what people believed >5 years ago. However, I think it's occasionally useful to talk about the latter, and so I wrote one post about it.)
Are you interpreting me as arguing that alignment is easy in this post?
Not in any sense which I think is relevant to the discussion at this point.
Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie?
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
That doesn't mean that any of them (nor I) have ever explained these parts particularly clearly. Speaking from my own experience, these parts are damned annoyingly difficult to explain; a whole stack of mental models has to be built just to convey the idea, and none of them are particularly legible. (Specifically, the second half of the "'Values', and Pointing At Them" section is the part that's most difficult to explain. My post The Pointers Problems is my own best attempt to date to convey those models, and it remains mediocre.) Most of the arguments historically given are, I think, attempts to shoehorn as much of the underlying mental model as possible into leaky analogies.
Thanks for the continued clarifications.
Our primary existing disagreement might be this part,
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the specific question of who said what when. However, here's a passage from the Arbital page on the Problem of fully updated deference, which I assume was written by Eliezer,
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Here, Eliezer describes the problem of value identification similar to the way I had in the post, except he refers to a function that reflects "value V in all its glory" rather than a function that reflects V with fidelity comparable to the judgement of an ordinary human. And he adds that "as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down". My interpretation here is therefore as follows,
If interpretation (1) is accurate, then I mostly just think that we don't need to specify an objective function that matches something like the full coherent extrapolated volition of humanity in order to survive AGI. On the other hand, if interpretation (2) is accurate, then I think in 2017 and potentially earlier, Eliezer genuinely felt that there was an important component of the alignment problem that involved specifying a function that reflected the human value function at a level that current LLMs are relatively close to achieving, and he considered this problem unsolved.
I agree there are conceivable alternative ways of interpreting this quote. However, I believe the weight of the evidence, given the quotes I provided in the post, in addition to the one I provided here, supports my thesis about the historical argument, and what people had believed at the time (even if I'm wrong about a few details).
Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, "When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about".
I believe you're getting close to the actual model here, but not quite hitting it on the head.
First: lots of ML-ish alignment folks today would distinguish the problem of aligning an AI capable enough to do alignment research well enough for it to be in the right basin of attraction[1], from the problem of aligning a far-superhuman intelligence. On a MIRIish view, humanish-or-weaker systems don't much matter for alignment, but there's still an important potential divide between aligning an early supercritical AGI and aligning full-blown far superintelligence.
In the "long run", IIUC Eliezer wants basically-"ideal"[2] alignment of far superintelligence. But he'll still tell you that you shouldn't aim for something that hard early on; instead, aim for something (hopefully) easier, like e.g. corrigibility. (If you've been reading the old arbital pages, then presumably you've seen him say this sort of thing there.)
Second: while I worded my comment at the top of this chain to be about values, the exact same mental model applies to other alignment targets, like e.g. corrigibility. Here's the relevant part of my earlier comment, edited to be about corrigibility instead:
... humans' answers to questions about ~~morality~~ corrigibility are not the same as ~~human values~~ corrigibility. More generally, any natural-language description of ~~human values~~ corrigibility, or natural-language discussion of ~~human values~~ corrigibility, is not the same as ~~human values~~ corrigibility.

(On my-model-of-a-MIRIish-view:) If we optimize hard for humans' natural-language yay/nay in response to natural language prompts which are nominally about "corrigibility", we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.

The central thing-which-is-hard-to-do is to point powerful intelligence at ~~human values~~ corrigibility (as opposed to "humans' natural-language yay/nays in response to natural language prompts which are nominally about 'corrigibility'", which are not ~~human values~~ corrigibility and are not a safe proxy for ~~human values~~ corrigibility, but are probably somewhat easier to point an intelligence at).

Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) ~~human values~~ corrigibility (which basically matches humans' concept of ~~human values~~ corrigibility, by assumption). Conceptually, the key question is something like "is the concept of ~~human values~~ corrigibility within this mind the type of thing which a pointer in the mind can point at?". But our toy model has nothing like that. Even with full access to the internals of the toy model, it's just low-level physics; identifying ~~"human values"~~ "corrigibility" embedded in the toy model is no easier than identifying ~~"human values"~~ "corrigibility" embedded in the physics of our own world. So that's reason #1 why the toy model doesn't address the hard parts: the toy model doesn't "understand" ~~human values~~ corrigibility in the sense of internally using ~the same concept of ~~human values~~ corrigibility as humans use.

In some sense, the problem of "specifying ~~human values~~ corrigibility" and "aiming an intelligence at something" are just different facets of this same core hard problem:

- we need to somehow get a powerful mind to "have inside it" a concept which basically matches the corresponding human concept at which we want to aim
- "have inside it" cashes out to something roughly like "the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics"; e.g. answering external natural-language queries doesn't even begin to cut it
- ... and then some pointer(s) in the mind's search algorithms need to somehow be pointed at that concept.
... and we could just as easily repeat this exercise with even weaker targets, like "don't kill all the humans". The core hard problem remains the same. On the MIRIish view, some targets (like corrigibility) might be easier than others (like human values) mainly because the easier targets are more likely to be "natural" concepts which an AI ends up using, so the step of "we need to somehow get a powerful mind to 'have inside it' a concept which basically matches the corresponding human concept at which we want to aim" is easier. But it's still basically the same mental model, basically the same core hard steps which need to be overcome somehow.
My guess at your main remaining disagreement after all that: sure, answers to natural language queries about morality might not cut it under a lot of optimization pressure, but why aren't answers to natural language queries a good enough proxy for near-superhuman systems?
(On a MIRIish model) a couple reasons:
(I personally would give a bunch of other reasons here, but they're not things I see MIRI folks discuss as much.)
Going one level deeper: the same mental model as above is still the relevant thing to have in mind, even for near-superhuman (or even human-ish-level) intelligence. It's still the same core hard problem, and answers to natural language queries are still basically-irrelevant for basically the same reasons.
Specifically, this refers to the basin of attraction under the operation of the AI developing/helping develop a successor AI.
"Ideal" is in scare quotes here because it's not necessarily "ideal" in the same sense that any given reader would first think of it - for instance I don't think Eliezer would imagine "mathematically proving the system is Good", though I expect some people imagine that he imagines that.
The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.
What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)
I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:
I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
A possible thing that's muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF'd for corporatespeak or whatever) really doesn't seem all that relevant, to me, to the question of how hard it's going to be to get a long-horizon outcome-pumping AGI to act towards values.
If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that'd be no harder than any other sort of Q. As makes me feel pretty good about me being like "yep, that's just not much evidence, because it's just not surprising."
If people think they're going to be able to use GPT-4 and find the "generally moral" vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then... well they're gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4's "moral-ish" concept is not the sort of thing that makes for a nice future.
This is distinct from saying "an uploaded human allowed to make many copies of themselves would reliably create a dystopia". I suspect some human-uploads could make great futures (but that most wouldn't), but regardless, "would this dynamic system, under reflection, steer somewhere good?" is distinct from "if I use the best neuroscience at my disposal to extract something I hopefully call a 'neural concept' and make a powerful optimizer pursue that, will the result be good?". The answer to the latter is "nope, not unless you're really very good at singling out the 'value' concept from among all the brain's concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me; I recommend targeting some minimal pivotal act instead)".
Part of why you can't pick out the "values" concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for "that which the long-term future should be optimized towards", that concept is not encoded as simply and directly as the concept of "trees". The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of "there are lots of concepts that are easy to confuse with the 'values' concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and ..." as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.
(I don't love this attempt at clarification myself, because it makes it sound like you'll have five concept-candidates and will just need to do a little interpretability work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how "getting the right values in there" is more like a problem of choosing the AI's target from among its concepts rather than a problem of getting the concept to exist in the AI's mind in the first place.)
(Where, again, the point I'd prefer to make is something like "the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflect and resolve internal conflicts and handle big ontology shifts. Which isn't to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral".)
For whatever it's worth, while I think that the problem of getting the right values in there ("there" being its goals, not its model) is a real one, I don't consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with "diamond" being the canonical example). (I'm probably on the record about this somewhere, and recall having tossed around guesstimates like "being able to target the AGI is 80%+ of the problem".) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how to use a level or two of indirection (as per the "Do What I Mean" proposal in the Value Learning paper), although that's the sort of problem that we shouldn't try to solve under time pressure.
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
For what it's worth, I didn't claim that you argued "getting the AI to understand human values is hard". I explicitly distanced myself from that claim. I was talking about the difficulty of value specification, and generally tried to make this distinction clear multiple times.
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like "I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human" and squinting.
Attempting to articulate the argument that I can half-see: on Matthew's model of past!Nate's model, AI was supposed to have a hard time answering questions like "Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?" without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and... nope, that one fell back into the "Matthew thinks Nate thought getting the AI to understand human values was hard" hypothesis.
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to "we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe", though I think that your whole framing is off and that you're missing a few things:
This still doesn't feel quite like it's getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven't dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like "the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down" and "suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question". Which, as separate from the question of whether that's a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans' ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I'm arguing,
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
I have a quick response to what I see as your primary objection:
The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you'll find that it's cognizant of many nuances in human morality that go way deeper than the moral question of whether to "call 911 when Alice is in labor and your car has a flat". Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for". I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can't, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won't be capable of performing in the near future, if you think that they are not capable of the 'deep' value specification that you care about. And here, again, I'm looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won't be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it's difficult for me to interpret your disagreement without a little more insight into what you're predicting.
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can't single out concepts that are robustly worth optimizing for, then either,
Can you be more clear about which of these you believe?
I'm also including "indirect" ways that humans can single out concepts that are robustly worth optimizing for. But then I'm allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you're allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can't single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's mind seems to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)
However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)
The Arbital page for "value identification problem" is a three-sentence stub, I'm not exactly sure what the term means on that stub (e.g., whether "pinpointing valuable outcomes to an advanced agent" is about pinpointing them in the agent's beliefs or in its goals), and the MIRI website gives me no hits for "value identification".
As for "value specification", the main resource where MIRI talks about that is https://intelligence.org/files/TechnicalAgenda.pdf, where we introduce the problem by saying:
A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.
A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.
It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).
So I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?
for example in this 2016 talk from Yudkowsky.
[footnote:] More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I don't see him saying anywhere "the issue is that the AI doesn't understand human goals". In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:
With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.
During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.
Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.
They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?
It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!
I think the point of the smiles example here isn't "NLP is hard, so we'd use the proxy of smiles instead, and all the issues of alignment are downstream of this"; rather, it's that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it's good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it's easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to see because it's harder to see the implications. (Which is why "smiley" is used here.)
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."
That link is broken; the paper is https://intelligence.org/files/ValueLearningProblem.pdf. The full paragraph here is:
Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.
Revisiting the Ontology Identification section helps clarify what Nate means by "safely extracting preferences from a model of a human": IIUC, he's talking about a programmer looking at an AI's brain, identifying the part of the AI's brain that is modeling the human, identifying the part of the AI's brain that is "the human's preferences" within that model of a human, and then manually editing the AI's brain to "hook up" the model-of-a-human-preference to the AI's goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing -- e.g., one assisted by automated interpretability tools.)
In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.
[...]
To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].
This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.
Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.
As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn't about whether the AI is good at NLP. Quoting the paper's abstract:
Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended.
And the lede section:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
Back to your post:
And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice
I don't think I understand what difference you have in mind here, or why you think it's important. Doesn't "this AI understands X" more-or-less imply "this AI can successfully distinguish X from not-X in practice"?
This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?
I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.
Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent's goals, not getting specific complex content into the agent's beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent's "utility function".
For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."
As I said in another comment:
"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)
It's true that 'value is relatively complex' is part of why it's hard to get the right goal into an AGI; but it doesn't follow from this that 'AI is able to develop pretty accurate beliefs about our values' helps get those complex values into the AGI's goals. (It does provide nonzero evidence about how complex value is, but I don't see you arguing that value is very simple in any absolute sense, just that it's simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn't do much to bound the complexity of human value.)
In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is par-human at understanding English.
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.
Your image isn't displaying for me, but I assume it's this one?
I don't know what you mean by "specify an AI's objectives" here, but the specific term Nate uses here is "value learning" (not "value specification" or "value identification"). And Nate's Value Learning Problem paper, as I noted above, explicitly disclaims that 'get the AI to be smart enough to output reasonable-sounding moral judgments' is a core part of the problem.
He states, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it.
The way you quoted this makes it sound like a gloss on the image, but it's actually a quote from the very start of the talk:
The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.
The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]
I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.
(In general, I think the phrase "value specification" is sort of confusingly vague. I'm not sure what the best replacement is for it -- maybe just "value loading", following Bostrom? -- but I suspect MIRI's usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for "the part of the alignment problem that isn't subsumed in getting the AI to reliably maximize diamonds", because this struck us as a smallish but nontrivial part of the problem; but I think it's easy to read the term as referring to something a lot more narrow.)
The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].
Yep -- I think I'd have endorsed claims like "by default, a baby AGI won't share your values even if it understands them" at the time, but IIRC the essay doesn't make that point explicitly, and some of the points it does make seem either false (wait, we're going to be able to hand AGI a hand-written utility function? that's somehow tractable?) or confusingly written. (Like, if my point was 'even if you could hand-write a utility function, this fails at point X', I should have made that 'even if' louder.)
Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.
Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn't understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.
The main thing I'm claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
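To make the distinction concrete, here is a toy sketch of what "an actual function" means here, as opposed to mere understanding. This is purely illustrative: `query_model` is a hypothetical stand-in for querying a multimodal model, stubbed out with a trivial keyword check rather than any real API.

```python
def query_model(prompt: str) -> str:
    # Stub for a hypothetical LLM query. A real implementation would call a
    # model API; this stand-in just keyword-matches so the sketch is runnable.
    return "good" if "help" in prompt else "bad"

def value_function(outcome_description: str) -> float:
    """Value specification 'solved' in Matthew's sense: an explicit,
    always-queryable function from any outcome description to a judgment."""
    answer = query_model(f"Is this outcome good or bad? {outcome_description}")
    return 1.0 if answer == "good" else 0.0

# By contrast, an AI that merely *understands* values offers no such
# guarantee: it may stay silent or answer deceptively, so its understanding
# is not directly usable as a transparent, legible evaluation function.
print(value_function("the AI helps Alice get to the hospital"))
print(value_function("the AI floods the workshop"))
```

The point of the sketch is only the interface: a solved specification problem gives you a function you can call on arbitrary outcomes, whereas "the model understands values" does not by itself give you that.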
(I've now added further clarification to the post)
I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?
[...]
I don't see him saying anywhere "the issue is that the AI doesn't understand human goals".
I agree. I am not arguing that MIRI ever thought that AIs wouldn't understand human goals. I honestly don't know how to make this point more clear in my post, given that I said that more than once.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?
I think there's considerably more value in having the human value function in an actual computer. More to the point, what I'm saying here is more that MIRI seems to have thought that getting such a function was (1) important for solving alignment, and (2) hard to get (for example because it was hard to extract human values from data). I tried to back this up with evidence in the post, and overall I still feel I succeeded, if you go through the footnotes and read the post carefully.
Your image isn't displaying for me, but I assume it's this one?
Yes. I'm not sure why the image isn't loading. I tried to fix it, but I wasn't able to. I asked LW admins/mods through the intercom about this.
I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.
Maybe you're right. I'm just not convinced. I think the idea that Nate wasn't talking about what I'm calling the value identification/value specification problem in that quote just isn't a straightforward interpretation of the talk as a whole. I think Nate was actually talking about the idea of specifying human values, in the sense of value identification, as I defined and clarified above, and he also talked about the problem of getting the AI to actually maximize these values (separately from their specification). However, I do agree that he was not talking about getting content merely into the AI's beliefs.
Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.
That's fair. The main reason why I'm referencing it is because it's what comes up when I google "The genie knows but doesn't care", which is a phrase that I saw referenced in this debate before. I don't know if your essay is the source of the phrase or whether you just titled it that, but I thought it was worth adding a paragraph of clarification about how I interpret that essay, and I'm glad to see you mostly agree with my interpretation.
The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
Ah, this is helpful clarification! Thanks. :)
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".
don't know if your essay is the source of the phrase or whether you just titled it
I think I came up with that particular phrase (though not the idea, of course).
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool
If you don't think MIRI ever considered coming up with an "explicit function that reflects the human value function with high fidelity" to be "an important part of the alignment problem", can you explain this passage from the Arbital page on The problem of fully updated deference?
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Eliezer (who I assume is the author) appears to say in the first paragraph that solving the problem of value identification for superintelligences would "probably [solve] the whole problem", and by "whole problem" I assume he's probably referring to what he saw as an important part of the alignment problem (maybe not though?).
He referred to the problem of value identification as getting "some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory." This seems to be very similar to my definition, albeit with the caveat that my definition isn't about revealing "V in all its glory" but rather, is more about revealing V at the level that an ordinary human is capable of revealing V.
Unless the sole problem here is that we absolutely need our function that reveals V to be ~perfect, then I think this quote from the Arbital page directly supports my interpretation, and overall supports the thesis in my post pretty strongly (even if I'm wrong about a few minor details).
As an experimental format, here is the first draft of what I wrote for next week's newsletter about this post:
Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. That the AI is not going to maximize the utility function you gave it at the expense of all common sense.
As usual, this logically has to be more than zero evidence for this, given how we would react if GPT-4 indeed lacked such common sense or was unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.
However, I think this misunderstands what is going on. GPT-4 is training on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to the ones who provided you with feedback, to reflect common sense and seem to avoid violating various other concerns. That will be more important than maximizing the request made, especially if strong negative feedback was given for violations of various principles including common sense.
Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that it is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make something work out fine in the future, under much stronger optimization pressure; I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases, and if you are exerting far-above-human levels of optimization pressure based only on the values humans consciously expressed under highly limited compute, you are going to have a bad time.
I believe MIRI folks have a directionally similar position to mine only far stronger.
I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
It sounds like you are saying: We just need to prompt GPT with something like "Q: How good is this outcome? A:" and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function (because it's literally its utility function) (In practice this agent might look something like AutoGPT).
But I doubt that's what you are saying, so I'm asking for clarification if you still have energy to engage!
It sounds like you are saying: We just need to prompt GPT with something like "Q: How good is this outcome? A:" and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function
I think solving value specification is basically what you need in order to build a good reward model. If you have a good reward model, and you solve inner alignment, then I think you're pretty close to being able to create (at least) a broadly human-level AGI that is aligned with human values.
That said, to make superintelligent AI go well, we still need to solve the problem of scalable oversight, because, among other reasons, there might be weird bugs that result from a human-level specification of our values being optimized to the extreme. However, having millions of value-aligned human-level AGIs would probably help us a lot with this challenge.
We'd also need to solve the problem of making sure there aren't catastrophic bugs in the AIs we build. And we'll probably have to solve the general problem of value drift from evolutionary and cultural change. There's probably a few more things that we need to solve that I haven't mentioned too.
These other problems may be very difficult, and I'm not denying that. But I think it's good to know that we seem to be making good progress on the "reward modeling" part of the alignment problem. I think it's simply true that many people in the past imagined that this problem would be a lot harder than it actually was.
So, IIUC, you are proposing we:
Can you say more about what you mean by solution to inner alignment? Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?" (the difference revealing itself in cases of tampering with the RM) Or the objective "producing outcomes that are good-for-humans, harmless, honest, etc."?
So, IIUC, you are proposing we:
- Literally just query GPT-N about whether [input_outcome] is good or bad
I'm hesitant to say that I'm actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) "GPT-N" can narrowly refer to a specific line of models by OpenAI whereas the way I was using it was more in line with "generically powerful multi-modal models in the near-future", and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan in ways that are difficult to predict and hard to write down comprehensively ahead of time. And if I were to lay out "the plan" in a few paragraphs, it will probably look pretty inadequate or too high-level compared to whatever people actually end up doing.
Also, I'm not ruling out that there might be an even better plan. Indeed, I hope there is a better plan available by the time we develop human-level AGI.
That said, with the caveats I've given above, yes, this is basically what I'm proposing, and I think there's a reasonably high chance (>50%) that this general strategy would work to my own satisfaction.
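To make the proposal under discussion concrete, here is a minimal sketch in Python of what "query the model about whether an outcome is good, and use that as the value function for a generic maximizer" might look like. Everything here is illustrative: `query_model` is a hypothetical stand-in for a real LLM API call (it returns canned scores), and the 0-to-10 scoring prompt and the example outcomes are my own assumptions, not anything from the discussion above.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns canned scores so the
    # sketch is self-contained. A real implementation would call an API.
    canned = {
        "donate the surplus food to a shelter": "9",
        "incinerate the surplus food": "2",
    }
    for outcome, score in canned.items():
        if outcome in prompt:
            return score
    return "5"  # default score for unrecognized outcomes

def value_function(outcome: str) -> float:
    """Score an outcome by asking the model directly, as proposed above."""
    prompt = (
        "Q: On a scale of 0 (very bad) to 10 (very good), how good is this "
        f"outcome for humans? Outcome: {outcome}\nA:"
    )
    return float(query_model(prompt))

def best_outcome(candidates: list[str]) -> str:
    # The "generic maximizer" step: pick whichever candidate the
    # model-as-reward-model scores highest. A real agent would search or
    # plan over a vastly larger outcome space.
    return max(candidates, key=value_function)
```

Note that the disagreement in this thread is not about whether something like this is easy to write down; it's about whether the resulting value function is adequate under strong optimization pressure (e.g., adversarial examples against the reward model).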
Can you say more about what you mean by solution to inner alignment?
To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
For example, if you teach an AI (or a child) that murder is wrong, they should be able to generalize this principle to new situations that don't match the typical environment they were trained in, and be motivated to follow the principle in those circumstances. Metaphorically, the child grows up and doesn't want to murder people even after they've been given a lot of power over other people's lives. I think this can be distinguished from the problem of specifying what murder is, because the central question is whether the AI/child is motivated to pursue the ethics that was instilled during training, even in new circumstances, rather than whether they are simply correctly interpreting the command "do not murder".
Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?"
I think I mean the second thing, rather than the first thing, but it's possible I am not thinking hard enough about this right now to fully understand the distinction you are making.
To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is there are lots of ways to generalize / lots of objectives they could learn to follow, and we don't have a good way of pinning it down to exactly the ones we want. (And indeed as our AIs get smarter there will be new ways of generalizing / categories of objectives that will become available, such as "play the training game")
So it sounds like you are saying "A solution to inner alignment means that we've figured out how to train an AI to have the objectives we want it to have, robustly, such that it continues to have them way off distribution." This sounds like basically the whole alignment problem to me?
I see later you say you mean the second thing -- which is interestingly in between "play the training game" and "actually be honest/helpful/harmless/etc." (A case that distinguishes it from the latter: Suppose it is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If its objective is "do what the RM would give a high score to if it was operating normally", it'll basically wirehead on that adversarial example once it learns about it, even if it's in deployment and isn't getting trained anymore, and even though it's an obviously harmful/dishonest piece of text.)
It's a nontrivial and plausible claim you may be making -- that this sort of middle ground might be enough for safe AGI, when combined with the rest of the plan at least. But I'd like to see it spelled out. I'm pretty skeptical right now.
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that. Here is Linch's attempted summary of this post, which I largely agree with.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it, which admittedly could be unfair to MIRI[2]. Then I'll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.
Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I'm saying. Also, make sure to read the footnotes if you're skeptical of some of my claims.
Here's my very rough caricature of the discussion so far, plus my response:
Non-MIRI people: Yudkowsky talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. In that essay, the genie did silly things like throwing your mother out of the building rather than safely carrying her out. Actually, it turned out that it was pretty easy to get an AI to understand common sense. LLMs are essentially safe-ish genies that do what you intend. MIRI people should update on this information.
MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger): You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence 'The genie knows but doesn't care'. There's no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the "right" set of values.[2]
My response:
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of "pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes". In other words, it's the problem of specifying a utility function that reflects the "human value function" with high fidelity, i.e. the problem of specifying a utility function that can be optimized safely. See this footnote[4] for further clarification about how I view the value identification/specification problem.
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
The primary foreseeable difficulty Yudkowsky offered for the value identification problem is that human value is complex.[5] In turn, the idea that value is complex was stated multiple times as a premise for why alignment is hard.[6] Another big foreseeable difficulty with the value identification problem is the problem of edge instantiation, which was talked about extensively in early discussions on LessWrong.
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[7] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."
I claim that GPT-4 is already pretty good at extracting preferences from human data. It exhibits common sense. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that GPT-4 literally executes your intended instructions in practice, and that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well in practice, and this will become increasingly apparent in the near future as models get more capable and expand to more modalities.[8]
I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
Maybe you think "the problem" was always that we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe. But personally, I think having such a standard is both unreasonable and inconsistent with the implicit standard set by essays from Yudkowsky and other MIRI people. In Yudkowsky's essay on the hidden complexity of wishes, he wrote,
I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.[9]
Here's another way of putting my point: In general, there are at least two ways that someone can fail to follow your intended instructions. Either your instructions aren't well-specified and don't fully capture your intentions, or the person doesn't want to obey your instructions even if those instructions accurately capture what you want. Practically all the evidence that I've found seems to indicate that MIRI people thought that both problems would be hard to solve for AI, not merely the second problem.
For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."[10]
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts. He states, "My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:"[11]
In the talk Soares also says, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it. This attitude is reflected in other MIRI essays.
The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[12]. The sense in which the genie "doesn't care" is that it doesn't care what you intended; it only cares about the objectives that you gave it. That's not the same as saying the genie doesn't care about the objectives you specified.
Given the evidence, it seems to me that the following conclusions are probably accurate:
As an endnote, I don't think it really matters whether MIRI people had mistaken arguments about the difficulty of alignment ten years ago. It matters far more what their arguments are right now. However, I do care about accurately interpreting what people said on this topic, and I think it's important for people to acknowledge when the evidence has changed.
I recognize that these people are three separate individuals who each have their own nuanced views. However, I think each of them has expressed broadly similar views on this particular topic, and I've seen each of them engage in a discussion about how we should update about the difficulty of alignment given what we've seen from LLMs.
I'm not implying MIRI people would necessarily completely endorse everything I've written in this caricature. I'm just conveying how they've broadly come across to me, and I think the basic gist is what's important here. If some MIRI people tell me that this caricature isn't a fair summary of what they've said, I'll try to edit the post later to include real quotes.
For now, I'll point to this post from Nate Soares in which he stated,
More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
I was not able to find a short and crisp definition of the value identification/specification problem from MIRI. However, in the Arbital page on the Problem of fully updated deference, the problem is described as follows,
In MIRI's 2017 technical agenda, they described the problem as follows, which I believe roughly matches how I'm using the term,
To support this claim, I'll point out that the Arbital page for the value identification problem says, "A central foreseen difficulty of value identification is Complexity of Value".
For example, in this post, Yudkowsky gave "five theses", one of which was the "complexity of value thesis". He wrote, that the "five theses seem to imply two important lemmas", the first lemma being "Large bounded extra difficulty of Friendliness.", i.e. the idea that alignment is hard.
Another example comes from this talk. I've linked to a part in which Yudkowsky begins by talking how human value is complex, and moves to talking about how that fact presents challenges for aligning AI.
My guess is that the perceived difficulty of specifying objectives was partly a result of MIRI people expecting that natural language understanding wouldn't occur in AI until just barely before AGI, and at that point it would be too late to use AI language comprehension to help with alignment.
Rob Bensinger said,
In 2010, Eliezer Yudkowsky commented,
If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don't think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.
I mostly interpret Yudkowsky's Coherent Extrapolated Volition as an aspirational goal for what we could best hope for in an ideal world where we solve every part of alignment, rather than a minimal bar for avoiding human extinction. In Yudkowsky's post on AGI ruin, he stated,
I don't think I'm taking him out of context. Here's a longer quote from the talk,
The full quote is,
This interpretation appears supported by the following quote from Rob Bensinger's essay,
It's unclear to me whether MIRI people are claiming that they only ever thought (2) was the hard part of alignment, but here's a quote from Nate Soares that offers some support for this interpretation IMO,
Even if I'm misinterpreting Soares here, I don't think that would undermine the basic point that MIRI people should probably update in the direction of alignment being easier than they thought.