This is a special post for quick takes by Matthew Barnett. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
10 comments, sorted by Click to highlight new comments since: Today at 9:11 AM

[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]

Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it. Then I'll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.

Here's my very rough caricature of the discussion so far, plus my contribution:

Non-MIRI people: "Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on this information."

MIRI people: "You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence 'The genie knows but does not care'. There's no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the "right" set of values."

Me: 

I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes. A foreseeable difficulty with the value identification problem -- which was talked about extensively -- is the problem of edge instantiation.

I claim that GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes, unless you require something that vastly exceeds human performance on this task. In other words, GPT-4 looks like it's on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And I don't just mean that GPT-4 "understands" human values well: I mean that asking it to distinguish valuable from non-valuable outcomes generally works well as an approximation of the human value function in practice. Therefore it is correct for non-MIRI people to point out that that this problem is less difficult than some people assumed in the past.

Crucially, I'm not saying that GPT-4 actually cares about maximizing human value. I'm saying that it's able to transparently pinpoint to us which outcomes are bad and which outcomes are good, with the fidelity approaching an average human. Importantly, GPT-4 can tell us which outcomes are valuable "out loud" (in writing), rather than merely passively knowing this information. This element is key to what I'm saying because it means that we can literally just ask a multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function".

The supposed reason why the value identification problem was hard is because human value is complex. In fact, that's mentioned the central foreseeable difficulty on the Arbital page. Complexity of value was used as an explicit premise in the argument for why AI alignment would be difficult many times in MIRI's history (two examples: 1, 2), and it definitely seems like the reason for this premise was because it was supposed to be an intuition for why the value identification problem would be hard. If the value identification problem was never predicted to be hard, then what was the point of making a fuss about complexity of value in the first place?

In general, there are (at least) two ways that someone can fail to follow your intended instructions. Either your instructions aren't well-specified, or the person doesn't want to obey your instructions even if the instructions are well-specified. All the evidence that I've found seems to indicate that MIRI people thought that both problems would be hard for AI, not merely the second problem. For example, a straightforward literal interpretation of Nate Soares' 2017 talk supports this interpretation.

It seems to me that the following statements are true:

  1. MIRI people used to think that it would be hard to both (1) develop an explicit function that corresponds to the "human utility function" with accuracy comparable to that of an average human, and (2) separately, get an AI to care about maximizing this function. The idea that MIRI people only ever thought (2) was the hard part seems false, and unsupported by the links above.
  2. Non-MIRI people often strawman MIRI people as thinking that AGI would literally lack an understanding of human values.
  3. The "complexity of value" argument pretty much just tells us that we need an AI to learn human values, rather than hardcoding a utility function from scratch. That's a meaningful thing to say, but it doesn't tell us much about whether alignment is hard; it just means that extremely naive approaches to alignment won't work.

Complexity of value says that the space of system's possible values is large, compared to what you want to hit, so to hit it you must aim correctly, there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. System's understanding of some goal is not relevant to this, unless a design for correctly aiming system's values makes use of it.

Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs to work and thus less dangerous given lack of comprehensive understanding of what aligning a superintelligence entails.

I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes.

If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?

One possible approach to constructing the “hook” would be (presumably) solving the value identification problem and then we have an explicit function in the source code and then … I dunno, but that seems like a plausibly helpful first step. Like maybe you can have code which searches through the unlabeled world-model for sets of nodes that line up perfectly with the explicit function, or whatever.

Another possible approach to constructing the “hook” would be to invoke the magic words “human values” or “what a human would like” or whatever, while pressing a magic button that connects the associated nodes to motivation. That was basically my proposal here, and is also what you’d get with AutoGPT, I guess. However…

GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes

I think this is true in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.

If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?

I'm claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in. That doesn't involve any internal search over the human utility function embedded in GPT-4's weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it's pretty good at offering ethical advice in most situations that you're ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won't be long before GPT-N is about as good at knowing what's ethical as your average human; maybe it'll even be a bit more ethical.

(But yes, this isn't the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)

I think [GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes] in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.

I agree that MIRI people are interested in things like "what transhumanist utopia will the AI be motivated to build" but I think saying that this is the hard part of the value identification problem is pretty much just moving the goalposts from what I thought the original claim was. Very few, if any, humans can tell you exactly how to build the transhumanist utopia either. If the original thesis was "human values are hard to identify because it's hard to extract all the nuances of value embedded in human brains", now the thesis is becoming "human values are hard to identify because literally no one knows how to build the transhumanist utopia". 

But we don't need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn't do anything crazy and irreversible right away, like killing all the humans. Instead, you'd probably want to try to figure out, with the humans, what type of utopia we ought to build.

(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.)

moving the goalposts from what I thought the original claim was

I’m going off things like the value is fragile example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.”

That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context).

Very few, if any, humans can tell you exactly how to build the transhumanist utopia either.

Y’know, I think this one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.)

But we don't need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn't do anything crazy and irreversible right away, like killing all the humans. Instead, you'd probably want to try to figure out, with the humans, what type of utopia we ought to build.

I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.)

I'm claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in.

AFAIK, GPT-4 is a mix of “extrapolating text-continuation patterns learned from the internet” + “RLHF based on labeled examples”.

For the former, I note that Eliezer commented in 2018 that “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.” It kinda sounds like Eliezer is most comfortable thinking of RL, and sees SL as kinda different, maybe? (I could talk about my models here, but that’s a different topic… Anyway, I’m not really sure what Eliezer thinks.)

For the latter, again I think it’s a question of whether we care about our ability to extrapolate the labeled examples way out of distribution.

I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.

I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago. 

The first thing I'd like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was "misleading" because we did not present an affirmative case for our views.

I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially damper the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.

That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or at different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish you come up with alternatives, you can be my guest. I tried my best, at least.

More people said that our bet was misleading since it would seem that we too (Tamay and I) implicitly believe in short timelines, because our bets amounted to the claim that AGI has a substantial chance of arriving in 4-8 years. However, I do not think this is true.

The type of AGI that we should be worried about is one that is capable of fundamentally transforming the world. More narrowly, and to generalize a bit, fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence. Slow takeoff folks believe that we will need something capable of automating a wide range of labor.

Given the fast takeoff view, it is totally understandable to think that our bets imply a short timeline. However, (and I'm only speaking for myself here) I don't believe in a fast takeoff. I think there's a huge gap between AI doing well on a handful of benchmarks, and AI fundamentally re-shaping the economy. At the very least, AI has been doing well on a ton of benchmarks since 2012. Each time AI excels in one benchmark, a new one is usually invented that's a bit more tough, and hopefully gets us a little closer to measuring what we actually mean by general intelligence.

In the near-future, I hope to create a much longer and more nuanced post expanding on my thoughts on this subject, hopefully making it clear that I do care a lot about making real epistemic progress here. I'm not just trying to signal that I'm a calm and arrogant long-timelines guy who raises his nose at the panicky short timelines people, though I understand how my recent post could have given that impression.

fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence

Speaking only for myself, the minimal seed AI is a strawman of why I believe in "fast takeoff". In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.

I think the "self-improving" part will come from the system "AI Researchers + code synthesis model" with a direct feedback loop (modulo enough hardware), cf. here. That's the self-improving superintelligence.

Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I'm misreading a lot of people. 

Let me try to state it clearly:

The foom theorists are saying something like, "Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won't tell you about the actual gradual-ness of the development of AI itself, because you won't know which measures are gradual in advance."

And then this addendum is also added, "Furthermore, I expect that the quantities which will experience discontinuities from the past will be those that are qualitatively important, in a way that is hard to measure. For example, 'ability to manufacture nanobots' or 'ability to hack into computers' are qualitative powers that we can expect AIs will develop rather suddenly, rather than gradually from precursor states, in the way that, e.g. progress in image classification accuracy was gradual over time. This means you can't easily falsify the position by just pointing to straight lines on a million graphs."

If you agree that foom is somewhat likely, then I would greatly appreciate if you think this is your crux, or if you think I've missed something. 

If this indeed falls into one of your cruxes, then I feel like I'm in a position to say, "I kinda know what motivates your belief but I still think it's probably wrong" at least in a weak sense, which seems important.

I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it's kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but are actually very broad predictions when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I'm not sure how much of a crux this is for me yet.