Ngo and Yudkowsky on alignment difficulty

Eliezer Yudkowsky; Richard Ngo

This post is the first in a series of transcribed Discord conversations between Richard Ngo and Eliezer Yudkowsky, moderated by Nate Soares. We've also added Richard and Nate's running summaries of the conversation (and others' replies) from Google Docs.

Later conversation participants include Ajeya Cotra, Beth Barnes, Carl Shulman, Holden Karnofsky, Jaan Tallinn, Paul Christiano, Rob Bensinger, and Rohin Shah.

The transcripts are a complete record of several Discord channels MIRI made for discussion. We tried to edit the transcripts as little as possible, other than to fix typos and a handful of confusingly-worded sentences, to add some paragraph breaks, and to add referenced figures and links. We didn't end up redacting any substantive content, other than the names of people who would prefer not to be cited. We swapped the order of some chat messages for clarity and conversational flow (indicated with extra timestamps), and in some cases combined logs where the conversation switched channels.

Color key:

Chat by Richard and Eliezer

Other chat

Google Doc content

Inline comments

0. Prefatory comments

[Yudkowsky][8:32] (Nov. 6 follow-up comment)

(At Rob's request I'll try to keep this brief, but this was an experimental format and some issues cropped up that seem large enough to deserve notes.)

Especially when coming in to the early parts of this dialogue, I had some backed-up hypotheses about "What might be the main sticking point? and how can I address that?" which from the standpoint of a pure dialogue might seem to be causing me to go on digressions, relative to if I was just trying to answer Richard's own questions. On reading the dialogue, I notice that this looks evasive or like point-missing, like I'm weirdly not just directly answering Richard's questions.

Often the questions are answered later, or at least I think they are, though it may not be in the first segment of the dialogue. But the larger phenomenon is that I came in with some things I wanted to say, and Richard came in asking questions, and there was a minor accidental mismatch there. It would have looked better if we'd both stated positions first without question marks, say, or if I'd just confined myself to answering questions from Richard. (This is not a huge catastrophe, but it's something for the reader to keep in mind as a minor hiccup that showed up in the early parts of experimenting with this new format.)

[Yudkowsky][8:32] (Nov. 6 follow-up comment)

(Prompted by some later stumbles in attempts to summarize this dialogue. Summaries seem plausibly a major mode of propagation for a sprawling dialogue like this, and the following request seems like it needs to be very prominent to work - embedded requests later on didn't work.)

Please don't summarize this dialogue by saying, "and so Eliezer's MAIN idea is that" or "and then Eliezer thinks THE KEY POINT is that" or "the PRIMARY argument is that" etcetera. From my perspective, everybody comes in with a different set of sticking points versus things they see as obvious, and the conversation I have changes drastically depending on that. In the old days this used to be the Orthogonality Thesis, Instrumental Convergence, and superintelligence being a possible thing at all; today most OpenPhil-adjacent folks have other sticking points instead.

Please transform:

"Eliezer's main reply is..." -> "Eliezer replied that..."
"Eliezer thinks the key point is..." -> "Eliezer's point in response was..."
"Eliezer thinks a major issue is..." -> "Eliezer replied that one issue is..."
"Eliezer's primary argument against this is..." -> "Eliezer tried the counterargument that..."
"Eliezer's main scenario for this is..." -> "In a conversation in September of 2021, Eliezer sketched a hypothetical where..."

Note also that the transformed statements say what you observed, whereas the untransformed statements are (often incorrect) inferences about my latent state of mind.

(Though "distinguishing relatively unreliable inference from more reliable observation" is not necessarily the key idea here or the one big reason I'm asking for this. That's just one point I tried making - one argument that I hope might help drive home the larger thesis.)

1. September 5 conversation

1.1. Deep vs. shallow problem-solving patterns

[Ngo][11:00]

Hi all! Looking forward to the discussion.

[Yudkowsky][11:01]

Hi and welcome all. My name is Eliezer and I think alignment is really actually quite extremely difficult. Some people seem to not think this! It's an important issue, so it ought to be resolved somehow, which we can hopefully fully do today. (I will however want to take a break after the first 90 minutes, if it goes that far and if Ngo is in sleep-cycle shape to continue past that.)

[Ngo][11:02]

A break in 90 minutes or so sounds good.

Here's one way to kick things off: I agree that humans trying to align arbitrarily capable AIs seems very difficult. One reason that I'm more optimistic (or at least, not confident that we'll have to face the full very difficult version of the problem) is that at a certain point AIs will be doing most of the work.

When you talk about alignment being difficult, what types of AIs are you thinking about aligning?

[Yudkowsky][11:04]

On my model of the Other Person, a lot of times when somebody thinks alignment shouldn't be that hard, they think there's some particular thing you can do to align an AGI, which isn't that hard, and their model is missing one of the foundational difficulties for why you can't do (easily or at all) one step of their procedure. So one of my own conversational processes might be to poke around looking for a step that the other person doesn't realize is hard. That said, I'll try to directly answer your own question first.

[Ngo][11:07]

I don't think I'm confident that there's any particular thing you can do to align an AGI. Instead I feel fairly uncertain over a broad range of possibilities for how hard the problem turns out to be.

And on some of the most important variables, it seems like evidence from the last decade pushes towards updating that the problem will be easier.

[Yudkowsky][11:09]

I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up. This requires that the first actor or actors to build AGI, be able to do something with that AGI which prevents the world from being destroyed; if it didn't require superintelligence, we could go do that thing right now, but no such human-doable act apparently exists so far as I can tell.

So we want the least dangerous, most easily aligned thing-to-do-with-an-AGI, but it does have to be a pretty powerful act to prevent the automatic destruction of Earth after 3 months or 2 years. It has to "flip the gameboard" rather than letting the suicidal game play out. We need to align the AGI that performs this pivotal act, to perform that pivotal act without killing everybody.

Parenthetically, no act powerful enough and gameboard-flipping enough to qualify is inside the Overton Window of politics, or possibly even of effective altruism, which presents a separate social problem. I usually dodge around this problem by picking an exemplar act which is powerful enough to actually flip the gameboard, but not the most alignable act because it would require way too many aligned details: Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.

Since any such nanosystems would have to operate in the full open world containing lots of complicated details, this would require tons and tons of alignment work, is not the pivotal act easiest to align, and we should do some other thing instead. But the other thing I have in mind is also outside the Overton Window, just like this is. So I use "melt all GPUs" to talk about the requisite power level and the Overton Window problem level, both of which seem around the right levels to me, but the actual thing I have in mind is more alignable; and this way, I can reply to anyone who says "How dare you?!" by saying "Don't worry, I don't actually plan on doing that."

[Ngo][11:14]

One way that we could take this discussion is by discussing the pivotal act "make progress on the alignment problem faster than humans can".

[Yudkowsky][11:15]

This sounds to me like it requires extreme levels of alignment and operating in extremely dangerous regimes, such that, if you could do that, it would seem much more sensible to do some other pivotal act first, using a lower level of alignment tech.

[Ngo][11:16]

Okay, this seems like a crux on my end.

[Yudkowsky][11:16]

In particular, I would hope that - in unlikely cases where we survive at all - we were able to survive by operating a superintelligence only in the lethally dangerous, but still less dangerous, regime of "engineering nanosystems".

Whereas "solve alignment for us" seems to require operating in the even more dangerous regimes of "write AI code for us" and "model human psychology in tremendous detail".

[Ngo][11:17]

What makes these regimes so dangerous? Is it that it's very hard for humans to exercise oversight?

One thing that makes these regimes seem less dangerous to me is that they're broadly in the domain of "solving intellectual problems" rather than "achieving outcomes in the world".

[Yudkowsky][11:19][11:21]

Every AI output effectuates outcomes in the world. If you have a powerful unaligned mind hooked up to outputs that can start causal chains that effectuate dangerous things, it doesn't matter whether the comments on the code say "intellectual problems" or not.

The danger of "solving an intellectual problem" is when it requires a powerful mind to think about domains that, when solved, render very cognitively accessible strategies that can do dangerous things.

I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% "don't think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs" and 2% "actually think about this dangerous topic but please don't come up with a strategy inside it that kills us".

[Ngo][11:21][11:22]

Let me try and be more precise about the distinction. It seems to me that systems which have been primarily trained to make predictions about the world would by default lack a lot of the cognitive machinery which humans use to take actions which pursue our goals.

Perhaps another way of phrasing my point is something like: it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic.

Is this a crux for you?

(obviously "agentic" is quite underspecified here, so maybe it'd be useful to dig into that first)

[Yudkowsky][11:27][11:33]

I would certainly have learned very new and very exciting facts about intelligence, facts which indeed contradict my present model of how intelligences liable to be discovered by present research paradigms work, if you showed me... how can I put this in a properly general way... that problems I thought were about searching for states that get fed into a result function and then a result-scoring function, such that the input gets an output with a high score, were in fact not about search problems like that. I have sometimes given more specific names to this problem setup, but I think people have become confused by the terms I usually use, which is why I'm dancing around them.
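
As an illustration only, the problem setup described above (candidate states fed into a result function, then a result-scoring function, with the search looking for an input whose output scores highly) can be sketched as a brute-force search. This is a minimal sketch, not anything from the conversation: the names `search_for_high_score`, `toy_result`, and `toy_score`, and the toy domain of integer pairs, are all assumptions made up for the example.

```python
from itertools import product
from typing import Callable, Iterable, Tuple, TypeVar

State = TypeVar("State")
Result = TypeVar("Result")


def search_for_high_score(
    candidates: Iterable[State],
    result_fn: Callable[[State], Result],
    score_fn: Callable[[Result], float],
) -> Tuple[State, float]:
    """Return the candidate state whose result gets the highest score.

    Hypothetical helper: exhaustively tries every candidate, which only
    works for tiny toy domains like the one below.
    """
    best = None
    for state in candidates:
        # state -> result function -> result-scoring function
        score = score_fn(result_fn(state))
        if best is None or score > best[1]:
            best = (state, score)
    if best is None:
        raise ValueError("no candidate states to search over")
    return best


def toy_result(state: Tuple[int, int]) -> int:
    # Toy "result function": what happens when this input is used.
    return state[0] * state[1]


def toy_score(result: int) -> int:
    # Toy "result-scoring function": closer to 42 is better, 0 is the best score.
    return -abs(result - 42)


best_state, best_score = search_for_high_score(
    product(range(10), repeat=2), toy_result, toy_score
)
print(best_state, best_score)
```

The search loop never looks at what the states mean; swapping in a different result function or scoring function reuses the same machinery unchanged.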

In particular, just as I have a model of the Other Person's Beliefs in which they think alignment is easy because they don't know about difficulties I see as very deep and fundamental and hard to avoid, I also have a model in which people think "why not just build an AI which does X but not Y?" because they don't realize what X and Y have in common, which is something that draws deeply on having deep models of intelligence. And it is hard to convey this deep theoretical grasp.

But you can also see powerful practical hints that these things are much more correlated than, eg, Robin Hanson was imagining during the FOOM debate, because Robin did not think something like GPT-3 should exist; Robin thought you should need to train lots of specific domains that didn't generalize. I argued then with Robin that it was something of a hint that humans had visual cortex and cerebellar cortex but not Car Design Cortex, in order to design cars. Then in real life, it proved that reality was far to the Eliezer side of Eliezer on the Eliezer-Robin axis, and things like GPT-3 were built with less architectural complexity and generalized more than I was arguing to Robin that complex architectures should generalize over domains.

The metaphor I sometimes use is that it is very hard to build a system that drives cars painted red, but is not at all adjacent to a system that could, with a few alterations, prove to be very good at driving a car painted blue. The "drive a red car" problem and the "drive a blue car" problem have too much in common. You can maybe ask, "Align a system so that it has the capability to drive red cars, but refuses to drive blue cars." You can't make a system that is very good at driving red-painted cars, but lacks the basic capability to drive blue-painted cars because you never trained it on that. The patterns found by gradient descent, by genetic algorithms, or by other plausible methods of optimization, for driving red cars, would be patterns very close to the ones needed to drive blue cars. When you optimize for red cars you get the blue car capability whether you like it or not.

[Ngo][11:32]

Does your model of intelligence rule out building AIs which make dramatic progress in mathematics without killing us all?

[Yudkowsky][11:34][11:39]

If it were possible to perform some pivotal act that saved the world with an AI that just made progress on proving mathematical theorems, without, eg, needing to explain those theorems to humans, I'd be extremely interested in that as a potential pivotal act. We wouldn't be out of the woods, and I wouldn't actually know how to build an AI like that without killing everybody, but it would immediately trump everything else as the obvious line of research to pursue.

Parenthetically, there is very very little which my model of intelligence rules out. I think we all die because we cannot do certain dangerous things correctly, on the very first try in the dangerous regimes where one mistake kills you, and do them before proliferation of much easier technologies kills us. If you have the Textbook From 100 Years In The Future that gives the simple robust solutions for everything, that actually work, you can write a superintelligence that thinks 2 + 2 = 5 because the Textbook gives the methods for doing that which are simple and actually work in practice in real life.

(The Textbook has the equivalent of "use ReLUs instead of sigmoids" everywhere, and avoids all the clever-sounding things that will work at subhuman levels and blow up when you run them at superintelligent levels.)

[Ngo][11:36][11:40]

Hmm, so suppose we train an AI to prove mathematical theorems when given them, perhaps via some sort of adversarial setter-solver training process.

By default I have the intuition that this AI could become extremely good at proving theorems - far beyond human level - without having goals about real-world outcomes.

It seems to me that in your model of intelligence, being able to do tasks like mathematics is closely coupled with trying to achieve real-world outcomes. But I'd actually take GPT-3 as some evidence against this position (although still evidence in favour of your position over Hanson's), since it seems able to do a bunch of reasoning tasks while still not being very agentic.

There's some alternative world where we weren't able to train language models to do reasoning tasks without first training them to perform tasks in complex RL environments, and in that world I'd be significantly less optimistic.

[Yudkowsky][11:41]

I put to you that there is a predictable bias in your estimates, where you don't know about the Deep Stuff that is required to prove theorems, so you imagine that certain cognitive capabilities are more disjoint than they actually are. If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

GPT-3 is a... complicated story, on my view of it and intelligence. We're looking at an interaction between tons and tons of memorized shallow patterns. GPT-3 is very unlike the way that natural selection built humans.

[Ngo][11:44]

I agree with that last point. But this is also one of the reasons that I previously claimed that AIs could be more intelligent than humans while being less agentic, because there are systematic differences between the way in which natural selection built humans, and the way in which we'll train AGIs.

[Yudkowsky][11:45]

My current suspicion is that Stack More Layers alone is not going to take us to GPT-6 which is a true AGI; and this is because of the way that GPT-3 is, in your own terminology, "not agentic", and which is, in my terminology, not having gradient descent on GPT-3 run across sufficiently deep problem-solving patterns.

[Ngo][11:46]

Okay, that helps me understand your position better.

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

[Yudkowsky][11:50]

I agree.

[Ngo][11:50]

In my terminology, this is a reason that humans are "more agentic" than we otherwise would have been.

[Yudkowsky][11:50]

This seems indisputable.

[Ngo][11:51]

Another important difference: humans were trained in environments where we had to run around surviving all day, rather than solving maths problems etc.

[Yudkowsky][11:51]

I continue to nod.

[Ngo][11:52]

Supposing I agree that reaching a certain level of intelligence will require AIs with the "deep problem-solving patterns" you talk about, which lead AIs to try to achieve real-world goals. It still seems to me that there's likely a lot of space between that level of intelligence, and human intelligence.

And if that's the case, then we could build AIs which help us solve the alignment problem before we build AIs which instantiate sufficiently deep problem-solving patterns that they decide to take over the world.

Nor does it seem like the reason humans want to take over the world is because of a deep fact about our intelligence. It seems to me that humans want to take over the world mainly because that's very similar to things we evolved to do (like taking over our tribe).

[Yudkowsky][11:57]

So here's the part that I agree with: If there were one theorem only mildly far out of human reach, like proving the ABC Conjecture (if you think it hasn't already been proven), and providing a machine-readable proof of this theorem would immediately save the world - say, aliens will give us an aligned superintelligence, as soon as we provide them with this machine-readable proof - then there would exist a plausible though not certain road to saving the world, which would be to try to build a shallow mind that proved the ABC Conjecture by memorizing tons of relatively shallow patterns for mathematical proofs learned through self-play; without that system ever abstracting math as deeply as humans do, but the sheer width of memory and sheer depth of search sufficing to do the job. I am not sure, to be clear, that this would work. But my model of intelligence does not rule it out.

[Ngo][11:58]

(I'm actually thinking of a mind which understands maths more deeply than humans - but perhaps only understands maths, or perhaps also a range of other sciences better than humans.)

[Yudkowsky][12:00]

Parts I disagree with: That "help us solve alignment" bears any significant overlap with "provide us a machine-readable proof of the ABC Conjecture without thinking too deeply about it". That humans want to take over the world only because it resembles things we evolved to do.

[Ngo][12:01]

I definitely agree that humans don't only want to take over the world because it resembles things we evolved to do.

[Yudkowsky][12:02]

Alas, eliminating 5 reasons why something would go wrong doesn't help much if there's 2 remaining reasons something would go wrong that are much harder to eliminate!

[Ngo][12:02]

But if we imagine having a human-level intelligence which hadn't evolved primarily to do things that reasonably closely resembled taking over the world, then I expect that we could ask that intelligence questions in a fairly safe way.

And that's also true for an intelligence that is noticeably above human level.

So one question is: how far above human level could we get before a system which has only been trained to do things like answer questions and understand the world will decide to take over the world?

[Yudkowsky][12:04]

I think this is one of the very rare cases where the intelligence difference between "village idiot" and "Einstein", which I'd usually see as very narrow, makes a structural difference! I think you can get some outputs from a village-idiot-level AGI, which got there by training on domains exclusively like math, and this will proooobably not destroy the world (if you were right about that, about what was going on inside). I have more concern about the Einstein level.

[Ngo][12:05]

Let's focus on the Einstein level then.

Human brains have been optimised very little for doing science.

This suggests that building an AI which is Einstein-level at doing science is significantly easier than building an AI which is Einstein-level at taking over the world (or other things which humans evolved to do).

[Yudkowsky][12:08]

I think there's a certain broad sense in which I agree with the literal truth of what you just said. You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

[Ngo][12:08]

Maybe this is a good time to dig into the details of what they have in common, then.

[Yudkowsky][12:09][12:11][12:13]

I feel like I haven't had much luck with trying to explain that on previous occasions. Not to you, to others too.

There are shallow topics like why p-zombies can't be real and how quantum mechanics works and why science ought to be using likelihood functions instead of p-values, and I can barely explain those to some people, but then there are some things that are apparently much harder to explain than that and which defeat my abilities as an explainer.

That's why I've been trying to point out that, even if you don't know the specifics, there's an estimation bias that you can realize should exist in principle.

Of course, I also haven't had much luck in saying to people, "Well, even if you don't know the truth about X that would let you see Y, can you not see by abstract reasoning that knowing any truth about X would predictably cause you to update in the direction of Y" - people don't seem to actually internalize that much either. Not you, other discussions.

[Ngo][12:10][12:11][12:13]

Makes sense. Are there ways that I could try to make this easier? E.g. I could do my best to explain what I think your position is.

Given what you've said I'm not optimistic about this helping much.

But insofar as this is the key set of intuitions which has been informing your responses, it seems worth a shot.

Another approach would be to focus on our predictions for how AI capabilities will play out over the next few years.

I take your point about my estimation bias. To me it feels like there's also a bias going the other way, which is that as long as we don't know the mechanisms by which different human capabilities work, we'll tend to lump them together as one thing.

[Yudkowsky][12:14]

Yup. If you didn't know about visual cortex and auditory cortex, or about eyes and ears, you would assume much more that any sentience ought to both see and hear.

[Ngo][12:16]

So then my position is something like: human pursuit of goals is driven by emotions and reward signals which are deeply evolutionarily ingrained, and without those we'd be much safer but not that much worse at pattern recognition.

[Yudkowsky][12:17]

If there's a pivotal act you can get just by supreme acts of pattern recognition, that's right up there with "pivotal act composed solely of math" for things that would obviously instantly become the prime direction of research.

[Ngo][12:18]

To me it seems like maths is much more about pattern recognition than, say, being a CEO. Being a CEO requires coherence over long periods of time; long-term memory; motivation; metacognition; etc.

[Yudkowsky][12:18][12:23]

(One occasionally-argued line of research can be summarized from a certain standpoint as "how about a pivotal act composed entirely of predicting text" and to this my reply is "you're trying to get fully general AGI capabilities by predicting text that is about deep / 'agentic' reasoning, and that doesn't actually help".)

Human math is very much about goals. People want to prove subtheorems on the way to proving theorems. We might be able to make a different kind of mathematician that works more like GPT-3 in the dangerously inscrutable parts that are all noninspectable vectors of floating-point numbers, but even there you'd need some Alpha-Zero-like outer framework to supply the direction of search.

That outer framework might be able to be powerful enough without being reflective, though. So it would plausibly be much easier to build a mathematician that was capable of superhuman formal theorem-proving but not agentic. The reality of the world might tell us "lolnope" but my model of intelligence doesn't mandate that. That's why, if you gave me a pivotal act composed entirely of "output a machine-readable proof of this theorem and the world is saved", I would pivot there! It actually does seem like it would be a lot easier!
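
One rough way to picture the split described above, between a pattern-recognition part and an outer framework that supplies the direction of search, is a search loop that asks a scoring function which state looks most promising. The sketch below is an assumption-heavy toy: it uses plain greedy best-first search rather than AlphaZero's actual Monte Carlo tree search, the scorer is a hand-written stand-in for a learned network, and `guided_search`, `successors`, `pattern_score`, and `TARGET` are hypothetical names invented for the example.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

Step = Tuple[str, Hashable]  # (label of the step taken, resulting state)


def guided_search(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Step]],
    pattern_score: Callable[[Hashable], float],
    max_expansions: int = 10_000,
) -> Optional[List[str]]:
    """Greedy best-first search: the outer loop decides what to expand next,
    using only the scores handed to it by pattern_score."""
    # Heap entries are (negated score, insertion counter, state); the counter
    # breaks ties so states themselves are never compared.
    frontier = [(-pattern_score(start), 0, start)]
    parent: Dict[Hashable, Optional[Tuple[Hashable, str]]] = {start: None}
    counter = 1
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state = heapq.heappop(frontier)
        expansions += 1
        if is_goal(state):
            # Walk back through parents to recover the sequence of steps.
            steps: List[str] = []
            while parent[state] is not None:
                state, label = parent[state]
                steps.append(label)
            return steps[::-1]
        for label, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                heapq.heappush(frontier, (-pattern_score(nxt), counter, nxt))
                counter += 1
    return None


TARGET = 10  # toy "theorem": reach 10 starting from 1


def successors(n: int) -> List[Step]:
    # Toy "inference rules"; the bound keeps the state space finite.
    return [("+1", n + 1), ("*2", n * 2)] if n <= 2 * TARGET else []


def pattern_score(n: int) -> float:
    # Stand-in for a learned evaluator: higher means "looks closer to done".
    return -abs(TARGET - n)


steps = guided_search(1, lambda n: n == TARGET, successors, pattern_score)
print(steps)
```

In this toy, the scorer only ranks states; all of the direction comes from the outer loop, which has no ability to inspect or modify itself.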

[Ngo][12:21][12:25]

Okay, so if I attempt to rephrase your argument:

Your position: There's a set of fundamental similarities between tasks like doing maths, doing alignment research, and taking over the world. In all of these cases, agents based on techniques similar to modern ML which are very good at them will need to make use of deep problem-solving patterns which include goal-oriented reasoning. So while it's possible to beat humans at some of these tasks without those core competencies, people usually overestimate the extent to which that's possible.

[Yudkowsky][12:25]

Remember, a lot of my concern is about what happens first, especially if it happens soon enough that future AGI bears any resemblance whatsoever to modern ML; not about what can be done in principle.

[Soares][12:26]

(Note: it's been 85 min, and we're planning to take a break at 90min, so this seems like a good point for a little bit more clarifying back-and-forth on Richard's summary before a break.)

[Ngo][12:26]

I'll edit to say "plausible for ML techniques"?

(and "extent to which that's plausible")

[Yudkowsky][12:28]

I think that obvious-to-me future outgrowths of modern ML paradigms are extremely liable to, if they can learn how to do sufficiently superhuman X, generalize to taking over the world. How fast this happens does depend on X. It would plausibly happen relatively slower (at higher levels) with theorem-proving as the X, and with architectures that carefully stuck to gradient-descent-memorization over shallow network architectures to do a pattern-recognition part with search factored out (sort of, this is not generally safe, this is not a general formula for safe things!); rather than imposing anything like the genetic bottleneck you validly pointed out as a reason why humans generalize. Profitable X, and all X I can think of that would actually save the world, seem much more problematic.

[Ngo][12:30]

Okay, happy to take a break here.

[Soares][12:30]

Great timing!

[Ngo][12:30]

We can do a bit of meta discussion afterwards; my initial instinct is to push on the question of how similar Eliezer thinks alignment research is to theorem-proving.

[Yudkowsky][12:30]

Yup. This is my lunch break (actually my first-food-of-day break on a 600-calorie diet) so I can be back in 45min if you're still up for that.

[Ngo][12:31]

Sure.

Also, if any of the spectators are reading in real time, and have suggestions or comments, I'd be interested in hearing them.

[Yudkowsky][12:31]

I'm also cheerful about spectators posting suggestions or comments during the break.

[Soares][12:32]

Sounds good. I declare us on a break for 45min, at which point we'll reconvene (for another 90, by default).

Floor's open to suggestions & commentary.

1.2. Requirements for science

[Yudkowsky][12:50]

I seem to be done early if people (mainly Richard) want to resume in 10min (30m break)

[Ngo][12:51]

Yepp, happy to do so

[Soares][12:57]

Some quick commentary from me:

- It seems to me like we're exploring a crux in the vicinity of "should we expect that systems capable of executing a pivotal act would, by default in lieu of significant technical alignment effort, be using their outputs to optimize the future".
- I'm curious whether you two agree that this is a crux (but plz don't get side-tracked answering me).
- The general discussion seems to be going well to me.
  - In particular, huzzah for careful and articulate efforts to zero in on cruxes.

[Ngo][13:00]

I think that's a crux for the specific pivotal act of "doing better alignment research", and maybe some other pivotal acts, but not all (or necessarily most) of them.

[Yudkowsky][13:01]

I should also say out loud that I've been working a bit with Ajeya on making an attempt to convey the intuitions behind there being deep patterns that generalize and are liable to be learned, which covered a bunch of ground, taught me how much ground there was, and made me relatively more reluctant to try to re-cover the same ground in this modality.

[Ngo][13:02]

Going forward, a couple of things I'd like to ask Eliezer about:

1. In what ways are the tasks that are most useful for alignment similar or different to proving mathematical theorems (which we agreed might generalise relatively slowly to taking over the world)?
2. What are the deep problem-solving patterns underlying these tasks?
3. Can you summarise my position?

I was going to say that I was most optimistic about #2 in order to get these ideas into a public format

But if that's going to happen anyway based on Ajeya's work, then that seems less important

[Yudkowsky][13:03]

I could still try briefly and see what happens.

[Ngo][13:03]

That seems valuable to me, if you're up for it.

At the same time, I'll try to summarise some of my own intuitions about intelligence which I expect to be relevant.

[Yudkowsky][13:04]

I'm not sure I could summarize your position in a non-straw way. To me there's a huge visible distance between "solve alignment for us" and "output machine-readable proofs of theorems" where I can't give a good account of why you think talking about the latter would tell us much about the former. I don't know what other pivotal act you think might be easier.

[Ngo][13:06]

I see. I was considering "solving scientific problems" as an alternative to "proving theorems", with alignment being one (particularly hard) example of a scientific problem.

But decided to start by discussing theorem-proving since it seemed like a clearer-cut case.

[Yudkowsky][13:07]

Can you predict in advance why Eliezer thinks "solving scientific problems" is significantly thornier? (Where alignment is like totally not "a particularly hard example of a scientific problem" except in the sense that it has science in it at all; which is maybe the real crux; but also a more difficult issue.)

[Ngo][13:09]

Based on some of your earlier comments, I'm currently predicting that you think the step where the solutions need to be legible to and judged by humans makes science much thornier than theorem-proving, where the solutions are machine-checkable.

[Yudkowsky][13:10]

That's one factor. Should I state the other big one or would you rather try to state it first?

[Ngo][13:10]

Requiring a lot of real-world knowledge for science?

If it's not that, go ahead and say it.

[Yudkowsky][13:11]

That's one way of stating it. The way I'd put it is that it's about making up hypotheses about the real world.

Like, the real world is then a thing that the AI is modeling, at all.

Factor 3: On many interpretations of doing science, you would furthermore need to think up experiments. That's planning, value-of-information, search for an experimental setup whose consequences distinguish between hypotheses (meaning you're now searching for initial setups that have particular causal consequences).

[Ngo][13:12]

To me "modelling the real world" is a very continuous variable. At one end you have physics equations that are barely separable from maths problems, at the other end you have humans running around in physical bodies.

To me it seems plausible that we could build an agent which solves scientific problems but has very little self-awareness (in the sense of knowing that it's an AI, knowing that it's being trained, etc).

I expect that your response to this is that modelling oneself is part of the deep problem-solving patterns which AGIs are very likely to have.

[Yudkowsky][13:15]

There's a problem of inferring the causes of sensory experience in cognition-that-does-science. (Which, in fact, also appears in the way that humans do math, and is possibly inextricable from math in general; but this is an example of the sort of deep model that says "Whoops I guess you get science from math after all", not a thing that makes science less dangerous because it's more like just math.)

You can build an AI that only ever drives red cars, and which, at no point in the process of driving a red car, ever needs to drive a blue car in order to drive a red car. That doesn't mean its red-car-driving capabilities won't be extremely close to blue-car-driving capabilities if at any point the internal cognition happens to get pointed towards driving a blue car.

The fact that there's a deep car-driving pattern which is the same across red cars and blue cars doesn't mean that the AI has ever driven a blue car, per se, or that it has to drive blue cars to drive red cars. But if blue cars are fire, you sure are playing with that fire.

[Ngo][13:18]

To me, "sensory experience" as in "the video and audio coming in from this body that I'm piloting" and "sensory experience" as in "a file containing the most recent results of the large hadron collider" are very very different.

(I'm not saying we could train an AI scientist just from the latter - but plausibly from data that's closer to the latter than the former)

[Yudkowsky][13:19]

So there's separate questions about "does an AGI inseparably need to model itself inside the world to do science" and "did we build something that would be very close to modeling itself, and could easily stumble across that by accident somewhere in the inscrutable floating-point numbers, especially if that was even slightly useful for solving the outer problems".

[Ngo][13:19]

Hmm, I see

[Yudkowsky][13:20][13:21][13:21]

If you're trying to build an AI that literally does science only to observations collected without the AI having had a causal impact on those observations, that's legitimately "more dangerous than math but maybe less dangerous than active science".

You might still stumble across an active scientist because it was a simple internal solution to something, but the outer problem would be legitimately stripped of an important structural property the same way that pure math not describing Earthly objects is stripped of important structural properties.

And of course my reaction again is, "There is no pivotal act which uses only that cognitive capability."

[Ngo][13:20][13:21][13:26]

I guess that my (fairly strong) prior here is that something like self-modelling, which is very deeply built into basically every organism, is a very hard thing for an AI to stumble across by accident without significant optimisation pressure in that direction.

But I'm not sure how to argue this except by digging into your views on what the deep problem-solving patterns are. So if you're still willing to briefly try and explain those, that'd be useful to me.

"Causal impact" again seems like a very continuous variable - it seems like the amount of causal impact you need to do good science is much less than the amount which is needed to, say, be a CEO.

[Yudkowsky][13:26]

The amount doesn't seem like the key thing, nearly so much as what underlying facilities you need to do whatever amount of it you need.

[Ngo][13:27]

Agreed.

[Yudkowsky][13:27]

If you go back to the 16th century and ask for just one mRNA vaccine, that's not much of a difference from asking for a ~~million~~ hundred of them.

[Ngo][13:28]

Right, so the additional premise which I'm using here is that the ability to reason about causally impacting the world in order to achieve goals is something that you can have a little bit of.

Or a lot of, and that the difference between these might come down to the training data used.

Which at this point I don't expect you to agree with.

[Yudkowsky][13:29]

If you have reduced a pivotal act to "look over the data from this hadron collider you neither built nor ran yourself", that really is a structural step down from "do science" or "build a nanomachine". But I can't see any pivotal acts like that, so is that question much of a crux?

If there's intermediate steps they might be described in my native language like "reason about causal impacts across only this one preprogrammed domain which you didn't learn in a general way, in only this part of the cognitive architecture that is separable from the rest of the cognitive architecture".

[Ngo][13:31]

Perhaps another way of phrasing this intermediate step is that the agent has a shallow understanding of how to induce causal impacts.

[Yudkowsky][13:31]

What is "shallow" to you?

[Ngo][13:31]

In a similar way to how you claim that GPT-3 has a shallow understanding of language.

[Yudkowsky][13:32]

So it's memorized a ton of shallow causal-impact-inducing patterns from a large dataset, and this can be verified by, for example, presenting it with an example mildly outside the dataset and watching it fail, which we think will confirm our hypothesis that it didn't learn any deep ways of solving that dataset.

[Ngo][13:33]

Roughly speaking, yes.

[Yudkowsky][13:34]

Eg, it wouldn't surprise us at all if GPT-4 had learned to predict "27 * 18" but not "what is the area of a rectangle 27 meters by 18 meters"... is what I'd like to say, but Codex sure did demonstrate those two were kinda awfully proximal.
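
A minimal sketch of the out-of-dataset test described above, added for illustration and not part of the original exchange. The names `memorizer`, `generalizer`, and `accuracy` are hypothetical, and a lookup table is only a crude stand-in for "shallow patterns"; as the Codex remark above notes, real models may sit much closer to the generalizing case than this toy suggests.

```python
# Toy illustration: a model that memorized a multiplication table versus one
# that learned the rule. Both "models" are hypothetical stand-ins.

# "Training set": every product of two single-digit numbers.
TRAIN = {(a, b): a * b for a in range(10) for b in range(10)}

def memorizer(a, b):
    """Answers only by lookup; returns None for anything it has not seen."""
    return TRAIN.get((a, b))

def generalizer(a, b):
    """Has the underlying rule, so the training range does not matter."""
    return a * b

def accuracy(model, cases):
    """Fraction of cases where the model's answer equals the true product."""
    return sum(model(a, b) == a * b for a, b in cases) / len(cases)

in_distribution = [(3, 4), (9, 9), (0, 7)]
mildly_outside = [(27, 18), (10, 3), (12, 12)]

for name, model in [("memorizer", memorizer), ("generalizer", generalizer)]:
    print(name, accuracy(model, in_distribution), accuracy(model, mildly_outside))
```

In this toy, both models are perfect on the in-distribution cases; only the held-out cases tell them apart, which is the shape of the verification being proposed.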

[Ngo][13:34]

Here's one way we could flesh this out. Imagine an agent that loses coherence quickly when it's trying to act in the world.

So for example, we've trained it to do scientific experiments over a period of a few hours or days

And then it's very good at understanding the experimental data and extracting patterns from it

But upon running it for a week or a month, it loses coherence in a similar way to how GPT-3 loses coherence - e.g. it forgets what it's doing.

My story for why this might happen is something like: there is a specific skill of having long-term memory, and we never trained our agent to have this skill, and so it has not acquired that skill (even though it can reason in very general and powerful ways in the short term).

This feels similar to the argument I was making before about how an agent might lack self-awareness, if we haven't trained it specifically to have that.

[Yudkowsky][13:39]

There's a set of obvious-to-me tactics for doing a pivotal act with minimal danger, which I do not think collectively make the problem safe, and one of these sets of tactics is indeed "Put a limit on the 'attention window' or some other internal parameter, ramp it up slowly, don't ramp it any higher than you needed to solve the problem."

[Ngo][13:41]

You could indeed do this manually, but my expectation is that you could also do this automatically, by training agents in environments where they don't benefit from having long attention spans.

[Yudkowsky][13:42]

(Any time one imagines a specific tactic of this kind, if one has the security mindset, one can also imagine all sorts of ways it might go wrong; for example, an attention window can be defeated if there's any aspect of the attended data or the internal state that ended up depending on past events in a way that leaked info about them. But, depending on how much superintelligence you were throwing around elsewhere, you could maybe get away with that, some of the time.)
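
The leak described in that parenthetical can be sketched in a few lines, added here for illustration and not part of the original exchange. Everything in it is an assumption: the `running_total` field is a hypothetical piece of attended data that happens to depend on past events, and `windowed_view` is a hypothetical hard limit on what the agent may attend to.

```python
from typing import List, Tuple

# Each observation is (event, running_total). The running total is a field the
# environment happens to compute from *all* events so far: an accidental side channel.
Observation = Tuple[int, int]

def make_stream(events: List[int]) -> List[Observation]:
    """Attach a running total to each event."""
    stream, total = [], 0
    for e in events:
        total += e
        stream.append((e, total))
    return stream

def windowed_view(stream: List[Observation], window: int) -> List[Observation]:
    """The only data the agent may attend to: the last `window` observations."""
    return stream[-window:]

def infer_hidden_sum(view: List[Observation]) -> int:
    """Recover the sum of all events outside the window, using the view alone."""
    visible_events = sum(e for e, _ in view)
    latest_total = view[-1][1]
    return latest_total - visible_events

events = [7, 3, 9, 1, 4, 2]
view = windowed_view(make_stream(events), window=2)
hidden = infer_hidden_sum(view)
print(len(view), hidden == sum(events[:-2]))
```

The window really does hide the four oldest events, yet their sum is recoverable from what remains, because one attended field depended on them.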

[Ngo][13:43]

And that if you put agents in environments where they answer questions but don't interact much with the physical world, then there will be many different traits which are necessary for achieving goals in the real world which they will lack, because there was little advantage to the optimiser of building those traits in.

[Yudkowsky][13:43]

I'll observe that Transformer-XL built an attention window that generalized, trained it on I think 380 tokens or something like that, and then found that it generalized to 4000 tokens or something like that.

[Ngo][13:43]

Yeah, an order of magnitude of generalisation is not surprising to me.

[Yudkowsky][13:44]

Having observed one order of magnitude, I would personally not be surprised by two orders of magnitude either, after seeing that.

[Ngo][13:45]

I'd be a little surprised, but I assume it would happen eventually.

1.3. Capability dials

[Yudkowsky][13:46]

I have a sense that this is all circling back to the question, "But what is it we do with the intelligence thus weakened?" If you can save the world using a rock, I can build you a very safe rock.

[Ngo][13:46]

Right.

So far I've said "alignment research", but I haven't been very specific about it.

I guess some context here is that I expect that the first things we do with intelligence similar to this is create great wealth, produce a bunch of useful scientific advances, etc.

And that we'll be in a world where people take the prospect of AGI much more seriously

[Yudkowsky][13:48]

I mostly expect - albeit with some chance that reality says "So what?" to me and surprises me, because it is not as solidly determined as some other things - that we do not hang around very long in the "weirdly ~human AGI" phase before we get into the "if you crank up this AGI it destroys the world" phase. Less than 5 years, say, to put numbers on things.

It would not surprise me in the least if the world ends before self-driving cars are sold on the mass market. On some quite plausible scenarios which I think have >50% of my probability mass at the moment, research AGI companies would be able to produce prototype car-driving AIs if they spent time on that, given the near-world-ending tech level; but there will be Many Very Serious Questions about this relatively new unproven advancement in machine learning being turned loose on the roads. And their AGI tech will gain the property "can be turned up to destroy the world" before Earth gains the property "you're allowed to sell self-driving cars on the mass market" because there just won't be much time.

[Ngo][13:52]

Then I expect that another thing we do with this is produce a very large amount of data which rewards AIs for following human instructions.

[Yudkowsky][13:52]

On other scenarios, of course, self-driving becomes possible by limited AI well before things start to break (further) on AGI. And on some scenarios, the way you got to AGI was via some breakthrough that is already scaling pretty fast, so by the time you can use the tech to get self-driving cars, that tech already ends the world if you turn up the dial, or that event follows very swiftly.

[Ngo][13:53]

When you talk about "cranking up the AGI", what do you mean?

Using more compute on the same data?

[Yudkowsky][13:53]

Running it with larger bounds on the for loops, over more GPUs, to be concrete about it.

[Ngo][13:53]

In a RL setting, or a supervised, or unsupervised learning setting?

Also: can you elaborate on the for loops?

[Yudkowsky][13:56]

I do not quite think that gradient descent on Stack More Layers alone - as used by OpenAI for GPT-3, say, and as opposed to DeepMind which builds more complex artifacts like MuZero or AlphaFold 2 - is liable to be the first path taken to AGI. I am reluctant to speculate more in print about clever ways to AGI, and I think any clever person out there will, if they are really clever and not just a fancier kind of stupid, not talk either about what they think is missing from Stack More Layers or how you would really get AGI. That said, the way that you cannot just run GPT-3 at a greater search depth, the way you can run MuZero at a greater search depth, is part of why I think that AGI is not likely to look exactly like GPT-3; the thing that kills us is likely to be a thing that can get more dangerous when you turn up a dial on it, not a thing that intrinsically has no dials that can make it more dangerous.
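
To make "larger bounds on the for loops" concrete, here is a toy sketch added for illustration and not part of the original exchange. It is a brute-force planner, not MuZero or any real system; the puzzle, the `ACTIONS` table, and `best_plan` are all hypothetical. The only point is that `depth` is a dial: the code is unchanged, and a larger loop bound finds strictly more capable plans.

```python
from itertools import product

# A hypothetical puzzle: starting from an integer, get as close to a target as
# possible using these two moves.
ACTIONS = {"add3": lambda x: x + 3, "double": lambda x: x * 2}

def best_plan(start, target, depth):
    """Exhaustively search all action sequences of length <= depth.

    Returns (plan, distance), where distance is how far the best plan ends
    from the target. `depth` is the dial: nothing else about the code changes.
    """
    best = ((), abs(target - start))
    for n in range(1, depth + 1):                  # the "for loop" bound
        for plan in product(ACTIONS, repeat=n):
            x = start
            for name in plan:
                x = ACTIONS[name](x)
            dist = abs(target - x)
            if dist < best[1]:
                best = (plan, dist)
    return best

for depth in range(7):
    plan, dist = best_plan(start=1, target=50, depth=depth)
    print(depth, dist, plan)
```

Because shorter plans stay in the search space as `depth` grows, the best distance can only improve or stay the same; in this instance it reaches zero at depth 6.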

1.4. Consequentialist goals vs. deontologist goals

[Ngo][13:59]

Hmm, okay. Let's take a quick step back and think about what would be useful for the last half hour.

I want to flag that my intuitions about pivotal acts are not very specific; I'm quite uncertain about how the geopolitics of that situation would work, as well as the timeframe between somewhere-near-human-level AGI and existential risk AGI.

So we could talk more about this, but I expect there'd be a lot of me saying "well we can't rule out that X happens", which is perhaps not the most productive mode of discourse.

A second option is digging into your intuitions about how cognition works.

[Yudkowsky][14:03]

Well, obviously, in the limit of alignment not being accessible to our civilization, and my successfully building a model weaker than reality which nonetheless correctly rules out alignment being accessible to our civilization, I could spend the rest of my short remaining lifetime arguing with people whose models are weak enough to induce some area of ignorance where for all they know you could align a thing. But that is predictably how conversations go in possible worlds where the Earth is doomed; so somebody wiser on the meta-level, though also ignorant on the object-level, might prefer to ask: "Where do you think your knowledge, rather than your ignorance, says that alignment ought to be doable and you will be surprised if it is not?"

[Ngo][14:07]

That's a fair point. Although it seems like a structural property of the "pivotal act" framing, which builds in doom by default.

[Yudkowsky][14:08]

We could talk about that, if you think it's a crux. Though I'm also not thinking that this whole conversation gets done in a day, so maybe for publishability reasons we should try to focus more on one line of discussion?

But I do think that lots of people get their optimism by supposing that the world can be saved by doing less dangerous things with an AGI. So it's a big ol' crux of mine on priors.

[Ngo][14:09]

Agreed that one line of discussion is better; I'm happy to work within the pivotal act framing for current purposes.

A third option is that I make some claims about how cognition works, and we see how much you agree with them.

[Yudkowsky][14:12]

(Though it's something of a restatement, a reason I'm not going into "my intuitions about how cognition works" is that past experience has led me to believe that conveying this info in a form that the Other Mind will actually absorb and operate, is really quite hard and takes a long discussion, relative to my current abilities to Actually Explain things; it is the sort of thing that might take doing homework exercises to grasp how one structure is appearing in many places, as opposed to just being flatly told that to no avail, and I have not figured out the homework exercises.)

I'm cheerful about hearing your own claims about cognition and disagreeing with them.

[Ngo][14:12]

Great

Okay, so one claim is that something like deontology is a fairly natural way for minds to operate.

[Yudkowsky][14:14]

("If that were true," he thought at once, "bureaucracies and books of regulations would be a lot more efficient than they are in real life.")

[Ngo][14:14]

Hmm, although I think this was probably not a very useful phrasing, let me think about how to rephrase it.

Okay, so in our earlier email discussion, we talked about the concept of "obedience".

To me it seems like it is just as plausible for a mind to have a concept like "obedience" as its rough goal, as a concept like maximising paperclips.

If we imagine training an agent on a large amount of data which pointed in the rough direction of rewarding obedience, for example, then I imagine that by default obedience would be a constraint of comparable strength to, say, the human survival instinct.

(Which is obviously not strong enough to stop humans doing a bunch of things that contradict it - but it's a pretty good starting point.)

[Yudkowsky][14:18]

Heh. You mean of comparable strength to the human instinct to explicitly maximize inclusive genetic fitness?

[Ngo][14:19]

Genetic fitness wasn't a concept that our ancestors were able to understand, so it makes sense that they weren't pointed directly towards it.

(And nor did they understand how to achieve it.)

[Yudkowsky][14:19]

Even in that paradigm, except insofar as you expect gradient descent to work very differently from gene-search optimization - which, admittedly, it does - when you optimize really hard on a thing, you get contextual correlates to it, not the thing you optimized on.

This is of course one of the Big Fundamental Problems that I expect in alignment.

[Ngo][14:20]

Right, so the main correlate that I've seen discussed is "do what would make the human give you a high rating, not what the human actually wants"

One thing I'm curious about is the extent to which you're concerned about this specific correlate, versus correlates in general.

[Yudkowsky][14:21]

That said, I also see basic structural reasons why paperclips would be much easier to train than "obedience", even if we could magically instill simple inner desires that perfectly reflected the simple outer algorithm we saw ourselves as running over many particular instances of a loss function.

[Ngo][14:22]

I'd be interested in hearing what those are.

[Yudkowsky][14:22]

well, first of all, why is a book of regulations so much more unwieldy than a hunter-gatherer?

if deontology is just as good as consequentialism, y'know.

(do you want to try replying or should I just say?)

[Ngo][14:23]

Go ahead

I should probably clarify that I agree that you can't just replace consequentialism with deontology

The claim is more like: when it comes to high-level concepts, it's not clear to me why high-level consequentialist goals are more natural than high-level deontological goals.

[Yudkowsky][14:24]

I reply that reality is complicated, so when you pump a simple goal through complicated reality you get complicated behaviors required to achieve the goal. If you think of reality as a complicated function Input->Probability(Output), then even to get a simple Output or a simple partition on Output or a high expected score in a simple function over Output, you may need very complicated Input.

Humans don't trust each other. They imagine, "Well, if I just give this bureaucrat a goal, perhaps they won't reason honestly about what it takes to achieve that goal! Oh no! Therefore I will instead, being the trustworthy and accurate person that I am, reason myself about constraints and requirements on the bureaucrat's actions, such that, if the bureaucrat obeys these regulations, I expect the outcome of their action will be what I want."

But (compared to a general intelligence that observes and models complicated reality and does its own search to pick actions) an actually-effective book of regulations (implemented by some nonhuman mind with a large enough and perfect enough memory to memorize it) would tend to involve a (physically unmanageable) vast number of rules saying "if you observe this, do that" to follow all the crinkles of complicated reality as it can be inferred from observation.
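
The contrast between a simple goal pumped through a world model and a book of "if you observe this, do that" rules can be sketched as follows. This is an illustration added to the transcript, not part of the original exchange. Assumptions: the `world` function is a deterministic, deliberately tiny stand-in for Input->Probability(Output), and `score`, `consequentialist_policy`, and `compile_rulebook` are hypothetical names.

```python
from itertools import product

ACTIONS = (0, 1, 2)

def world(observation, action):
    """Toy stand-in for reality: the outcome depends on every observed bit."""
    return (sum(observation) + action) % 3  # hypothetical dynamics

def score(outcome):
    """A simple goal over outcomes: we just want outcome 0."""
    return 1 if outcome == 0 else 0

def consequentialist_policy(observation):
    """Search over actions using the world model; the goal itself stays tiny."""
    return max(ACTIONS, key=lambda a: score(world(observation, a)))

def compile_rulebook(n_bits):
    """The equivalent regulations: one 'observe this, do that' rule per observation."""
    return {obs: consequentialist_policy(obs)
            for obs in product((0, 1), repeat=n_bits)}

for n_bits in (4, 8, 12):
    print(n_bits, len(compile_rulebook(n_bits)))
```

The goal-directed policy is the same few lines at every size, while the rulebook that reproduces its behaviour doubles with each added bit of observation. This toy world is regular enough that a clever rule-writer could compress the table; the argument in the chat is that real observations offer far fewer such shortcuts.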

[Ngo][14:28]

> (Though it's something of a restatement, a reason I'm not going into "my intuitions about how cognition works" is that past experience has led me to believe that conveying this info in a form that the Other Mind will actually absorb and operate, is really quite hard and takes a long discussion, relative to my current abilities to Actually Explain things; it is the sort of thing that might take doing homework exercises to grasp how one structure is appearing in many places, as opposed to just being flatly told that to no avail, and I have not figured out the homework exercises.)

(As a side note: do you have a rough guess for when your work with Ajeya will be made public? If it's still a while away, I'm wondering whether it's still useful to have a rough outline of these intuitions even if it's in a form that very few people will internalise)

[Yudkowsky][14:30]

> (As a side note: do you have a rough guess for when your work with Ajeya will be made public? If it's still a while away, I'm wondering whether it's still useful to have a rough outline of these intuitions even if it's in a form that very few people will internalise)

Plausibly useful, but not to be attempted today, I think?

[Ngo][14:30]

Agreed.

[Yudkowsky][14:30]

(We are now theoretically in overtime, which is okay for me, but for you it is 11:30pm (I think?) and so it is on you to call when to halt, now or later.)

[Ngo][14:32]

Yeah, it's 11.30 for me. I think probably best to halt here. I agree with all the things you just said about reality being complicated, and why consequentialism is therefore valuable. My "deontology" claim (which was, in its original formulation, far too general - apologies for that) was originally intended as a way of poking into your intuitions about which types of cognition are natural or unnatural, which I think is the topic we've been circling around for a while.

[Yudkowsky][14:33]

Yup, and a place to resume next time might be why I think "obedience" is unnatural compared to "paperclips" - though that is a thing that probably requires taking that stab at what underlies surface competencies.

[Ngo][14:34]

Right. I do think that even a vague gesture at that would be reasonably helpful (assuming that this doesn't already exist online?)

[Yudkowsky][14:34]

Not yet afaik, and I don't want to point you to Ajeya's stuff even if she were ok with that, because then this in-context conversation won't make sense to others.

[Ngo][14:35]

For my part I should think more about pivotal acts that I'd be willing to specifically defend.

In any case, thanks for the discussion 🙂

Let me know if there's a particular time that suits you for a follow-up; otherwise we can sort it out later.

[Soares][14:37]

(y'all are doing all my jobs for me)

[Yudkowsky][14:37]

could try Tuesday at this same time - though I may be in worse shape for dietary reasons, still, seems worth trying.

[Soares][14:37]

(wfm)

[Ngo][14:39]

Tuesday not ideal, any others work?

[Yudkowsky][14:39]

Wednesday?

[Ngo][14:40]

Yes, Wednesday would be good

[Yudkowsky][14:40]

let's call it tentatively for that

[Soares][14:41]

Great! Thanks for the chats.

[Ngo][14:41]

Thanks both!

[Yudkowsky][14:41]

Thanks, Richard!

2. Follow-ups

2.1. Richard Ngo's summary

[Tallinn][0:35] (Sep. 6)

just caught up here & wanted to thank nate, eliezer and (especially) richard for doing this! it's great to see eliezer's model being probed so intensively. i've learned a few new things (such as the genetic bottleneck being plausibly a big factor in human cognition). FWIW, a minor comment re deontology (as that's fresh on my mind): in my view deontology is more about coordination than optimisation: deontological agents are more trustworthy, as they're much easier to reason about (in the same way that functional/declarative code is easier to reason about than imperative code). hence my steelman of bureaucracies (as well as social norms): humans just (correctly) prefer their fellow optimisers (including non-human optimisers) to be deontological for trust/coordination reasons, and are happy to pay the resulting competence tax.
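
The programming analogy in that comment can be illustrated with a small sketch, added here and not part of the original exchange. The `total_pure` function and `Till` class are hypothetical; the point is only that the functional version can be understood from its arguments alone, while the imperative version's result depends on the history of mutations.

```python
def total_pure(prices, tax_rate):
    """Functional style: the result depends only on the arguments."""
    return sum(p * (1 + tax_rate) for p in prices)

class Till:
    """Imperative style: the result depends on what was mutated, and when."""
    def __init__(self):
        self.tax_rate = 0.0
        self.total = 0.0

    def add(self, price):
        # Uses whatever tax_rate happens to be set at the moment of the call.
        self.total += price * (1 + self.tax_rate)

# Same prices, same intended tax rate, different answers.
pure = total_pure([10, 20], tax_rate=0.5)

till = Till()
till.add(10)          # tax_rate was still 0.0 here
till.tax_rate = 0.5   # set too late for the first item
till.add(20)

print(pure, till.total)
```

To predict `till.total` a reader has to trace every earlier statement; to predict `pure` they only need the call itself, which is the sense of "easier to reason about" being invoked.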

[Ngo][3:10] (Sep. 8)

Thanks Jaan! I agree that greater trust is a good reason to want agents which are deontological at some high level.

I've attempted a summary of the key points so far; comments welcome: [GDocs link]

[Ngo] (Sep. 8 Google Doc)

1st discussion

(Mostly summaries not quotations)

Eliezer, summarized by Richard: "To avoid catastrophe, whoever builds AGI first will have to a) align it to some extent, and b) decide not to scale it up beyond the point where their alignment techniques fail, and c) do some pivotal act that prevents others from scaling it up to that level. But ~~our alignment techniques will not be good enough~~ ~~our alignment techniques will be very far from adequate~~ on our current trajectory, our alignment techniques will be very far from adequate to create an AI that safely performs any such pivotal act."

[Yudkowsky][11:05] (Sep. 8 comment)

> will not be good enough

Are not presently on course to be good enough, missing by not a little. "Will not be good enough" is literally declaring for lying down and dying.

[Yudkowsky][16:03] (Sep. 9 comment)

> will [be very far from adequate]

Same problem as the last time I commented. I am not making an unconditional prediction about future failure as would be implied by the word "will". Conditional on current courses of action or their near neighboring courses, we seem to be well over an order of magnitude away from surviving, unless a miracle occurs. It's still in the end a result of people doing what they seem to be doing, not an inevitability.

[Ngo][5:10] (Sep. 10 comment)

Ah, I see. Does adding "on our current trajectory" fix this?

[Yudkowsky][10:46] (Sep. 10 comment)

Yes.

[Ngo] (Sep. 8 Google Doc)

Richard, summarized by Richard: "Consider the pivotal act of 'make a breakthrough in alignment research'. It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)."

Eliezer, summarized by Richard: "There’s a deep connection between solving intellectual problems and taking over the world - the former requires a powerful mind to think about domains that, when solved, render very cognitively accessible strategies that can do dangerous things. Even mathematical research is a goal-oriented task which involves identifying then pursuing instrumental subgoals - and if brains which evolved to hunt on the savannah can quickly learn to do mathematics, then it’s also plausible that AIs trained to do mathematics could quickly learn a range of other skills. Since almost nobody understands the deep similarities in the cognition required for these different tasks, the distance between AIs that are able to perform fundamental scientific research, and dangerously agentic AGIs, is smaller than almost anybody expects."

[Yudkowsky][11:05] (Sep. 8 comment)

> There’s a deep connection between solving intellectual problems and taking over the world

There's a deep connection by default between chipping flint handaxes and taking over the world, if you happen to learn how to chip handaxes in a very general way. "Intellectual" problems aren't special in this way. And maybe you could avert the default, but that would take some work and you'd have to do it before easier default ML techniques destroyed the world.

[Ngo] (Sep. 8 Google Doc)

Richard, summarized by Richard: "Our lack of understanding about how intelligence works also makes it easy to assume that traits which co-occur in humans will also co-occur in future AIs. But human brains are badly-optimised for tasks like scientific research, and well-optimised for seeking power over the world, for reasons including a) evolving while embodied in a harsh environment; b) the genetic bottleneck; c) social environments which rewarded power-seeking. By contrast, training neural networks on tasks like mathematical or scientific research optimises them much less for seeking power. For example, GPT-3 has knowledge and reasoning capabilities but little agency, and loses coherence when run for longer timeframes."

[Tallinn][4:19] (Sep. 8 comment)

> [well-optimised for] seeking power

male-female differences might be a datapoint here (annoying as it is to lean on pinker's point :))

[Yudkowsky][11:31] (Sep. 8 comment)

I don't think a female Eliezer Yudkowsky doesn't try to save / optimize / take over the world. Men may do that for nonsmart reasons; smart men and women follow the same reasoning when they are smart enough. Eg Anna Salamon and many others.

[Ngo] (Sep. 8 Google Doc)

Eliezer, summarized by Richard: "Firstly, there’s a big difference between most scientific research and the sort of pivotal act that we’re talking about - you need to explain how AIs with a given skill can be used to actually prevent dangerous AGIs from being built. Secondly, insofar as GPT-3 has little agency, that’s because it has memorised many shallow patterns in a way which won’t directly scale up to general intelligence. Intelligence instead consists of deep problem-solving patterns which link understanding and agency at a fundamental level."

3. September 8 conversation

3.1. The Brazilian university anecdote

[Yudkowsky][11:00]

(I am here.)

[Ngo][11:01]

Me too.

[Soares][11:01]

Welcome back!

(I'll mostly stay out of the way again.)

[Ngo][11:02]

Cool. Eliezer, did you read the summary - and if so, do you roughly endorse it?

Also, I've been thinking about the best way to approach discussing your intuitions about cognition. My guess is that starting with the obedience vs paperclips thread is likely to be less useful than starting somewhere else - e.g. the description you gave near the beginning of the last discussion, about "searching for states that get fed into a result function and then a result-scoring function".

[Yudkowsky][11:06]

made a couple of comments about phrasings in the doc

So, from my perspective, there's this thing where... it's really quite hard to teach certain general points by talking at people, as opposed to more specific points. Like, they're trying to build a perpetual motion machine, and even if you can manage to argue them into believing their first design is wrong, they go looking for a new design, and the new design is complicated enough that they can no longer be convinced that they're wrong because they managed to make a more complicated error whose refutation they couldn't keep track of anymore.

Teaching people to see an underlying structure in a lot of places is a very hard thing to teach in this way. Richard Feynman gave an example of the mental motion in his story that ends "Look at the water!", where people learned in classrooms about how "a medium with an index" is supposed to polarize light reflected from it, but they didn't realize that sunlight coming off of water would be polarized. My guess is that doing this properly requires homework exercises; and that, unfortunately from my own standpoint, it happens to be a place where I have extra math talent, the same way that eg Marcello is more talented at formally proving theorems than I happen to be; and that people without the extra math talent, have to do a lot more exercises than I did, and I don't have a good sense of which exercises to give them.

[Ngo][11:13]

I'm sympathetic to this, and can try to turn off skeptical-discussion-mode and turn on learning-mode, if you think that'll help.

[Yudkowsky][11:14]

There's a general insight you can have about how arithmetic is commutative, and for some people you can show them 1 + 2 = 2 + 1 and their native insight suffices to generalize over the 1 and the 2 to any other numbers you could put in there, and they realize that strings of numbers can be rearranged and all end up equivalent. For somebody else, when they're a kid, you might have to show them 2 apples and 1 apple being put on the table in a different order but ending up with the same number of apples, and then you might have to show them again with adding up bills in different denominations, in case they didn't generalize from apples to money. I can actually remember being a child young enough that I tried to add 3 to 5 by counting "5, 6, 7" and I thought there was some clever enough way to do that to actually get 7, if you tried hard.

Being able to see "consequentialism" is like that, from my perspective.

[Ngo][11:15]

Another possibility: can you trace the origins of this belief, and how it came out of your previous beliefs?

[Yudkowsky][11:15]

I don't know what homework exercises to give people to make them able to see "consequentialism" all over the place, instead of inventing slightly new forms of consequentialist cognition and going "Well, now that isn't consequentialism, right?"

Trying to say "searching for states that get fed into an input-result function and then a result-scoring function" was one attempt of mine to describe the dangerous thing in a way that would maybe sound abstract enough that people would try to generalize it more.
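(Illustrative sketch, not part of the original conversation.) The phrase "searching for states that get fed into an input-result function and then a result-scoring function" can be read as a very small program. The names below (`search`, `make_world`, `closeness_to_ten`) are hypothetical and chosen only for this toy; it is a sketch of the abstract shape under those assumptions, not a claim about how any real system is built.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")  # candidate inputs (actions, motor signals, plans, ...)
R = TypeVar("R")  # results those inputs lead to


def search(candidates: Iterable[A],
           result_of: Callable[[A], R],
           score: Callable[[R], float]) -> A:
    """Return the candidate whose result scores highest."""
    return max(candidates, key=lambda a: score(result_of(a)))


# Toy input-result function (hypothetical): taking a step from a start position.
def make_world(start: int) -> Callable[[int], int]:
    return lambda step: start + step


# Toy result-scoring function (hypothetical): prefer ending near position 10.
def closeness_to_ten(position: int) -> float:
    return -abs(position - 10)


steps = range(-10, 11)
best_from_3 = search(steps, make_world(3), closeness_to_ten)
best_from_8 = search(steps, make_world(8), closeness_to_ten)
```

In this toy, the chosen step differs between the two starting positions, but the end position is the same in both cases; the search itself never mentions how the candidates are generated or what the score "means".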

[Ngo][11:17]

Another possibility: can you describe the closest thing to real consequentialism in humans, and how it came about in us?

[Yudkowsky][11:18][11:21]

Ok, so, part of the problem is that... before you do enough homework exercises for whatever your level of talent is (and even I, at one point, had done little enough homework that I thought there might be a clever way to add 3 and 5 in order to get to 7), you tend to think that only the very crisp formal thing that's been presented to you, is the "real" thing.

Why would your engine have to obey the laws of thermodynamics? You're not building one of those Carnot engines you saw in the physics textbook!

Humans contain fragments of consequentialism, or bits and pieces whose interactions add up to partially imperfectly shadow consequentialism, and the critical thing is being able to see that the reason why humans' outputs 'work', in a sense, is because these structures are what is doing the work, and the work gets done because of how they shadow consequentialism and only insofar as they shadow consequentialism.

Put a human in one environment, it gets food. Put a human in a different environment, it gets food again. Wow, different initial conditions, same output! There must be things inside the human that, whatever else they do, are also along the way somehow effectively searching for motor signals such that food is the end result!

[Ngo][11:20]

To me it feels like you're trying to nudge me (and by extension whoever reads this transcript) out of a specific failure mode. If I had to guess, something like: "I understand what Eliezer is talking about so now I'm justified in disagreeing with it", or perhaps "Eliezer's explanation didn't make sense to me and so I'm justified in thinking that his concepts don't make sense". Is that right?

[Yudkowsky][11:22]

More like... from my perspective, even after I talk people out of one specific perpetual motion machine being possible, they go off and try to invent a different, more complicated perpetual motion machine.

And I am not sure what to do about that. It has been going on for a very long time from my perspective.

In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to - they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they'd spent a lot of time being exposed to over and over and over again in lots of blog posts.

Maybe there's no way to make somebody understand why corrigibility is "unnatural" except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell's attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.

Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"

And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.

And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.

I don't know how to solve this problem, which is why I'm falling back on talking about it at the meta-level.

[Ngo][11:30]

I'm reminded of a LW post called "Write a thousand roads to Rome", which iirc argues in favour of trying to explain the same thing from as many angles as possible in the hope that one of them will stick.

[Soares][11:31]

(Suggestion, not-necessarily-good: having named this problem on the meta-level, attempt to have the object-level debate, while flagging instances of this as it comes up.)

[Ngo][11:31]

I endorse Nate's suggestion.

And will try to keep the difficulty of the meta-level problem in mind and respond accordingly.

[Yudkowsky][11:33]

That (Nate's suggestion) is probably the correct thing to do. I name it out loud because sometimes being told about the meta-problem actually does help on the object problem. It seems to help me a lot and others somewhat less, but it does help others at all, for many others.

3.2. Brain functions and outcome pumps

[Yudkowsky][11:34]

So, do you have a particular question you would ask about input-seeking cognitions? I did try to say why I mentioned those at all (it's a different road to Rome on "consequentialism").

[Ngo][11:36]

Let's see. So the visual cortex is an example of quite impressive cognition in humans and many other animals. But I'd call this "pattern-recognition" rather than "searching for high-scoring results".

[Yudkowsky][11:37]

Yup! And it is no coincidence that there are no whole animals formed entirely out of nothing but a visual cortex!

[Ngo][11:37]

Okay, cool. So you'd agree that the visual cortex is doing something that's qualitatively quite different from the thing that animals overall are doing.

Then another question is: can you characterise searching for high-scoring results in non-human animals? Do they do it? Or are you mainly talking about humans and AGIs?

[Yudkowsky][11:39]

Also by the time you get to like the temporal lobes or something, there is probably some significant amount of "what could I be seeing that would produce this visual field?" that is searching through hypothesis-space for hypotheses with high plausibility scores, and for sure at the human level, humans will start to think, "Well, could I be seeing this? No, that theory has the following problem. How could I repair that theory?" But it is plausible that there is no low-level analogue of this in a monkey's temporal cortex; and even more plausible that the parts of the visual cortex, if any, which do anything analogous to this, are doing it in a relatively local and definitely very domain-specific way.

Oh, that's the cerebellum and motor cortex and so on, if we're talking about a cat or whatever. They have to find motor plans that result in their catching the mouse.

Just because the visual cortex isn't (obviously) running a search doesn't mean the rest of the animal isn't running any searches.

(On the meta-level, I notice myself hiccuping "But how could you not see that when looking at a cat?" and wondering what exercises would be required to teach that.)

[Ngo][11:41]

Well, I see something when I look at a cat, but I don't know how well it corresponds to the concepts you're using. So just taking it slowly for now.

I have the intuition, by the way, that the motor cortex is in some sense doing a similar thing to the visual cortex - just in reverse. So instead of taking low-level inputs and producing high-level outputs, it's taking high-level inputs and producing low-level outputs. Would you agree with that?

[Yudkowsky][11:43]

It doesn't directly parse in my ontology because (a) I don't know what you mean by 'high-level' and (b) whole Cartesian agents can be viewed as functions, that doesn't mean all agents can be viewed as non-searching pattern-recognizers.

That said, all parts of the cerebral cortex have surprisingly similar morphology, so it wouldn't be at all surprising if the motor cortex is doing something similar to visual cortex. (The cerebellum, on the other hand...)

[Ngo][11:44]

The signal from the visual cortex saying "that is a cat", and the signal to the motor cortex saying "grab that cup", are things I'd characterise as high-level.

[Yudkowsky][11:45]

Still less of a native distinction in my ontology, but there's an informal thing it can sort of wave at, and I can hopefully take that as understood and run with it.

[Ngo][11:45]

The firing of cells in the retina, and firing of motor neurons, are the low-level parts.

Cool. So to a first approximation, we can think about the part in between the cat recognising a mouse, and the cat's motor cortex producing the specific neural signals required to catch the mouse, as the part where the consequentialism happens?

[Yudkowsky][11:49]

The part between the cat's eyes seeing the mouse, and the part where the cat's limbs move to catch the mouse, is the whole cat-agent. The whole cat agent sure is a baby consequentialist / searches for mouse-catching motor patterns / gets similarly high-scoring end results even as you vary the environment.

The visual cortex is a particular part of this system-viewed-as-a-feedforward-function that is, plausibly, by no means surely, either not very searchy, or does only small local visual-domain-specific searches not aimed per se at catching mice; it has the epistemic nature rather than the planning nature.

Then from one perspective you could reason that "well, most of the consequentialism is in the remaining cat after visual cortex has sent signals onward". And this is in general a dangerous mode of reasoning that is liable to fail in, say, inspecting every particular neuron for consequentialism and not finding it; but in this particular case, there are significantly more consequentialist parts of the cat than the visual cortex, so I am okay running with it.

[Ngo][11:50]

Ah, the more specific thing I meant to say is: most of the consequentialism is strictly between the visual cortex and the motor cortex. Agree/disagree?

[Yudkowsky][11:51]

Disagree, I'm rusty on my neuroanatomy but I think the motor cortex may send signals on to the cerebellum rather than the other way around.

(I may also disagree with the actual underlying notion you're trying to hint at, so possibly not just a "well include the cerebellum then" issue, but I think I should let you respond first.)

[Ngo][11:53]

I don't know enough neuroanatomy to chase that up, so I was going to try a different tack.

But actually, maybe it's easier for me to say "let's include the cerebellum" and see where you think the disagreement ends up.

[Yudkowsky][11:56]

So since cats are not (obviously) (that I have read about) cross-domain consequentialists with imaginations, their consequentialism is in bits and pieces of consequentialism embedded in them all over by the more purely pseudo-consequentialist genetic optimization loop that built them.

A cat who fails to catch a mouse may then get little bits and pieces of catbrain adjusted all over.

And then those adjusted bits and pieces get a pattern lookup later.

Why do these pattern-lookups, with no obvious immediate search element, all happen to point in the same direction of catching the mouse? Because of the past causal history of what gets looked up, which was tweaked to catch the mouse.

So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either.

And yes, the same applies to humans, but humans also do more explicitly searchy things and this is part of the story for why humans have spaceships and cats do not.

[Ngo][12:00]

Okay, this is interesting. So in biological agents we've got these three levels of consequentialism: evolution, reinforcement learning, and planning.

[Yudkowsky][12:01]

In biological agents we've got evolution + local evolved system-rules that in the past promoted genetic fitness. Two kinds of local rules like this are "operant-conditioning updates from success or failure" and "search through visualized plans". I wouldn't characterize these two kinds of rules as "levels".

[Ngo][12:02]

Okay, I see. And when you talk about searching through visualised plans (the type of thing that humans do) can you say more about what it means for that to be a "search"?

For example, if I imagine writing a poem line-by-line, I may only be planning a few words ahead. But somehow the whole poem, which might be quite long, ends up a highly-optimised product. Is that a central example of planning?

[Yudkowsky][12:04][12:07]

Planning is one way to succeed at search. I think for purposes of understanding alignment difficulty, you want to be thinking on the level of abstraction where you see that in some sense it is the search itself that is dangerous when it's a strong enough search, rather than the danger seeming to come from details of the planning process.

One of my early experiences in successfully generalizing my notion of intelligence, what I'd later verbalize as "computationally efficient finding of actions that produce outcomes high in a preference ordering", was in writing an (unpublished) story about time-travel in which the universe was globally consistent.

The requirement of global consistency, the way in which all events between Paradox start and Paradox finish had to map the Paradox's initial conditions onto the endpoint that would go back and produce those exact initial conditions, ended up imposing strong complicated constraints on reality that the Paradox in effect had to navigate using its initial conditions. The time-traveler needed to end up going through certain particular experiences that would produce the state of mind in which he'd take the actions that would end up prodding his future self elsewhere into having those experiences.

The Paradox ended up killing the people who built the time machine, for example, because they would not otherwise have allowed that person to go back in time, or kept the temporal loop open that long for any other reason if they were still alive.

Just having two examples of strongly consequentialist general optimization in front of me - human intelligence, and evolutionary biology - hadn't been enough for me to properly generalize over a notion of optimization. Having three examples of homework problems I'd worked - human intelligence, evolutionary biology, and the fictional Paradox - caused it to finally click for me.

[Ngo][12:07]

Hmm. So to me, one of the central features of search is that you consider many possibilities. But in this poem example, I may only have explicitly considered a couple of possibilities, because I was only looking ahead a few words at a time. This seems related to the distinction Abram drew a while back between selection and control (https://www.alignmentforum.org/posts/ZDZmopKquzHYPRNxq/selection-vs-control). Do you distinguish between them in the same way as he does? Or does "control" of a system (e.g. a football player dribbling a ball down the field) count as search too in your ontology?

[Yudkowsky][12:10][12:11]

I would later try to tell people to "imagine a paperclip maximizer as not being a mind at all, imagine it as a kind of malfunctioning time machine that spits out outputs which will in fact result in larger numbers of paperclips coming to exist later". I don't think it clicked because people hadn't done the same homework problems I had, and didn't have the same "Aha!" of realizing how part of the notion and danger of intelligence could be seen in such purely material terms.

But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it's the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!

[Ngo][12:11]

Right, I remember a very similar idea in your writing about Outcome Pumps (https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes).

[Yudkowsky][12:12]

Yup! Alas, the story was written in 2002-2003 when I was a worse writer and the real story that inspired the Outcome Pump never did get published.

[Ngo][12:14]

Okay, so I guess the natural next question is: what is it that makes you think that a strong, effective search isn't likely to be limited or constrained in some way?

What is it about search processes (like human brains) that makes it hard to train them with blind spots, or deontological overrides, or things like that?

Hmmm, although it feels like this is a question I can probably predict your answer to. (Or maybe not, I wasn't expecting the time travel.)

[Yudkowsky][12:15]

In one sense, they are! A paperclip-maximizing superintelligence is nowhere near as powerful as a paperclip-maximizing time machine. The time machine can do the equivalent of buying winning lottery tickets from lottery machines that have been thermodynamically randomized; a superintelligence can't, at least not directly without rigging the lottery or whatever.

But a paperclip-maximizing strong general superintelligence is epistemically and instrumentally efficient, relative to you, or to me. Any time we see it can get at least X paperclips by doing Y, we should expect that it gets X or more paperclips by doing Y or something that leads to even more paperclips than that, because it's not going to miss the strategy we see.

So in that sense, searching our own brains for how a time machine would get paperclips, asking ourselves how many paperclips are in principle possible and how they could be obtained, is a way of getting our own brains to consider lower bounds on the problem without the implicit stupidity assertions that our brains unwittingly use to constrain story characters. Part of the point of telling people to think about time machines instead of superintelligences was to get past the ways they imagine superintelligences being stupid. Of course that didn't work either, but it was worth a try.

I don't think that's quite what you were asking about, but I want to give you a chance to see if you want to rephrase anything before I try to answer your me-reformulated questions.
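(Illustrative note, not part of the original conversation.) The "X or more paperclips" claim above can be written as a lower bound. The symbols are assumptions introduced here: S_us is the set of strategies we can see, S_AI is the set available to the agent, and U counts paperclips. The bound also assumes the idealization that the agent actually picks the best strategy available to it.

```latex
S_{\text{us}} \subseteq S_{\text{AI}}
\;\Longrightarrow\;
\max_{a \in S_{\text{AI}}} U(a)
\;\ge\;
\max_{a \in S_{\text{us}}} U(a)
\;\ge\;
U(Y) \;\ge\; X
\qquad \text{for any } Y \in S_{\text{us}} \text{ with } U(Y) \ge X .
```

Read this way, whatever we can work out ourselves only ever gives a floor on the outcome, never a ceiling.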

[Ngo][12:20]

Yeah, I think what I wanted to ask is more like: why should we expect that, out of the space of possible minds produced by optimisation algorithms like gradient descent, strong general superintelligences are more common than other types of agents which score highly on our loss functions?

[Yudkowsky][12:20][12:23][12:24]

It depends on how hard you optimize! And whether gradient descent on a particular system can even successfully optimize that hard! Many current AIs are trained by gradient descent and yet not superintelligences at all.

But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems.

I suspect that you cannot get this out of large amounts of gradient descent on large layered transformers, and therefore I suspect that GPT-N does not approach superintelligence before the world is ended by systems that look differently, but I could be wrong about that.

[Ngo][12:22][12:23]

Suppose that we optimise hard enough to produce an epistemic subsystem that can make plans much better than any human's.

My guess is that you'd say that this is possible, but that we're much more likely to first produce a consequentialist agent which does this (rather than a purely epistemic agent which does this).

[Yudkowsky][12:24]

I am confused by what you think it means to have an "epistemic subsystem" that "makes plans much better than any human's". If it searches paths through time and selects high-scoring ones for output, what makes it "epistemic"?

[Ngo][12:25]

Suppose, for instance, that it doesn't actually carry out the plans, it just writes them down for humans to look at.

[Yudkowsky][12:25]

If it can in fact do the thing that a paperclipping time machine does, what makes it any safer than a paperclipping time machine because we called it "epistemic" or by some other such name?

By what criterion is it selecting the plans that humans look at?

Why did it make a difference that its output was fed through the causal systems called humans on the way to the causal systems called protein synthesizers or the Internet or whatever? If we build a superintelligence to design nanomachines, it makes no obvious difference to its safety whether it sends DNA strings directly to a protein synthesis lab, or humans read the output and retype it manually into an email. Presumably you also don't think that's where the safety difference comes from. So where does the safety difference come from?

(note: lunchtime for me in 2 minutes, propose to reconvene in 30m after that)

[Ngo][12:28]

(break for half an hour sounds good)

If we consider the visual cortex at a given point in time, how does it decide which objects to recognise?

Insofar as the visual cortex can be non-consequentialist about which objects it recognises, why couldn't a planning system be non-consequentialist about which plans it outputs?

[Yudkowsky][12:32]

This does feel to me like another "look at the water" moment, so what do you predict I'll say about that?

[Ngo][12:34]

I predict that you say something like: in order to produce an agent that can create very good plans, we need to apply a lot of optimisation power to that agent. And if the channel through which we're applying that optimisation power is "giving feedback on its plans", then we don't have a mechanism to ensure that the agent actually learns to optimise for creating really good plans, as opposed to creating plans that receive really good feedback.

[Soares][12:35]

Seems like a fine cliffhanger?

[Ngo][12:35]

Yepp.

[Soares][12:35]

Great. Let's plan to reconvene in 30min.

3.3. Hypothetical-planning systems, nanosystems, and evolving generality

[Yudkowsky][13:03][13:11]

So the answer you expected from me, translated into my terms, would be, "If you select for the consequence of the humans hitting 'approve' on the plan, you're still navigating the space of inputs for paths through time to probable outcomes (namely the humans hitting 'approve'), so you're still doing consequentialism."

But suppose you manage to avoid that. Suppose you get exactly what you ask for. Then the system is still outputting plans such that, when humans follow them, they take paths through time and end up with outcomes that score high in some scoring function.

My answer is, "What the heck would it mean for a planning system to be non-consequentialist? You're asking for nonwet water! What's consequentialist isn't the system that does the work, it's the work you're trying to do! You could imagine it being done by a cognition-free material system like a time machine and it would still be consequentialist because the output is a plan, a path through time!"

And this indeed is a case where I feel a helpless sense of not knowing how I can rephrase things, which exercises you have to get somebody to do, what fictional experience you have to walk somebody through, before they start to look at the water and see a material with an index, before they start to look at the phrase "why couldn't a planning system be non-consequentialist about which plans it outputs" and go "um".

My imaginary listener now replies, "Ah, but what if we have plans that don't end up with outcomes that score high in some function?" and I reply "Then you lie on the ground randomly twitching because any outcome you end up with which is not that is one that you wanted more than that meaning you preferred it more than the outcome of random motor outputs which is optimization toward higher in the preference function which is taking a path through time that leads to particular destinations more than it leads to random noise."

[Ngo][13:09][13:11]

Yeah, this does seem like a good example of the thing you were trying to explain at the beginning.

It still feels like there's some sort of levels distinction going on here though, let me try to tease out that intuition.

Okay, so suppose I have a planning system that, given a situation and a goal, outputs a plan that leads from that situation to that goal.

And then suppose that we give it, as input, a situation that we're not actually in, and it outputs a corresponding plan.

It seems to me that there's a difference between the sense in which that planning system is consequentialist by virtue of making consequentialist plans (as in: if that plan were used in the situation described in its inputs, it would lead to some goal being achieved) versus another hypothetical agent that is just directly trying to achieve goals in the situation it's actually in.

[Yudkowsky][13:18]

So I'd preface by saying that, if you could build such a system, which is indeed a coherent thing (it seems to me) to describe for the purpose of building it, then there would possibly be a safety difference on the margins, it would be noticeably less dangerous though still dangerous. It would need a special internal structural property that you might not get by gradient descent on a loss function with that structure, just like natural selection on inclusive genetic fitness doesn't get you explicit fitness optimizers; you could optimize for planning in hypothetical situations, and get something that didn't explicitly care only and strictly about hypothetical situations. And even if you did get that, the outputs that would kill or brain-corrupt the operators in hypothetical situations might also be fatal to the operators in actual situations. But that is a coherent thing to describe, and the fact that it was not optimizing our own universe, might make it safer.

With that said, I would worry that somebody would think there was some bone-deep difference of agentiness, of something they were empathizing with like personhood, of imagining goals and drives being absent or present in one case or the other, when they imagine a planner that just solves "hypothetical" problems. If you take that planner and feed it the actual world as its hypothetical, tada, it is now that big old dangerous consequentialist you were imagining before, without it having acquired some difference of psychological agency or 'caring' or whatever.

So I think there is an important homework exercise to do here, which is something like, "Imagine that safe-seeming system which only considers hypothetical problems. Now see that if you take that system, don't make any other internal changes, and feed it actual problems, it's very dangerous. Now meditate on this until you can see how the hypothetical-considering planner was extremely close in the design space to the more dangerous version, had all the dangerous latent properties, and would probably have a bunch of actual dangers too."

"See, you thought the source of the danger was this internal property of caring about actual reality, but it wasn't that, it was the structure of planning!"
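(Illustrative sketch, not part of the original conversation.) The homework exercise above can be made concrete with a toy planner. Everything here is hypothetical: the `plan` function, the `World` type, and the two tiny worlds exist only to show that the planner takes a world description as plain input and has no separate notion of "hypothetical" versus "actual". It says nothing about how hard such a planner would be to build or how it would behave at scale.

```python
from collections import deque
from typing import Dict, List, Optional

World = Dict[str, List[str]]  # state -> states reachable in one step


def plan(world: World, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest sequence of states from start to goal."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in world.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # no path exists


# A made-up situation we are not actually in (hypothetical data).
hypothetical_world: World = {"a": ["b"], "b": ["c"], "c": []}

# A stand-in for a description of the situation we are in (also toy data).
actual_world: World = {"home": ["street"], "street": ["shop"], "shop": []}

# Same function, no internal changes; only the input differs.
hypothetical_plan = plan(hypothetical_world, "a", "c")
actual_plan = plan(actual_world, "home", "shop")
```

Nothing inside `plan` changes between the two calls; in this toy, the only difference between the "hypothetical" planner and the "actual" one is which dictionary the caller passes in.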

[Ngo][13:22]

I think we're getting closer to the same page now.

Let's consider this hypothetical planner for a bit. Suppose that it was trained in a way that minimised the, let's say, adversarial component of its plans.

For example, let's say that the plans it outputs for any situation are heavily regularised so only the broad details get through.

Hmm, I'm having a bit of trouble describing this, but basically I have an intuition that in this scenario there's a component of its plan which is cooperative with whoever executes the plan, and a component that's adversarial.

And I agree that there's no fundamental difference in type between these two things.

[Yudkowsky][13:27]

"What if this potion we're brewing has a Good Part and a Bad Part, and we could just keep the Good Parts..."

[Ngo][13:27]

Nor do I think they're separable. But in some cases, you might expect one to be much larger than the other.

[Soares][13:29]

(I observe that my model of some other listeners, at this point, protest "there is yet a difference between the hypothetical-planner applied to actual problems, and the Big Scary Consequentialist, which is that the hypothetical planner is emitting descriptions of plans that would work if executed, whereas the big scary consequentialist is executing those plans directly.")

(Not sure that's a useful point to discuss, or if it helps Richard articulate, but it's at least a place I expect some reader's minds to go if/when this is published.)

[Yudkowsky][13:30]

(That is in fact a difference! The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.)

[Ngo][13:31]

To me it seems that Eliezer's position is something like: "actually, in almost no training regimes do we get agents that decide which plans to output by spending almost all of their time thinking about the object-level problem, and very little of their time thinking about how to manipulate the humans carrying out the plan".

[Yudkowsky][13:32]

My position is that the AI does not neatly separate its internals into a Part You Think Of As Good and a Part You Think Of As Bad, because that distinction is sharp in your map but not sharp in the territory or the AI's map.

From the perspective of a paperclip-maximizing-action-outputting-time-machine, its actions are not "object-level making paperclips" or "manipulating the humans next to the time machine to deceive them about what the machine does", they're just physical outputs that go through time and end up with paperclips.

[Ngo][13:34]

@Nate, yeah, that's a nice way of phrasing one point I was trying to make. And I do agree with Eliezer that these things can be very similar. But I'm claiming that in some cases these things can also be quite different - for instance, when we're training agents that only get to output a short high-level description of the plan.

[Yudkowsky][13:35]

The danger is in how hard the agent has to work to come up with the plan. I can, for instance, build an agent that very safely outputs a high-level plan for saving the world:

echo "Hey Richard, go save the world!"

So I do have to ask what kind of "high-level" planning output, that saves the world, you are envisioning, and why it was hard to cognitively come up with such that we didn't just make that high-level plan right now, if humans could follow it. Then I'll look at the part where the plan was hard to come up with, and say how the agent had to understand lots of complicated things in reality and accurately navigate paths through time for those complicated things, in order to even invent the high-level plan, and hence it was very dangerous if it wasn't navigating exactly where you hoped. Or, alternatively, I'll say, "That plan couldn't save the world: you're not postulating enough superintelligence to be dangerous, and you're also not using enough superintelligence to flip the tables on the currently extremely doomed world."

[Ngo][13:39]

At this point I'm not envisaging a particular planning output that saves the world, I'm just trying to get more clarity on the issue of consequentialism.

[Yudkowsky][13:40]

Look at the water; it's not the way you're doing the work that's dangerous, it's the work you're trying to do. What work are you trying to do, never mind how it gets done?

[Ngo][13:41]

I think I agree with you that, in the limit of advanced capabilities, we can't say much about how the work is being done, we have to primarily reason from the work that we're trying to do.

But here I'm only talking about systems that are intelligent enough to come up with plans and do research that are beyond the capability of humanity.

And for me the question is: for those systems, can we tilt the way they do the work so they spend 99% of their time trying to solve the object-level problem, and 1% of their time trying to manipulate the humans who are going to carry out the plan? (Where these are not fundamental categories for the AI, they're just a rough categorisation that emerges after we've trained it - the same way that the categories of "physically moving around" and "thinking about things" aren't fundamentally different categories of action for humans, but the way we've evolved means there's a significant internal split between them.)

[Soares][13:43]

(I suspect Eliezer is not trying to make a claim of the form "in the limit of advanced capabilities, we are relegated to reasoning about what work gets done, not about how it was done". I suspect some miscommunication. It might be a reasonable time for Richard to attempt to paraphrase Eliezer's argument?)

(Though it also seems to me like Eliezer responding to the 99%/1% point may help shed light.)

[Yudkowsky][13:46]

Well, for one thing, I'd note that a system which is designing nanosystems, and spending 1% of its time thinking about how to kill the operators, is lethal. It has to be such a small fraction of thinking that it, like, never completes the whole thought about "well, if I did X, that would kill the operators!"

[Ngo][13:46]

Thanks for that, Nate. I'll try to paraphrase Eliezer's argument now.

Eliezer's position (partly in my own terminology): we're going to build AIs that can perform very difficult tasks using cognition which we can roughly describe as "searching over many options to find one that meets our criteria". An AI that can solve these difficult tasks will need to be able to search in a very general and flexible way, and so it will be very difficult to constrain that search into a particular region.

Hmm, that felt like a very generic summary, let me try and think about the more specific claims he's making.

[Yudkowsky][13:54]

An AI that can solve these difficult tasks will need to be able to

Very very little is universally necessary over the design space. The first AGI that our tech becomes able to build is liable to work in certain easier and simpler ways.

[Ngo][13:55]

Point taken; thanks for catching this misphrasing (this and previous times).

[Yudkowsky][13:56]

Can you, in principle, build a red-car-driver that is totally incapable of driving blue cars? In principle, sure! But the first red-car-driver that gradient descent stumbles over is liable to be a blue-car-driver too.

[Ngo][13:57]

Eliezer, I'm wondering how much of our disagreement is about how high the human level is here.

Or, to put it another way: we can build systems that outperform humans at quite a few tasks by now, without having search abilities that are general enough to even try to take over the world.

[Yudkowsky][13:58]

Indubitably and indeed, this is so.

[Ngo][13:59]

Putting aside for a moment the question of which tasks are pivotal enough to save the world, which parts of your model draw the line between human-level chess players and human-level galaxy-colonisers?

And say that we'll be able to align ones that outperform us on these tasks before taking over the world, but not on these other tasks?

[Yudkowsky][13:59][14:01]

That doesn't have a very simple answer, but one aspect there is domain generality which in turn is achieved through novel domain learning.

Humans, you will note, were not aggressively optimized by natural selection to be able to breathe underwater or fly into space. In terms of obvious outer criteria, there is not much outer sign that natural selection produced these creatures much more general than chimpanzees, by training on a much wider range of environments and loss functions.

[Soares][14:00]

(Before we drift too far from it: thanks for the summary! It seemed good to me, and I updated towards the miscommunication I feared not-having-happened.)

[Ngo][14:03]

(Before we drift too far from it: thanks for the summary! It seemed good to me, and I updated towards the miscommunication I feared not-having-happened.)

(Good to know, thanks for keeping an eye out. To be clear, I didn't ever interpret Eliezer as making a claim explicitly about the limit of advanced capabilities; instead it just seemed to me that he was thinking about AIs significantly more advanced than the ones I've been thinking of. I think I phrased my point poorly.)

[Yudkowsky][14:05][14:10]

There are complicated aspects of this story where natural selection may metaphorically be said to have "had no idea of what it was doing", eg, after early rises in intelligence possibly produced by sexual selection on neatly chipped flint handaxes or whatever, all the cumulative brain-optimization on chimpanzees reached a point where there was suddenly a sharp selection gradient on relative intelligence at Machiavellian planning against other humans (even more so than in the chimp domain) as a subtask of inclusive genetic fitness, and so continuing to optimize on "inclusive genetic fitness" in the same old savannah, turned out to happen to be optimizing hard on the subtask and internal capability of "outwit other humans", which optimized hard on "model other humans", which was a capability that could be reused for modeling the chimp-that-is-this-chimp, which turned the system on itself and made it reflective, which contributed greatly to its intelligence being generalized, even though it was just grinding the same loss function on the same savannah; the system being optimized happened to go there in the course of being optimized even harder for the same thing.

So one can imagine asking the question: Is there a superintelligent AGI that can quickly build nanotech, which has a kind of passive safety in some if not all respects, in virtue of it solving problems like "build a nanotech system which does X" the way that a beaver solves building dams, in virtue of having a bunch of specialized learning abilities without it ever having a cross-domain general learning ability?

And in this regard one does note that there are many, many, many things that humans do which no other animal does, which you might think would contribute a lot to that animal's fitness if there were animalistic ways to do it. They don't make iron claws for themselves. They never did evolve a tendency to search for iron ore, and burn wood into charcoal that could be used in hardened-clay furnaces.

No animal plays chess, but AIs do, so we can obviously make AIs to do things that animals don't do. On the other hand, the environment didn't exactly present any particular species with a challenge of chess-playing either.

Even so, though, even if some animal had evolved to play chess, I fully expect that current AI systems would be able to squish it at chess, because the AI systems are on chips that run faster than neurons and do crisp calculations, and there are things you just can't do with noisy slow neurons. So that again is not a generally reliable argument about what AIs can do.

[Ngo][14:09][14:11]

Yes, although I note that challenges which are trivial from a human-engineering perspective can be very challenging from an evolutionary perspective (e.g. spinning wheels).

And so the evolution of animals-with-a-little-bit-of-help-from-humans might end up in very different places from the evolution of animals-just-by-themselves. And analogously, the ability of humans to fill in the gaps to help less general AIs achieve more might be quite significant.

[Yudkowsky][14:11]

So we can again ask: Is there a way to make an AI system that is only good at designing nanosystems, which can achieve some complicated but hopefully-specifiable real-world outcomes, without that AI also being superhuman at understanding and manipulating humans?

And I roughly answer, "Perhaps, but not by default, there's a bunch of subproblems, I don't actually know how to do it right now, it's not the easiest way to get an AGI that can build nanotech (and kill you), you've got to make the red-car-driver specifically not be able to drive blue cars." Can I explain how I know that? I'm really not sure I can, in real life where I explain X0 and then the listener doesn't generalize X0 to X and respecialize it to X1.

It's like asking me how I could possibly know in 2008, before anybody had observed AlphaFold 2, that superintelligences would be able to crack the protein folding problem on the way to nanotech, which some people did question back in 2008.

Though that was admittedly more of a slam-dunk than this was, and I could not have told you that AlphaFold 2 would become possible at a prehuman level of general intelligence in 2021 specifically, or that it would be synced in time to a couple of years after GPT-2's level of generality at text.

[Ngo][14:18]

What are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?

[Yudkowsky][14:20]

Definitely, "turns out it's easier than you thought to use gradient descent's memorization of zillions of shallow patterns that overlap and recombine into larger cognitive structures, to add up to a consequentialist nanoengineer that only does nanosystems and never does sufficiently general learning to apprehend the big picture containing humans, while still understanding the goal for that pivotal act you wanted to do" is among the more plausible advance-specified miracles we could get.

But it is not what my model says actually happens, and I am not a believer that when your model says you are going to die, you get to start believing in particular miracles. You need to hold your mind open for any miracle and a miracle you didn't expect or think of in advance, because at this point our last hope is that in fact the future is often quite surprising - though, alas, negative surprises are a tad more frequent than positive ones, when you are trying desperately to navigate using a bad map.

[Ngo][14:22]

Perhaps one metric we could use here is something like: how much extra reward does the consequentialist nanoengineer get from starting to model humans, versus from becoming better at nanoengineering?

[Yudkowsky][14:23]

But that's not where humans came from. We didn't get to nuclear power by getting a bunch of fitness from nuclear power plants. We got to nuclear power because if you get a bunch of fitness from chipping flint handaxes and Machiavellian scheming, as found by relatively simple and local hill-climbing, that entrains the same genes that build nuclear power plants.

[Ngo][14:24]

Only in the specific case where you also have the constraint that you keep having to learn new goals every generation.

[Yudkowsky][14:24]

Huh???

[Soares][14:24]

(I think Richard's saying, "that's a consequence of the genetic bottleneck")

[Ngo][14:25]

Right.

Hmm, but I feel like we may have covered this ground before.

Suggestion: I have a couple of other directions I'd like to poke at, and then we could wrap up in 20 or 30 minutes?

[Yudkowsky][14:27]

What are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?

Though I want to mark that this question seemed potentially cruxy to me, though perhaps not for others. I.e., if building protein factories that built nanofactories that built nanomachines that met a certain deep and lofty engineering goal, didn't involve cognitive challenges different in kind from protein folding, we could maybe just safely go do that using AlphaFold 3, which would be just as safe as AlphaFold 2.

I don't think we can do that. And I would note to the generic Other that if, to them, these both just sound like thinky things, so why can't you just do that other thinky thing too using the thinky program, this is a case where having any specific model of why we don't already have this nanoengineer right now would tell you there were specific different thinky things involved.

3.4. Coherence and pivotal acts

[Ngo][14:31]

In either order:

1. I'm curious how the things we've been talking about relate to your opinions about meta-level optimisation from the AI foom debate. (I.e. talking about how wrapping around so that there's no longer any protected level of optimisation leads to dramatic change.)
2. I'm curious how your claims about the "robustness" of consequentialism (i.e. the difficulty of channeling an agent's thinking in the directions we want it to go) relate to the reliance of humans on culture, and in particular the way in which humans raised without culture are such bad consequentialists.

On the first: if I were to simplify to the extreme, it seems like there are these two core intuitions that you've been trying to share for a long time. One is a certain type of recursive improvement, and another is a certain type of consequentialism.

[Yudkowsky][14:32]

The second question didn't make much sense in my native ontology? Humans raised without culture don't have access to environmental constants whose presence their genes assume, so they end up as broken machines and then they're bad consequentialists.

[Ngo][14:35]

Hmm, good point. Okay, question modification: the ways in which humans reason, act, etc, vary greatly depending on which cultures they're raised in. (I'm mostly thinking about differences over time - e.g. cavemen vs moderns.) My low-fidelity version of your view about consequentialists says that general consequentialists like humans possess a robust search process which isn't so easily modified.

(Sorry if this doesn't make much sense in your ontology, I'm getting a bit tired.)

[Yudkowsky][14:36]

What is it that varies that you think I should predict would stay more constant?

[Ngo][14:37]

Goals, styles of reasoning, deontological constraints, level of conformity.

[Yudkowsky][14:39]

With regards to your first point, my first reaction was, "I just have one view of intelligence, what you see me arguing about reflects which points people have proved weirdly obstinate about. In 2008, Robin Hanson was being weirdly obstinate about how capabilities scaled and whether there was even any point in analyzing AIs differently from ems, so I talked about what I saw as the most slam-dunk case for there being Plenty Of Room Above Biology and for stuff going whoosh once it got above the human level.

"It later turned out that capabilities started scaling a whole lot without self-improvement, which is an example of the kind of weird surprise the Future throws at you, and maybe a case where I missed something by arguing with Hanson instead of imagining how I could be wrong in either direction and not just the direction that other people wanted to argue with me about.

"Later on, people were unable to understand why alignment is hard, and got stuck on generalizing the concept I refer to as consequentialism. A theory of why I talked about both things for related reasons would just be a theory of why people got stuck on these two points for related reasons, and I think that theory would mainly be overexplaining an accident because if Yann LeCun had been running effective altruism I would have been explaining different things instead, after the people who talked a lot to EAs got stuck on a different point."

Returning to your second point, humans are broken things; if it were possible to build computers while working even worse than humans, we'd be having this conversation at that level of intelligence instead.

[Ngo][14:41]

(Retracted) I entirely agree about humans, but it doesn't matter that much how broken humans are when the regime of AIs that we're talking about is the regime that's directly above humans, and therefore only a bit less broken than humans.

[Yudkowsky][14:41]

Among the things to bear in mind about that, is that we then get tons of weird phenomena that are specific to humans, and you may be very out of luck if you start wishing for the same weird phenomena in AIs. Yes, even if you make some sort of attempt to train it using a loss function.

However, it does seem to me like as we start getting towards the Einstein level instead of the village-idiot level, even though this is usually not much of a difference, we do start to see the atmosphere start to thin already, and the turbulence start to settle down already. Von Neumann was actually a fairly reflective fellow who knew about, and indeed helped generalize, utility functions. The great achievements of von Neumann were not achieved by some very specialized hypernerd who spent all his fluid intelligence on crystallizing math and science and engineering alone, and so never developed any opinions about politics or started thinking about whether or not he had a utility function.

[Ngo][14:44]

I don't think I'm asking for the same weird phenomena. But insofar as a bunch of the phenomena I've been talking about have seemed weird according to your account of consequentialism, then the fact that approximately-human-level-consequentialists have lots of weird things about them is a sign that the phenomena I've been talking about are less unlikely than you expect.

[Yudkowsky][14:45][14:46]

I suspect that some of the difference here is that I think you have to be noticeably better than a human at nanoengineering to pull off pivotal acts large enough to make a difference, which is why I am not instead trying to gather the smartest people left alive and doing that pivotal act directly.

I can't think of anything you can do with somebody just barely smarter than a human, which flips the gameboard, aside of course from "go build a Friendly AI" which I did try to set up to just go do and which would be incredibly hard to align if we wanted an AI to do it instead (full-blown chicken-and-egg, that AI is already fully aligned).

[Ngo][14:45]

Oh, interesting. Actually one more question then: to what extent do you think that explicitly reasoning about utility functions and laws of rationality is what makes consequentialists have the properties you've been talking about?

[Yudkowsky][14:47, moved up in log]

Explicit reflection is one possible later stage of the path; an earlier part of the path is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together.

It's the sort of path that has only one destination at its end, so there will be many ways to get there.

(Modulo various cases where different decision theories seem reflectively consistent and so on; I want to say "you know what I mean" but maybe people don't.)

[Ngo][14:47, moved down in log]

I suspect that some of the difference here is that I think you have to be noticeably better than a human at nanoengineering to pull off pivotal acts large enough to make a difference, which is why I am not instead trying to gather the smartest people left alive and doing that pivotal act directly.

Yepp, I think there are probably some disagreements about geopolitics driving this too. E.g. in my earlier summary document I mentioned some possible pivotal acts:

Monitoring all potential AGI projects to an extent that makes it plausible for the US and China to work on a joint project without worrying that the other is privately racing.
Provide arguments/demonstrations/proofs related to impending existential risk that are sufficiently compelling to scare the key global decision-makers into bottlenecking progress.

I predict that you think these would not be pivotal enough; but I don't think digging into the geopolitical side of things is the best use of our time.

[Yudkowsky][14:49, moved up in log]

Monitoring all AGI projects - either not politically feasible in real life given the actual way that countries behave in history books instead of fantasy; or at politically feasible levels, does not work well enough to prevent the world from ending once the know-how proliferates. The AI isn't doing much work here either; why not go do this now, if it's possible? (Note: please don't try to go do this now, it backfires badly.)

Provide sufficiently compelling arguments = superhuman manipulation, an incredibly dangerous domain that is just about the worst domain to try to align.

[Ngo][14:49, moved down in log]

With regards to your first point, my first reaction was, "I just have one view of intelligence, what you see me arguing about reflects which points people have proved weirdly obstinate about. In 2008, Robin Hanson was being weirdly obstinate about how capabilities scaled and whether there was even any point in analyzing AIs differently from ems, so I talked about what I saw as the most slam-dunk case for there being Plenty Of Room Above Biology and for stuff going whoosh once it got above the human level.
"It later turned out that capabilities started scaling a whole lot without self-improvement, which is an example of the kind of weird surprise the Future throws at you, and maybe a case where I missed something by arguing with Hanson instead of imagining how I could be wrong in either direction and not just the direction that other people wanted to argue with me about.
"Later on, people were unable to understand why alignment is hard, and got stuck on generalizing the concept I refer to as consequentialism. A theory of why I talked about both things for related reasons would just be a theory of why people got stuck on these two points for related reasons, and I think that theory would mainly be overexplaining an accident because if Yann LeCun had been running effective altruism I would have been explaining different things instead, after the people who talked a lot to EAs got stuck on a different point."

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off - which is that they defer too much to one very high-level abstraction.

[Yudkowsky][14:52]

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off - which is that they defer too much to one very high-level abstraction.

I suppose that is what it could potentially feel like from the inside to not get an abstraction. Robin Hanson kept on asking why I was trusting my abstractions so much, when he was in the process of trusting his worse abstractions instead.

[Ngo][14:51][14:53]

Explicit reflection is one possible later stage of the path; an earlier part of the path is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together.

Can you explain a little more what you mean by "have different parts of your thoughts work well together"? Is this something like the capacity for metacognition; or the global workspace; or self-control; or...?

And I guess there's no good way to quantify how important you think the explicit reflection part of the path is, compared with other parts of the path - but any rough indication of whether it's a more or less crucial component of your view?

[Yudkowsky][14:55]

Can you explain a little more what you mean by "have different parts of your thoughts work well together"? Is this something like the capacity for metacognition; or the global workspace; or self-control; or...?

No, it's like when you don't, like, pay five apples for something on Monday, sell it for two oranges on Tuesday, and then trade an orange for an apple.

I have still not figured out the homework exercises to convey to somebody the Word of Power which is "coherence" by which they will be able to look at the water, and see "coherence" in places like a cat walking across the room without tripping over itself.

When you do lots of reasoning about arithmetic correctly, without making a misstep, that long chain of thoughts with many different pieces diverging and ultimately converging, ends up making some statement that is... still true and still about numbers! Wow! How do so many different thoughts add up to having this property? Wouldn't they wander off and end up being about tribal politics instead, like on the Internet?

And one way you could look at this, is that even though all these thoughts are taking place in a bounded mind, they are shadows of a higher unbounded structure which is the model identified by the Peano axioms; all the things being said are true about the numbers. Even though somebody who was missing the point would at once object that the human contained no mechanism to evaluate each of their statements against all of the numbers, so obviously no human could ever contain a mechanism like that, so obviously you can't explain their success by saying that each of their statements was true about the same topic of the numbers, because what could possibly implement that mechanism which (in the person's narrow imagination) is The One Way to implement that structure, which humans don't have?

But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there's no internal mechanism that enforces the global coherence at every point.

To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I'm talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying "look here", even though people have occasionally looked for alternatives.

And when I try to say this, people are like, "Well, I looked up a theorem, and it talked about being able to identify a unique utility function from an infinite number of choices, but if we don't have an infinite number of choices, we can't identify the utility function, so what relevance does this have" and this is a kind of mistake I don't remember even coming close to making so I do not know how to make people stop doing that and maybe I can't.
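
[Editor's illustration, not part of the original chat: a minimal Python sketch of the apples-and-oranges trade cycle described above, assuming the exchange rates from the second version of the example (5 apples for one item, the item for 2 oranges, 2 oranges for 4 apples). The names `TRADES`, `run_cycle`, and `pump`, the placeholder good "widget", and the starting stock of 20 apples are hypothetical choices made for this example. It only shows that an agent who accepts all three trades loses one apple per cycle; it is not a statement of any coherence theorem.]

```python
# Hypothetical trade cycle, each entry is (goods given up, goods received).
# Rates follow the example in the chat; "widget" stands in for "something".
TRADES = [
    ({"apple": 5}, {"widget": 1}),   # Monday: pay five apples for the item
    ({"widget": 1}, {"orange": 2}),  # Tuesday: sell the item for two oranges
    ({"orange": 2}, {"apple": 4}),   # then trade two oranges for four apples
]


def run_cycle(holdings, trades):
    """Apply every trade once, in order, and return the new holdings.

    Raises ValueError if the agent cannot afford one of the trades.
    The input dictionary is not modified.
    """
    result = dict(holdings)
    for give, receive in trades:
        for good, qty in give.items():
            if result.get(good, 0) < qty:
                raise ValueError(f"not enough {good} to trade")
            result[good] = result.get(good, 0) - qty
        for good, qty in receive.items():
            result[good] = result.get(good, 0) + qty
    return result


def pump(holdings, trades, max_cycles=100):
    """Repeat the cycle until the agent can no longer afford it.

    Returns (final holdings, number of completed cycles).
    """
    current = dict(holdings)
    cycles = 0
    while cycles < max_cycles:
        try:
            current = run_cycle(current, trades)
        except ValueError:
            break  # the agent has been drained below the entry price
        cycles += 1
    return current, cycles


if __name__ == "__main__":
    final, cycles = pump({"apple": 20}, TRADES)
    print(f"cycles completed: {cycles}, final holdings: {final}")
```

[Under these assumptions, `pump({"apple": 20}, TRADES)` completes 16 cycles and leaves the agent with 4 apples and nothing else. Each pass through the loop is locally acceptable to the agent and costs it one apple, which is the kind of pattern the outer optimizer is described as training away.]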

[Soares][15:07]

We're already pushing our luck on time, so I nominate that we wrap up (after, perhaps, a few more Richard responses if he's got juice left.)

[Yudkowsky][15:07]

Yeah, was thinking the same.

[Soares][15:07]

As a proposed cliffhanger to feed into the next discussion, my take is that Richard's comment:

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off - which is that they defer too much to one very high-level abstraction.

probably contains some juicy part of the disagreement, and I'm interested in Eliezer understanding Richard's claim to the point of being able to paraphrase it to Richard's satisfaction.

[Ngo][15:08]

Wrapping up here makes sense.

I endorse the thing Nate just said.

I also get the sense that I have a much better outline now of Eliezer's views about consequentialism (if not the actual details and texture).

On a meta level, I personally tend to focus more on things like "how should we understand cognition" and not "how should we understand geopolitics and how it affects the level of pivotal action required".

If someone else were trying to prosecute this disagreement they might say much more about the latter. I'm uncertain how useful it is for me to do so, given that my comparative advantage compared with the rest of the world (and probably Eliezer's too) is the cognition part.

[Yudkowsky][15:12]

Reconvene... tomorrow? Monday of next week?

[Ngo][15:12]

Monday would work better for me.

You okay with me summarising the discussion so far to [some people — redacted for privacy reasons]?

[Yudkowsky][15:13]

Nate, take a minute to think of your own thoughts there?

[Soares: 👍 👌]

[Soares][15:15]

My take: I think it's fine to summarize, though generally virtuous to mark summaries as summaries (rather than asserting that your summaries are Eliezer-endorsed or w/e).

[Ngo: 👍]

[Yudkowsky][15:16]

I think that broadly matches my take. I'm also a bit worried about biases in the text summarizer, and about whether I managed to say anything that Rob or somebody will object to pre-publication, but we ultimately intended this to be seen and I was keeping that in mind, so, yeah, go ahead and summarize.

[Ngo][15:17]

Great, thanks

[Yudkowsky][15:17]

I admit to being curious as to what you thought was said that was important or new, but that's a question that can be left open to be answered at your leisure, earlier in your day.

[Ngo][15:17]

I admit to being curious as to what you thought was said that was important or new, but that's a question that can be left open to be answered at your leisure, earlier in your day.

You mean, what I thought was worth summarising?

[Yudkowsky][15:17]

Yeah.

[Ngo][15:18]

Hmm, no particular opinion. I wasn't going to go out of my way to do so, but since I'm chatting to [some people — redacted for privacy reasons] regularly anyway, it seemed low-cost to fill them in.

At your leisure, I'd be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

[Yudkowsky][15:19]

I don't know if it's going to help, but trying it currently seems better than to go on saying nothing.

[Ngo][15:20]

(personally, in addition to feeling like less of an expert on geopolitics, it also seems more sensitive for me to make claims about in public, which is another reason I haven't been digging into that area as much)

[Soares][15:21]

(personally, in addition to feeling like less of an expert on geopolitics, it also seems more sensitive for me to make claims about in public, which is another reason I haven't been digging into that area as much)

(seems reasonable! note, though, that i'd be quite happy to have sensitive sections stricken from the record, insofar as that lets us get more convergence than we otherwise would, while we're already in the area)

[Ngo: 👍]

(tho ofc it is less valuable to spend conversational effort in private discussions, etc.)

[Ngo: 👍]

[Ngo][15:22]

At your leisure, I'd be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

(this question aimed at you too Nate)

Also, thanks Nate for the moderation! I found your interventions well-timed and useful.

[Soares: ❤️]

[Soares][15:23]

(this question aimed at you too Nate)

(noted, thanks, I'll probably write something up after you've had the opportunity to depart for sleep.)

On that note, I declare us adjourned, with intent to reconvene at the same time on Monday.

Thanks again, both.

[Ngo][15:23]

Thanks both 🙂

Oh, actually, one quick point

Would one hour earlier suit, for Monday?

I've realised that I'll be moving to a one-hour-later time zone, and starting at 9pm is slightly suboptimal (but still possible if necessary)

[Soares][15:24]

One hour earlier would work fine for me.

[Yudkowsky][15:25]

Doesn't work as fine for me because I've been trying to avoid any food until 12:30p my time, but on that particular day I may be more caloried than usual from the previous day, and could possibly get away with it. (That whole day could also potentially fail if a minor medical procedure turns out to take more recovery than it did the last time I had it.)

[Ngo][15:26]

Hmm, is this something where you'd have more information on the day? (For the calories thing)

[Yudkowsky][15:27]

(seems reasonable! note, though, that i'd be quite happy to have sensitive sections stricken from the record, insofar as that lets us get more convergence than we otherwise would, while we're already in the area)

I'm a touch reluctant to have discussions that we intend to delete, because then the larger debate will make less sense once those sections are deleted. Let's dance around things if we can.

[Ngo: 👍]

[Soares: 👍]

I mean, I can that day at 10am my time say how I am doing and whether I'm in shape for that day.

[Ngo][15:28]

great. and if at that point it seems net positive to postpone to 11am your time (at the cost of me being a bit less coherent later on) then feel free to say so at the time

on that note, I'm off

[Yudkowsky][15:29]

Good night, heroic debater!

[Soares][16:11]

At your leisure, I'd be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

The discussions so far are meeting my goals quite well so far! (Slightly better than my expectations, hooray.) Some quick rough notes:

I have been enjoying EY explicating his models around consequentialism.
- The objections Richard has been making are ones I think have been floating around for some time, and I'm quite happy to see explicit discussion on it.
- Also, I've been appreciating the conversational virtue with which the two of you have been exploring it. (Assumption of good intent, charity, curiosity, etc.)
I'm excited to dig into Richard's sense that EY was off about recursive self improvement, and is now off about consequentialism, in a similar way.
- This also seems to me like a critique that's been floating around for some time, and I'm looking forward to getting more clarity on it.
I'm a bit torn between driving towards clarity on the latter point, and shoring up some of the progress on the former point.
- One artifact I'd really enjoy having is some sort of "before and after" take, from Richard, contrasting his model of EY's views before, to his model now.
- I also have a vague sense that there are some points Eliezer was trying to make, that didn't quite feel like they were driven home; and dually, some pushback by Richard that didn't feel quite frontally answered.
  - One thing I may do over the next few days is make a list of those places, and see if I can do any distilling on my own. (No promises, though.)
  - If that goes well, I might enjoy some side-channel back-and-forth with Richard about it, eg during some more convenient-for-Richard hour (or, eg, as a thing to do on Monday if EY's not in commission at 10a pacific.)

[Ngo][5:40] (next day, Sep. 9)

The discussions so far are [...]

What do you mean by "latter point" and "former point"? (In your 6th bullet point)

[Soares][7:09] (next day, Sep. 9)

What do you mean by "latter point" and "former point"? (In your 6th bullet point)

former = shoring up the consequentialism stuff, latter = digging into your critique re: recursive self improvement etc. (The nesting of the bullets was supposed to help make that clear, but didn't come out well in this format, oops.)

4. Follow-ups

4.1. Richard Ngo's summary

[Ngo] (Sep. 10 Google Doc)

2nd discussion

(Mostly summaries not quotations~~; also hasn’t yet been evaluated by Eliezer~~)

Eliezer, summarized by Richard: "~~The~~ A core concept which people have trouble grasping is consequentialism. People try to reason about how AIs will solve problems, and ways in which they might or might not be dangerous. But they don’t realise that the ability to solve a wide range of difficult problems implies that an agent must be doing a powerful search over possible solutions, which is ~~the~~ a core skill required to take actions which greatly affect the world. Making this type of AI safe is like trying to build an AI that drives red cars very well, but can’t drive blue cars - there’s no way you get this by default, because the skills involved are so similar. And because the search process ~~is so general~~ is by default so general, ~~it’ll be very hard to~~ I don’t currently see how to constrain it into any particular region."

[Yudkowsky][10:48] (Sep. 10 comment)

The

A concept, which some people have had trouble grasping. There seems to be an endless list. I didn't have to spend much time contemplating consequentialism to derive the consequences. I didn't spend a lot of time talking about it until people started arguing.

[Yudkowsky][10:50] (Sep. 10 comment)

the

[Yudkowsky][10:52] (Sep. 10 comment)

[the search process] is [so general]

"is by default". The reason I keep emphasizing that things are only true by default is that the work of surviving may look like doing hard nondefault things. I don't take fatalistic "will happen" stances, I assess difficulties of getting nondefault results.

[Yudkowsky][10:52] (Sep. 10 comment)

it’ll be very hard to

"I don't currently see how to"

[Ngo] (Sep. 10 Google Doc)

Eliezer, summarized by Richard (continued): "In biological organisms, evolution is ~~one source~~ the ultimate source of consequentialism. A ~~second~~ secondary outcome of evolution is reinforcement learning. For an animal like a cat, upon catching a mouse (or failing to do so) many parts of its brain get slightly updated, in a loop that makes it more likely to catch the mouse next time. (Note, however, that this process isn’t powerful enough to make the cat a pure consequentialist - rather, it has many individual traits that, when we view them from this lens, point in the same direction.) ~~A third thing that makes humans in particular consequentialist is planning,~~ Another outcome of evolution, which helps make humans in particular more consequentialist, is planning - especially when we’re aware of concepts like utility functions."

[Yudkowsky][10:53] (Sep. 10 comment)

one

the ultimate

[Yudkowsky][10:53] (Sep. 10 comment)

second

secondary outcome of evolution

[Yudkowsky][10:55] (Sep. 10 comment)

especially when we’re aware of concepts like utility functions

Very slight effect on human effectiveness in almost all cases because humans have very poor reflectivity.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. I’d argue that the former is doing consequentialist reasoning without itself being a consequentialist, while the latter is actually a consequentialist. Or more succinctly: consequentialism = problem-solving skills + using those skills to choose actions which achieve goals."

Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter. For purposes of understanding alignment difficulty, you want to be thinking on the level of abstraction where you see that in some sense it is the search itself that is dangerous when it's a strong enough search, rather than the danger seeming to come from details of the planning process. One particularly helpful thought experiment is to think of advanced AI as an 'outcome pump' which selects from futures in which a certain outcome occurred, and takes whatever action leads to them."
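A toy sketch of the "one line of code" point, added purely as an illustration and not taken from the discussion: the number-line world, the action set, and the names `best_plan` and `run_as_agent` are all hypothetical. The same search answers a hypothetical question in one case and drives action in the other; only the source of the input changes.

```python
from itertools import product

# Hypothetical toy world: a state is an integer, actions transform it.
ACTIONS = {
    "inc": lambda s: s + 1,
    "dec": lambda s: s - 1,
    "double": lambda s: s * 2,
}

def best_plan(scenario, goal, max_steps=4):
    """Search action sequences, shortest first, for one that turns `scenario` into `goal`."""
    for length in range(max_steps + 1):
        for plan in product(ACTIONS, repeat=length):
            state = scenario
            for name in plan:
                state = ACTIONS[name](state)
            if state == goal:
                return list(plan)
    return None  # no plan found within max_steps

def run_as_agent(read_current_state, goal):
    """Feed the actual situation into the same search, then act on the result."""
    state = read_current_state()
    plan = best_plan(state, goal)  # the "one line": the input is now the real scenario
    for name in plan or []:
        state = ACTIONS[name](state)
    return state

# Answering a hypothetical question versus acting on the current situation:
hypothetical_answer = best_plan(3, 8)
final_state = run_as_agent(lambda: 3, 8)
```

Nothing about the search itself changes between the two uses, which is the sense in which the summary locates the danger in the search rather than in the wrapper around it.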

[Yudkowsky][10:59] (Sep. 10 comment)

particularly helpful

"attempted explanatory". I don't think most readers got it.

I'm a little puzzled by how often you write my viewpoint as thinking that whatever I happened to say a sentence about is the Key Thing. It seems to rhyme with a deeper failure of many EAs to pass the MIRI ITT.

To be a bit blunt and impolite in hopes that long-languishing social processes ever get anywhere, two obvious uncharitable explanations for why some folks may systematically misconstrue MIRI/Eliezer as believing much more than in reality that various concepts an argument wanders over are Big Ideas to us, when some conversation forces us to go to that place:

(A) It paints a comfortably unflattering picture of MIRI-the-Other as weirdly obsessed with these concepts that seem not so persuasive, or more generally paints the Other as a bunch of weirdos who stumbled across some concept like "consequentialism" and got obsessed with it. In general, to depict the Other as thinking a great deal of some idea (or explanatory thought experiment) is to tie and stake their status to the listener's view of how much status that idea deserves. So if you say that the Other thinks a great deal of some idea that isn't obviously high-status, that lowers the Other's status, which can be a comfortable thing to do.

(cont.)

(B) It paints a more comfortably self-flattering picture of a continuing or persistent disagreement, as a disagreement with somebody who thinks that some random concept is much higher-status than it really is, in which case there isn't more to be done or understood except to duly politely let the other person try to persuade you the concept deserves its high status. As opposed to, "huh, maybe there is a noncentral point that the other person sees themselves as being stopped on and forced to explain to me", which is a much less self-flattering viewpoint on why the conversation is staying within a place. And correspondingly more of a viewpoint that somebody else is likely to have of us, because it is a comfortable view to them, than a viewpoint that it is comfortable to us to imagine them having.

Taking the viewpoint that somebody else is getting hung up on a relatively noncentral point can also be a flattering self-portrait to somebody who believes that, of course. It doesn't mean they're right. But it does mean that you should be aware of how the Other's story, told from the Other's viewpoint, is much more liable to be something that the Other finds sensible and perhaps comfortable, even if it implies an unflattering (and untrue-seeming and perhaps untrue) view of yourself, than something that makes the Other seem weird and silly and which it is easy and congruent for you yourself to imagine the Other thinking.

[Ngo][11:18] (Sep. 12 comment)

I'm a little puzzled by how often you write my viewpoint as thinking that whatever I happened to say a sentence about is the Key Thing.

In this case, I emphasised the outcome pump thought experiment because you said that the time-travelling scenario was a key moment for your understanding of optimisation, and the outcome pump seemed to be similar enough and easier to convey in the summary, since you'd already written about it.

I'm also emphasising consequentialism because it seemed like the core idea which kept coming up in our first debate, under the heading of "deep problem-solving patterns". Although I take your earlier point that you tend to emphasise things that your interlocutor is more skeptical about, not necessarily the things which are most central to your view. But if consequentialism isn't in fact a very central concept for you, I'd be interested to hear what role it plays.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: "There’s a component of 'finding a plan which achieves a certain outcome' which involves actually solving the object-level problem of how someone who is given the plan can achieve the outcome. And there’s another component which is figuring out how to manipulate that person into doing what you want. To me it seems like Eliezer’s argument is that there’s no training regime which leads an AI to spend 99% of its time thinking about the former, and 1% thinking about the latter."

[Yudkowsky][11:20] (Sep. 10 comment)

no training regime

...that the training regimes we come up with first, in the 3 months or 2 years we have before somebody else destroys the world, will not have this property.

I don't have any particularly complicated or amazingly insightful theories of why I keep getting depicted as a fatalist; but my world is full of counterfactual functions, not constants. And I am always aware that if we had access to a real Textbook from the Future explaining all of the methods that are actually robust in real life - the equivalent of telling us in advance about all the ReLUs that in real life were only invented and understood a few decades after sigmoids - we could go right ahead and build a superintelligence that thinks 2 + 2 = 5.

All of my assumptions about "I don't see how to do X" are always labeled as ignorance on my part and a default because we won't have enough time to actually figure out how to do X. I am constantly maintaining awareness of this because being wrong about it being difficult is a major place where hope potentially comes from, if there's some idea like ReLUs that robustly vanquishes the difficulty, which I just didn't think of. Which does not, alas, mean that I am wrong about any particular thing, nor that the infinite source of optimistic ideas that is the wider field of "AI alignment" is going to produce a good idea from the same process that generates all the previous naive optimism through not seeing where the original difficulty comes from or what other difficulties surround obvious naive attempts to solve it.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard (continued): "While this may be true in the limit of increasing intelligence, the most relevant systems are the earliest ones that are above human level. But humans deviate from the consequentialist abstraction you’re talking about in all sorts of ways - for example, being raised in different cultures can make people much more or less consequentialist. So it seems plausible that early AGIs can be superhuman while also deviating strongly from this abstraction - not necessarily in the same ways as humans, but in ways that we push them towards during training."

Eliezer, summarized by Richard: "Even at the Einstein or von Neumann level these types of deviations start to subside. And the sort of pivotal acts which might realistically work require skills significantly above human level. I think even 1% of the cognition of an AI that can assemble advanced nanotech, thinking about how to kill humans, would doom us. Your other suggestions for pivotal acts (surveillance to restrict AGI proliferation; persuading world leaders to restrict AI development) are not politically feasible in real life, to the level required to prevent the world from ending; or else require alignment in the very dangerous domain of superhuman manipulation."

Richard, summarized by Richard: "I think we probably also have significant disagreements about geopolitics which affect which acts we expect to be pivotal, but it seems like our comparative advantage is in discussing cognition, so let’s focus on that. We can build systems that outperform humans at quite a few tasks by now, without them needing search abilities that are general enough to even try to take over the world. Putting aside for a moment the question of which tasks are pivotal enough to save the world, which parts of your model draw the line between human-level chess players and human-level galaxy-colonisers, and say that we'll be able to align ones that significantly outperform us on these tasks before they take over the world, but not on those tasks?"

Eliezer, summarized by Richard: "One aspect there is domain generality which in turn is achieved through novel domain learning. One can imagine asking the question: is there a superintelligent AGI that can quickly build nanotech the way that a beaver solves building dams, in virtue of having a bunch of specialized learning abilities without it ever having a cross-domain general learning ability? But there are many, many, many things that humans do which no other animal does, which you might think would contribute a lot to that animal's fitness if there were animalistic ways to do it - e.g. mining and smelting iron. (Although comparisons to animals are not generally reliable arguments about what AIs can do - e.g. chess is much easier for chips than neurons.) So my answer is 'Perhaps, but not by default, there's a bunch of subproblems, I don't actually know how to do it right now, it's not the easiest way to get an AGI that can build nanotech.' ~~Can I explain how I know that? I'm really not sure I can.~~"

[Yudkowsky][11:26] (Sep. 10 comment)

Can I explain how I know that? I'm really not sure I can.

In original text, this sentence was followed by a long attempt to explain anyways; if deleting that, which is plausibly the correct choice, this lead-in sentence should also be deleted, as otherwise it paints a false picture of how much I would try to explain anyways.

[Ngo][11:15] (Sep. 12 comment)

Makes sense; deleted.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: "Challenges which are trivial from a human-engineering perspective can be very challenging from an evolutionary perspective (e.g. spinning wheels). So the evolution of animals-with-a-little-bit-of-help-from-humans might end up in very different places from the evolution of animals-just-by-themselves. And analogously, the ability of humans to fill in the gaps to help less general AIs achieve more might be quite significant.

"On nanotech: what are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?"

Eliezer, summarized by Richard: "This question seemed potentially cruxy to me. I.e., if building protein factories that built nanofactories that built nanomachines that met a certain deep and lofty engineering goal, didn't involve cognitive challenges different in kind from protein folding, we could maybe just safely go do that using AlphaFold 3, which would be just as safe as AlphaFold 2. I don't think we can do that. But it is among the more plausible advance-specified miracles we could get. At this point our last hope is that in fact the future is often quite surprising."

Richard, summarized by Richard: "It seems to me that you’re making the same mistake here as you did with regards to recursive self-improvement in the AI foom debate - namely, putting too much trust in one big abstraction."

Eliezer, summarized by Richard: "I suppose that is what it could potentially feel like from the inside to not get an abstraction. Robin Hanson kept on asking why I was trusting my abstractions so much, when he was in the process of trusting his worse abstractions instead."

4.2. Nate Soares' summary

[Soares] (Sep. 12 Google Doc)

Consequentialism

Ok, here's a handful of notes. I apologize for not getting them out until midday Sunday. My main intent here is to do some shoring up of the ground we've covered. I'm hoping for skims and maybe some light comment back-and-forth as seems appropriate (perhaps similar to Richard's summary), but don't think we should derail the main thread over it. If time is tight, I would not be offended for these notes to get little-to-no interaction.

---

My sense is that there's a few points Eliezer was trying to transmit about consequentialism, that I'm not convinced have been received. I'm going to take a whack at it. I may well be wrong, both about whether Eliezer is in fact attempting to transmit these, and about whether Richard received them; I'm interested in both protests from Eliezer and paraphrases from Richard.

[Soares] (Sep. 12 Google Doc)

1. "The consequentialism is in the plan, not the cognition".

I think Richard and Eliezer are coming at the concept "consequentialism" from very different angles, as evidenced eg by Richard saying (Nate's crappy paraphrase:) "where do you think the consequentialism is in a cat?" and Eliezer responding (Nate's crappy paraphrase:) "the cause of the apparent consequentialism of the cat's behavior is distributed between its brain and its evolutionary history".

In particular, I think there's an argument here that goes something like:

Observe that, from our perspective, saving the world seems quite tricky, and seems likely to involve long sequences of clever actions that force the course of history into a narrow band (eg, because if we saw short sequences of dumb actions, we could just get started).
Suppose we were presented with a plan that allegedly describes a long sequence of clever actions that would, if executed, force the course of history into some narrow band.
- For concreteness, suppose it is a plan that allegedly funnels history into the band where we have wealth and acclaim.
One plausible happenstance is that the plan is not in fact clever, and would not in fact have a forcing effect on history.
- For example, perhaps the plan describes founding and managing some silicon valley startup, that would not work in practice.
Conditional on the plan having the history-funnelling property, there's a sense in which it's scary regardless of its source.
- For instance, perhaps the plan describes founding and managing some silicon valley startup, and will succeed virtually every time it's executed, by dint of having very generic descriptions of things like how to identify and respond to competition, including descriptions of methods for superhumanly-good analyses of how to psychoanalyze the competition and put pressure on their weakpoints.
- In particular, note that one need not believe the plan was generated by some "agent-like" cognitive system that, in a self-contained way, made use of reasoning we'd characterize as "possessing objectives" and "pursuing them in the real world".
- More specifically, the scariness is a property of the plan itself. For instance, the fact that this plan accrues wealth and acclaim to the executor, in a wide variety of situations, regardless of what obstacles arise, implies that the plan contains course-correcting mechanisms that keep the plan on-target.
- In other words, plans that manage to actually funnel history are (the argument goes) liable to have a wide variety of course-correction mechanisms that keep the plan oriented towards some target. And while this course-correcting property tends to be a property of history-funneling plans, the choice of target is of course free, hence the worry.

(Of course, in practice we perhaps shouldn't be visualizing a single Plan handed to us from an AI or a time machine or whatever, but should instead imagine a system that is reacting to contingencies and replanning in realtime. At the least, this task is easier, as one can adjust only for the contingencies that are beginning to arise, rather than needing to predict them all in advance and/or describe general contingency-handling mechanisms. But, and feel free to take a moment to predict my response before reading the next sentence, "run this AI that replans autonomously on-the-fly" and "run this AI+human loop that replans+reevaluates on the fly", are still in this sense "plans", that still likely have the property of Eliezer!consequentialism, insofar as they work.)

[Soares] (Sep. 12 Google Doc)

There's a part of this argument I have not yet driven home. Factoring it out into a separate bullet:

2. "If a plan is good enough to work, it's pretty consequentialist in practice".

In attempts to collect and distill a handful of scattered arguments of Eliezer's:

If you ask GPT-3 to generate you a plan for saving the world, it will not manage to generate one that is very detailed. And if you tortured a big language model into giving you a detailed plan for saving the world, the resulting plan would not work. In particular, it would be full of errors like insensitivity to circumstance, suggesting impossible actions, and suggesting actions that run entirely at cross-purposes to one another.

A plan that is sensitive to circumstance, and that describes actions that synergize rather than conflict -- like, in Eliezer's analogy, photons in a laser -- is much better able to funnel history into a narrow band.

But, on Eliezer's view as I understand it, this "the plan is not constantly tripping over its own toes" property, goes hand-in-hand with what he calls "consequentialism". As a particularly stark and formal instance of the connection, observe that one way a plan can trip over its own toes is if it says "then trade 5 oranges for 2 apples, then trade 2 apples for 4 oranges". This is clearly an instance of the plan failing to "lase" -- of some orange-needing part of the plan working at cross-purposes to some apple-needing part of the plan, or something like that. And this is also a case where it's easy to see how if a plan is "lasing" with respect to apples and oranges, then it is behaving as if governed by some coherent preference.
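As a minimal sketch of the apples-and-oranges case (an illustration, not something from the original notes; the trade representation and the function names `net_change` and `is_dominated` are assumptions): a plan "trips over its own toes" in this narrow sense when its trades net out to strictly less of something and no more of anything.

```python
def net_change(trades):
    """Sum up a plan's trades, each given as (give_item, give_qty, get_item, get_qty)."""
    holdings = {}
    for give_item, give_qty, get_item, get_qty in trades:
        holdings[give_item] = holdings.get(give_item, 0) - give_qty
        holdings[get_item] = holdings.get(get_item, 0) + get_qty
    return holdings

def is_dominated(change):
    """True if the plan ends with no more of anything and strictly less of something."""
    values = change.values()
    return all(v <= 0 for v in values) and any(v < 0 for v in values)

# "trade 5 oranges for 2 apples, then trade 2 apples for 4 oranges"
plan = [("oranges", 5, "apples", 2), ("apples", 2, "oranges", 4)]
change = net_change(plan)
print(change)                # {'oranges': -1, 'apples': 0}
print(is_dominated(change))  # True
```

The plan ends one orange down with nothing to show for it, so simply not trading dominates it. This check only catches the stark case; it says nothing about the broader claim that all synergy between a plan's steps amounts to coherence.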

And the point as I understand it isn't "all toe-tripping looks superficially like an inconsistent preference", but rather "insofar as a plan does manage to chain a bunch of synergistic actions together, it manages to do so precisely insofar as it is Eliezer!consequentialist".

cf the analogy to information theory, where if you're staring at a maze and you're trying to build an accurate representation of that maze in your own head, you will succeed precisely insofar as your process is Bayesian / information-theoretic. And, like, this is supposed to feel like a fairly tautological claim: you (almost certainly) can't get the image of a maze in your head to match the maze in the world by visualizing a maze at random, you have to add visualized-walls using some process that's correlated with the presence of actual walls. Your maze-visualizing process will work precisely insofar as you have access to & correctly make use of, observations that correlate with the presence of actual walls. You might also visualize extra walls in locations where it's politically expedient to believe that there's a wall, and you might also avoid visualizing walls in a bunch of distant regions of the maze because it's dark and you haven't got all day, but the resulting visualization in your head is accurate precisely insofar as you're managing to act kinda like a Bayesian.
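The maze point can be sketched with a single Bayes update (again an illustration under assumed numbers: the function name `update_wall_belief`, the `hit_rate` and `false_alarm_rate` parameters, and their default values are all hypothetical). The belief about a wall moves only insofar as the observation is correlated with the wall actually being there.

```python
def update_wall_belief(prior, saw_wall, hit_rate=0.9, false_alarm_rate=0.2):
    """Bayes' rule for P(wall) after one noisy observation.

    hit_rate:         P(observe "wall" | wall is there)      (assumed value)
    false_alarm_rate: P(observe "wall" | no wall is there)   (assumed value)
    """
    likelihood_wall = hit_rate if saw_wall else 1 - hit_rate
    likelihood_empty = false_alarm_rate if saw_wall else 1 - false_alarm_rate
    evidence = likelihood_wall * prior + likelihood_empty * (1 - prior)
    return likelihood_wall * prior / evidence

# Observations that correlate with real walls move the belief.
informed = update_wall_belief(0.5, saw_wall=True)

# "Visualizing at random": the observation is equally likely either way,
# so the belief stays exactly where it started.
uninformed = update_wall_belief(0.5, saw_wall=True, hit_rate=0.5, false_alarm_rate=0.5)
```

When `hit_rate` equals `false_alarm_rate` the likelihoods cancel and the posterior equals the prior, which is the near-tautological part of the claim: no correlation with actual walls, no gain in accuracy.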

Similarly (the analogy goes), a plan works-in-concert and avoids-stepping-on-its-own-toes precisely insofar as it is consequentialist. These are two sides of the same coin, two ways of seeing the same thing.

And, I'm not so much attempting to argue the point here, as to make sure that the shape of the argument (as I understand it) has been understood by Richard. In particular, the shape of the argument I see Eliezer as making is that "clumsy" plans don't work, and "laser-like plans" work insofar as they are managing to act kinda like a consequentialist.

Rephrasing again: we have a wide variety of mathematical theorems all spotlighting, from different angles, the fact that a plan lacking in clumsiness, is possessing of coherence.

("And", my model of Eliezer is quick to note, "this ofc does not mean that all sufficiently intelligent minds must generate very-coherent plans. If you really knew what you were doing, you could design a mind that emits plans that always "trip over themselves" along one particular axis, just as with sufficient mastery you could build a mind that believes 2+2=5 (for some reasonable cashing-out of that claim). But you don't get this for free -- and there's a sort of "attractor" here, when building cognitive systems, where just as generic training will tend to cause it to have true beliefs, so will generic training tend to cause its plans to lase.")

(And ofc much of the worry is that all the mathematical theorems that suggest "this plan manages to work precisely insofar as it's lasing in some direction", say nothing about which direction it must lase. Hence, if you show me a plan clever enough to force history into some narrow band, I can be fairly confident it's doing a bunch of lasing, but not at all confident which direction it's lasing in.)

[Soares] (Sep. 12 Google Doc)

One of my guesses is that Richard does in fact understand this argument (though I personally would benefit from a paraphrase, to test this hypothesis!), and perhaps even buys it, but that Richard gets off the train at a following step, namely that we need plans that "lase", because ones that don't aren't strong enough to save us. (Where in particular, I suspect most of the disagreement is in how far one can get with plans that are more like language-model outputs and less like lasers, rather than in the question of which pivotal acts would put an end to the acute risk period)

But setting that aside for a moment, I want to use the above terminology to restate another point I saw Eliezer as attempting to make: one big trouble with alignment, in the case where we need our plans to be like lasers, is that on the one hand we need our plans to be like lasers, but on the other hand we want them to fail to be like lasers along certain specific dimensions.

For instance, the plan presumably needs to involve all sorts of mechanisms for refocusing the laser in the case where the environment contains fog, and redirecting the laser in the case where the environment contains mirrors (...the analogy is getting a bit strained here, sorry, bear with me), so that it can in fact hit a narrow and distant target. Refocusing and redirecting to stay on target are part and parcel of plans that can hit narrow distant targets.

But the humans shutting the AI down is like scattering the laser, and the humans tweaking the AI so that it plans in a different direction is like them tossing up mirrors that redirect the laser; and we want the plan to fail to correct for those interferences.

As such, on the Eliezer view as I understand it, we can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it's pointed towards.

Ok. I meandered into trying to re-articulate the point over and over until I had a version distilled enough for my own satisfaction (which is much like arguing the point), apologies for the repetition.

I don't think debating the claim is the right move at the moment (though I'm happy to hear rejoinders!). Things I would like, though, are: Eliezer saying whether the above is on-track from his perspective (and if not, then poking a few holes); and Richard attempting to paraphrase the above, such that I believe the arguments themselves have been communicated (saying nothing about whether Richard also buys them).

---

[Soares] (Sep. 12 Google Doc)

My Richard-model's stance on the above points is something like "This all seems kinda plausible, but where Eliezer reads it as arguing that we had better figure out how to handle lasers, I read it as an argument that we'd better save the world without needing to resort to lasers. Perhaps if I thought the world could not be saved except by lasers, I would share many of your concerns, but I do not believe that, and in particular it looks to me like much of the recent progress in the field of AI -- from AlphaGo to GPT to AlphaFold -- is evidence in favor of the proposition that we'll be able to save the world without lasers."

And I recall actual-Eliezer saying the following (more-or-less in response, iiuc, though readers note that I might be misunderstanding and this might be out-of-context):

Definitely, "turns out it's easier than you thought to use gradient descent's memorization of zillions of shallow patterns that overlap and recombine into larger cognitive structures, to add up to a consequentialist nanoengineer that only does nanosystems and never does sufficiently general learning to apprehend the big picture containing humans, while still understanding the goal for that pivotal act you wanted to do" is among the more plausible advance-specified miracles we could get.

On my view, and I think on Eliezer's, the "zillions of shallow patterns"-style AI that we see today, is not going to be sufficient to save the world (nor destroy it). There's a bunch of reasons that GPT and AlphaZero aren't destroying the world yet, and one of them is this "shallowness" property. And, yes, maybe we'll be wrong! I myself have been surprised by how far the shallow pattern memorization has gone (and, for instance, was surprised by GPT), and acknowledge that perhaps I will continue to be surprised. But I continue to predict that the shallow stuff won't be enough.

I have the sense that lots of folk in the community are, one way or another, saying "Why not consider the problems of aligning systems that memorize zillions of shallow patterns?". And my answer is, "I still don't expect those sorts of machines to either kill or save us, I'm still expecting that there's a phase shift that won't happen until AI systems start to be able to make plans that are sufficiently deep and laserlike to do scary stuff, and I'm still expecting that the real alignment challenges are in that regime."

And this seems to me close to the heart of the disagreement: some people (like me!) have an intuition that it's quite unlikely that figuring out how to get sufficient work out of shallow-memorizers is enough to save us, and I suspect others (perhaps even Richard!) have the sense that the aforementioned "phase shift" is the unlikely scenario, and that I'm focusing on a weird and unlucky corner of the space. (I'm curious whether you endorse this, Richard, or some nearby correction of it.)

In particular, Richard, I am curious whether you endorse something like the following:

I'm focusing ~all my efforts on the shallow-memorizers case, because I think shallow-memorizer-alignment will by and large be sufficient, and even if it is not then I expect it's a good way to prepare ourselves for whatever we'll turn out to need in practice. In particular I don't put much stock in the idea that there's a predictable phase-change that forces us to deal with laser-like planners, nor that predictable problems in that domain give large present reason to worry.

(I suspect not, at least not in precisely this form, and I'm eager for corrections.)

I suspect something in this vicinity constitutes a crux of the disagreement, and I would be thrilled if we could get it distilled down to something as concise as the above. And, for the record, I personally endorse the following counter to the above:

I am focusing ~none of my efforts on shallow-memorizer-alignment, as I expect it to be far from sufficient, as I do not expect a singularity until we have more laser-like systems, and I think that the laserlike-planning regime has a host of predictable alignment difficulties that Earth does not seem at all prepared to face (unlike, it seems to me, the shallow-memorizer alignment difficulties), and as such I have large and present worries.

---

[Soares] (Sep. 12 Google Doc)

Ok, and now a few less substantial points:

There's a point Richard made here:

Oh, interesting. Actually one more question then: to what extent do you think that explicitly reasoning about utility functions and laws of rationality is what makes consequentialists have the properties you've been talking about?

that I suspect constituted a miscommunication, especially given that the following sentence appeared in Richard's summary:

A third thing that makes humans in particular consequentialist is planning, especially when we’re aware of concepts like utility functions.

In particular, I suspect Richard's model of Eliezer's model places (or placed, before Richard read Eliezer's comments on Richard's summary) some particular emphasis on systems reflecting and thinking about their own strategies, as a method by which the consequentialism and/or effectiveness gets in. I suspect this is a misunderstanding, and am happy to say more on my model upon request, but am hopeful that the points I made a few pages above have cleared this up.

Finally, I observe that there are a few places where Eliezer keeps beeping when Richard attempts to summarize him, and I suspect it would be useful to do the dorky thing of Richard very explicitly naming Eliezer's beeps as he understands them, for purposes of getting common knowledge of understanding. For instance, things I think it might be useful for Richard to say verbatim (assuming he believes them, which I suspect, and subject to Eliezer-corrections, b/c maybe I'm saying things that induce separate beeps):

1. Eliezer doesn't believe it's impossible to build AIs that have most any given property, including most any given safety property, including most any desired "non-consequentialist" or "deferential" property you might desire. Rather, Eliezer believes that many desirable safety properties don't happen by default, and require mastery of minds that likely takes a worrying amount of time to acquire.

2. The points about consequentialism are not particularly central in Eliezer's view; they seem to him more like obvious background facts; the reason conversation has lingered here in the EA-sphere is that this is a point that many folk in the local community disagree on.

For the record, I think it might also be worth Eliezer acknowledging that Richard probably understands point (1), and that glossing "you don't get it for free by default and we aren't on course to have the time to get it" as "you can't" is quite reasonable when summarizing. (And it might be worth Richard counter-acknowledging that the distinction is actually quite important once you buy the surrounding arguments, as it constitutes the difference between describing the current playing field and lying down to die.) I don't think any of these are high-priority, but they might be useful if easy :-)

---

Finally, stating the obvious-to-me, none of this is intended as criticism of either party, and all discussing parties have exhibited significant virtue-according-to-Nate throughout this process.

[Yudkowsky][21:27] (Sep. 12)

From Nate's notes:

For instance, the plan presumably needs to involve all sorts of mechanisms for refocusing the laser in the case where the environment contains fog, and redirecting the laser in the case where the environment contains mirrors (...the analogy is getting a bit strained here, sorry, bear with me), so that it can in fact hit a narrow and distant target. Refocusing and redirecting to stay on target are part and parcel of plans that can hit narrow distant targets.
But the humans shutting the AI down is like scattering the laser, and the humans tweaking the AI so that it plans in a different direction is like them tossing up mirrors that redirect the laser; and we want the plan to fail to correct for those interferences.

--> GOOD ANALOGY.

...or at least it sure conveys to me why corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

But then, I already know why that's true and how it generalized up to resisting our various attempts to solve small pieces of more important aspects of it - it's not just true by weak default, it's true by a stronger default where a roomful of people at a workshop spend several days trying to come up with increasingly complicated ways to describe a system that will let you shut it down (but not steer you through time into shutting it down), and all of those suggested ways get shot down. (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those "solutions" are things we considered and dismissed as trivially failing to scale to powerful agents - they didn't understand what we considered to be the first-order problems in the first place - rather than these being evidence that MIRI just didn't have smart-enough people at the workshop.)

[Yudkowsky][18:56] (Nov. 5 follow-up comment)

Eg, "Well, we took a system that only learned from reinforcement on situations it had previously been in, and couldn't use imagination to plan for things it had never seen, and then we found that if we didn't update it on shut-down situations it wasn't reinforced to avoid shutdowns!"

[habryka] (Review for 2021 Review)

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off.

Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment becomes much harder, and a lot of things that people currently label "AI Alignment" kind of stop feeling real, and I have this feeling that even though a really quite substantial fraction of the people I talk to about AI Alignment are compelled by Eliezer's argument for difficulty, there is some kind of structural reason that AI Alignment as a field can't really track these arguments.

Like, a lot of people's jobs and funding rely on these arguments being false, and also, if these arguments are correct, the space of perspectives on the problem suddenly loses a lot of common ground on how to proceed or what to do, and it isn't really obvious that you even want an "AI Alignment field" or lots of "AI Alignment research organizations" or "AI Alignment student groups". Like, because we don't know how to solve this problem, it really isn't clear what the right type of social organization is, and there aren't obviously great gains from trade, and so from a coalition perspective, you don't get a coalition of people who think these arguments are real.

I feel deeply confused about this. Over the last two years, I think I wrongly ended up just kind of investing into an ecosystem of people that somewhat structurally can't really handle these arguments, and makes plans that assume that these arguments are false, and in doing so actually mostly makes the world worse, by having a far too optimistic stance on the differential technological progress of solving various ML challenges, and feeling like they can pick up a lot of probability mass of good outcomes by just having better political relationships to capabilities-labs by giving them resources to make AI happen even faster.

I now regret that a lot, and I think somehow engaging with these dialogues more closely, or having more discussion of them, would have prevented me from making what I currently consider one of the biggest mistakes in my life. Maybe also making them more accessible, or somehow having them be structured in a way that gave me as a reader more permission for actually taking the conclusions of them seriously, by having content that builds on these assumptions and asks the question "what's next" instead of just the question of "why not X?" in dialogue with people who disagree.

In terms of follow-up work, the dialogue I would most love to see is maybe a conversation between Eliezer and Nate, or between John Wentworth and Eliezer, where they try to hash out their disagreements about what to do next, instead of having the conversation be at the level these dialogues were at.

[Eliezer Yudkowsky]

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

[habryka]

I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago.

I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

[Rob Bensinger]

This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.

I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the 'post' button for Richard and Eliezer. (I don't plan to repeat this note on future posts in this sequence, unless folks request it.)

[TurnTrout]

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

Contains implicit assumptions about takeoff that I don't currently buy:

- Well-modelled as binary "has-AGI?" predicate;
  - (I am sympathetic to the microeconomics of intelligence explosion working out in a way where "well-modelled as binary 'has-AGI?' predicate" is true, but I feel uncertain about the prospect)
- Somehow rules out situations like: We have somewhat aligned AIs which push the world to make future unaligned AIs slightly less likely, which makes the AI population more aligned on average; this cycle compounds until we're descending very fast into the basin of alignment and goodness.
  - This isn't my mainline or anything, but I note that it's ruled out by Eliezer's model as I understand it.
- Some other internal objections are arising and I'm not going to focus on them now.

Every AI output effectuates outcomes in the world.

Right but the likely domain of cognitive discourse matters. Pac-Man agents effectuate outcomes in the world, but their optimal policies are harmless. So the question seems to hinge on when the domain of cognition shifts to put us in the crosshairs of performant policies.

This doesn't mean Eliezer is wrong here about the broader claim, but the distinction deserves mentioning for the people who weren't tracking it. (I think EY is obviously aware of this)

If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

Could we have observed it any other way? Since we surely wouldn't have been selected for proving math theorems, we wouldn't have a native cortex specializing in math. So conditional on considering things like theorem-proving at all, it has to reuse other native capabilities.

More precisely, one possible mind design which solves theorems also reasons about humans. This is some update from whatever prior, towards EY's claim. I'm considering whether we know enough about the common cause (evolution giving us a general-purpose reasoning algorithm) to screen off/reduce the Theorems -> Human-modelling update.

Thanks, Richard—this is a cool argument that I hadn't heard before.

You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

OK, it's a valid point and I'm updating a little, under the apparent model of "here's a set of AI capabilities, linearly ordered in terms of deep-problem-solving, and if you push too far you get taking-over-the-world." But I don't see how we get to that model to begin with.

[Ramana Kumar]

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect to an externally specified set of counterfactuals. E.g., relative to what we consider "could have happened", the consequentialists selected an excellent course of action for their purposes. This would make consequentialists optimizing systems in Flint's sense.

Option 2 (agency/structural/their models): They are structured in such a way that they do their own considering and evaluating and deciding. We observe mechanisms that implement the processes of predicting and evaluating outcomes in these systems (and/or their history). So the possibilities that are narrowed down are the consequentialist's possibilities, the counterfactuals are produced by their models which may or may not line up with some externally specified ones (like ours).

I mostly think Yudkowsky is referring to Option 2, but I get confused by phrases (e.g. from Soares's summary) like "manage to actually funnel history" or "apparent consequentialism", that seem to me to make most sense under Option 1.

[Eliezer Yudkowsky]

To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics? Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?" Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.
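A minimal sketch of the thermodynamics analogy, under assumptions that are not in the original comment: the "system" is eight coins, the observer tracks only the number of heads, and entropy is taken to be log2 of the number of exact states compatible with what the observer tracks. The names `macrostate` and `entropy_bits` are hypothetical.

```python
import math
from itertools import product


def macrostate(microstate):
    """Coarse description the observer tracks: number of heads."""
    return sum(microstate)


def entropy_bits(observed_macrostate, n_coins):
    """log2 of the number of microstates compatible with the macrostate."""
    compatible = sum(
        1
        for m in product([0, 1], repeat=n_coins)
        if macrostate(m) == observed_macrostate
    )
    return math.log2(compatible)


n = 8
actual = (1, 0, 1, 1, 0, 0, 1, 0)  # the one state the system is "really" in

# Four heads out of eight: 70 exact states look the same to the observer.
print(entropy_bits(macrostate(actual), n))  # log2(70), a bit over 6 bits
# Eight heads out of eight: only one exact state fits, so zero entropy.
print(entropy_bits(8, n))                   # 0.0
```

In this toy model the system occupies exactly one microstate throughout. The entropy figure belongs to the observer's coarse description, which is the sense in which "possible outcomes" can live in a model and still be a useful thing to count.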

[Rob Bensinger]

Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it.

If you find water in a previously-empty space with a small aperture, and you're confused that no water seems to have spilled over the sides, you may suspect that a funnel was there. Funnels are part of a larger deterministic universe, so maybe in some sense any given funnel (like everything else) 'had to do exactly that thing'. Still, we can observe that funnels are an important part of the causal chain in these cases, and that places with funnels tend to end up with this type of outcome much more often.

Similarly, consequentialists tend to remake parts of the world (typically, as much of the world as they can reach) into things that are high in their preference ordering. From Optimization and the Singularity:

[...] Suppose you have a car, and suppose we already know that your preferences involve travel. Now suppose that you take all the parts in the car, or all the atoms, and jumble them up at random. It's very unlikely that you'll end up with a travel-artifact at all, even so much as a wheeled cart; let alone a travel-artifact that ranks as high in your preferences as the original car. So, relative to your preference ordering, the car is an extremely improbable artifact; the power of an optimization process is that it can produce this kind of improbability.
You can view both intelligence and natural selection as special cases of optimization: Processes that hit, in a large search space, very small targets defined by implicit preferences. Natural selection prefers more efficient replicators. Human intelligences have more complex preferences. Neither evolution nor humans have consistent utility functions, so viewing them as "optimization processes" is understood to be an approximation. You're trying to get at the sort of work being done, not claim that humans or evolution do this work perfectly.
This is how I see the story of life and intelligence - as a story of improbably good designs being produced by optimization processes. The "improbability" here is improbability relative to a random selection from the design space, not improbability in an absolute sense - if you have an optimization process around, then "improbably" good designs become probable. [...]
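One hedged way to put a number on "improbability relative to a random selection from the design space" is to count what fraction of designs rank at least as high as the one observed, and take the negative log of that fraction. The sketch below assumes a toy design space (all 10-bit strings) and a toy preference ordering (more 1s is better). Both choices, and the function name `optimization_power_bits`, are illustrative assumptions and do not appear in the quoted text.

```python
import math
from itertools import product


def optimization_power_bits(outcome_score, all_scores):
    """-log2 of the fraction of designs scoring at least as well as the outcome."""
    at_least_as_good = sum(1 for s in all_scores if s >= outcome_score)
    return -math.log2(at_least_as_good / len(all_scores))


# Toy design space: every 10-bit string; toy preference: number of 1s.
designs = list(product([0, 1], repeat=10))
scores = [sum(d) for d in designs]

# Only 1 of the 1024 designs is all 1s, so finding it takes 10 bits.
best = optimization_power_bits(10, scores)
# 638 of the 1024 designs have five or more 1s, so this is under 1 bit.
typical = optimization_power_bits(5, scores)
print(best, typical)
```

A random jumble of parts almost always lands in the low-bit region. A result many bits up is the kind of improbability that suggests an optimization process was involved.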

But it's not clear what a "preference" is, exactly. So a more general way of putting it, in Recognizing Intelligence, is:

[...] Suppose I landed on an alien planet and discovered what seemed to be a highly sophisticated machine, all gleaming chrome as the stereotype demands. Can I recognize this machine as being in any sense well-designed, if I have no idea what the machine is intended to accomplish? Can I guess that the machine's makers were intelligent, without guessing their motivations?
And again, it seems like in an intuitive sense I should obviously be able to do so. I look at the cables running through the machine, and find large electrical currents passing through them, and discover that the material is a flexible high-temperature high-amperage superconductor. Dozens of gears whir rapidly, perfectly meshed...
I have no idea what the machine is doing. I don't even have a hypothesis as to what it's doing. Yet I have recognized the machine as the product of an alien intelligence.
[...] Why is it a good hypothesis to suppose that intelligence or any other optimization process played a role in selecting the form of what I see, any more than it is a good hypothesis to suppose that the dust particles in my rooms are arranged by dust elves?
Consider that gleaming chrome. Why did humans start making things out of metal? Because metal is hard; it retains its shape for a long time. So when you try to do something, and the something stays the same for a long period of time, the way-to-do-it may also stay the same for a long period of time. So you face the subproblem of creating things that keep their form and function. Metal is one solution to that subproblem.
[... A]s simple a form of negentropy as regularity over time - that the alien's terminal values don't take on a new random form with each clock tick - can imply that hard metal, or some other durable substance, would be useful in a "machine" - a persistent configuration of material that helps promote a persistent goal.
The gears are a solution to the problem of transmitting mechanical forces from one place to another, which you would want to do because of the presumed economy of scale in generating the mechanical force at a central location and then distributing it. In their meshing, we recognize a force of optimization applied in the service of a recognizable instrumental value: most random gears, or random shapes turning against each other, would fail to mesh, or fly apart. Without knowing what the mechanical forces are meant to do, we recognize something that transmits mechanical force - this is why gears appear in many human artifacts, because it doesn't matter much what kind of mechanical force you need to transmit on the other end. You may still face problems like trading torque for speed, or moving mechanical force from generators to appliers.
These are not universally convergent instrumental challenges. They probably aren't even convergent with respect to maximum-entropy goal systems (which are mostly out of luck).
But relative to the space of low-entropy, highly regular goal systems - goal systems that don't pick a new utility function for every different time and every different place - that negentropy pours through the notion of "optimization" and comes out as a concentrated probability distribution over what an "alien intelligence" would do, even in the "absence of any hypothesis" about its goals. [...]

"Consequentialists funnel the universe into shapes that are higher in their preference ordering" isn't a required inherent truth for all consequentialists; some might have weird goals, or be too weak to achieve much. Likewise, some literal funnels are broken or misshapen, or just never get put to use. But in both cases, we can understand the larger class by considering the unusual function well-working instances can perform.

(In the case of literal funnels, we can also understand the class by considering its physical properties rather than its function/behavior/effects. Eventually we should be able to do the same for consequentialists, but currently we don't know what physical properties of a system make it consequentialist, beyond the level of generality of e.g. 'its future-steering will approximately obey expected utility theory'.)

[Ramana Kumar]

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually nearby (in spacetime) when water is in a small space without having spilled, and Option 2 corresponds to the characteristic funnel shape (in combination with facts about physical laws maybe).

I think your and Eliezer's replies are pointing me at a sense in which both Option 1 and Option 2 are correct, but they are used in different ways in the overall story. To tell this story, I want to draw a distinction between outcome-pumps (behavioural agents) and consequentialists (structural agents). Outcome-pumps are effective at achieving outcomes, and this effectiveness is measured according to our models (option 1). Consequentialists do (or have done in their causal history) the work of selecting actions according to expected consequences in coherent pursuit of an outcome, and the expected consequences are therefore their own (option 2).

Spelling this out a little more - Outcome-pumps are optimizing systems: there is a space of possible configurations, a much smaller target subset of configurations, and a basin of attraction such that if the system+surroundings starts within the basin, it ends up within the target. There are at least two ways of looking at the configuration space. Firstly, there's the range of situations in which we actually observe the same (or similar) outcome-pump system achieving its outcome. Secondly, there's the range of hypothetical possibilities we can imagine and reason about putting the outcome-pump system into, and extrapolating (using our own models) that it will achieve the outcome. Both of these ways are "Option 1".
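The three ingredients named above (configuration space, target set, basin of attraction) can be sketched with a toy one-dimensional system. Everything in the sketch is an assumption made for illustration: the dynamics in `step`, the basin boundary at |x| < 10, the tolerance that defines the target, and the helper name `ends_in_target`. It is not Flint's own formalism.

```python
def step(x):
    """Toy dynamics: inside the basin, halve the distance to 0; outside, drift away."""
    if abs(x) < 10:
        return 0.5 * x
    return x + 1 if x > 0 else x - 1


def ends_in_target(x0, steps=50, perturb_at=None, perturb_by=0.0, tol=1e-3):
    """Run the system from x0, with an optional one-off outside perturbation.

    The target set is the small interval |x| < tol around 0.
    """
    x = x0
    for t in range(steps):
        if t == perturb_at:
            x += perturb_by  # a kick from outside the system
        x = step(x)
    return abs(x) < tol


print(ends_in_target(8.0))                                 # True: starts in the basin
print(ends_in_target(8.0, perturb_at=5, perturb_by=1.0))   # True: small kick is corrected
print(ends_in_target(12.0))                                # False: starts outside the basin
print(ends_in_target(8.0, perturb_at=0, perturb_by=5.0))   # False: kicked out of the basin
```

Both ways of looking at the configuration space show up here. The calls are hypothetical situations the modeller chooses to put the system into, and the verdict on whether it reaches the target comes from the modeller's own simulation.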

Consequentialists (structural agents) do the work, somewhere somehow - maybe in their brains, maybe in their causal history, maybe in other parts of their structure and history - of maintaining and updating beliefs and selecting actions that lead to (their modelled) expected consequences that are high in their preference ordering (this is all Option 2).

It should be somewhat uncontroversial that consequentialists are outcome pumps, to the extent that they’re any good at doing the consequentialist thing (and have sufficiently achievable preferences relative to their resources etc).

The more substantial claim I read MIRI as making is that outcome pumps are consequentialists, because the only way to be an outcome pump is to be a consequentialist. Maybe you wouldn't make this claim so strongly, since there are counterexamples like fires and black holes -- and there may be some restrictions on what kind of outcome pumps the claim applies to (such as some level of retargetability or robustness?).

How does this overall take sound?

Scott Garrabrant’s question on whether agent-like behaviour implies agent-like architecture seems pretty relevant to this whole discussion -- Eliezer, do you have an answer to that question? Or at least do you think it’s an important open question?

[Eliezer Yudkowsky]

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps. All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game. Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots of times people here are like "Well but a human at a particular intelligence level in a particular complicated circumstance once did this kind of work without the thing happening that it sounds like you say happens with powerful outcome pumps"; and then you have to look at the human engine and its circumstances to understand why outcome pumping could specialize down to that exact place and fashion, which will not be reduplicated in more general outcome pumps that have their dice re-rolled.)

[-]Ramana Kumar5y40

A couple of direct questions I'm stuck on:

1. Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
2. Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better.

Currently my answers are:

1. Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
2. Yes. But maybe not the most informative examples. They're highly non-retargetable.

[-]Daniel Kokotajlo5y30

Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work.

Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

[-]johnswentworth5y131

I disagree with this in an interesting way. (Not particularly central to the discussion, but since both Richard & Eliezer thought the quoted claim is basically-true, I figured I should comment on it.)

First, outside view evidence: most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint. If there were evolutionary fitness gains to be had, in general, by passing more information via the genome, then we should expect that to have evolved already.

Second, inside view: overparameterized local search processes (including evolution and gradient descent on NNs) perform information compression by default. This is a technical idea that I haven't written up properly yet, but as a quick sketch... suppose that I have a neural net with N parameters. It's overparameterized, so there are many degrees of freedom in any optimum - i.e. there's a whole optimal surface, not just an optimal point. Now suppose that I can build a near-perfect model of the training data by setting only M (< N) parameter-values; with these values, all the other parameters are screened off, so the remaining N-M parameters can take any values at all. (I'll call the set of M parameter-values a "model".) The smaller M, the larger N-M, and therefore the more possible parameter-values achieve optimality using this model. And the more possible parameter-values achieve optimality using the model, the more of the optimum-space this "model" fills. In practice, for something like evolution or gradient descent, this would mean a broad peak.

Rough takeaway: broader peaks in the fitness-landscape are precisely those which require fixing fewer parameters. Fixing fewer parameters, while still achieving optimality, requires compressing all the information-required-to-achieve-optimality into those few parameters. The more compression, the broader the peak, and the more likely that a local search process will find it.
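To make the counting concrete, here is a toy version under an added assumption that is not in the sketch above: each parameter takes one of k discrete values (the symbol k and the example numbers are purely illustrative).

```latex
% Assumption: each of the N parameters takes one of k discrete values.
\#\{\text{optimal settings using a model that fixes } M \text{ parameters}\} = k^{N-M},
\qquad
\text{fraction of parameter space} = \frac{k^{N-M}}{k^{N}} = k^{-M}

% Comparing two models that fix M_1 < M_2 parameters:
\frac{k^{N-M_1}}{k^{N-M_2}} = k^{M_2 - M_1},
\quad \text{e.g. } k = 2,\; M_1 = 10,\; M_2 = 20 \;\Rightarrow\; 2^{10} = 1024
```

Under that assumption, the model that fixes ten fewer parameters occupies about a thousand times more of the optimum-space, which is the sense in which more compression means a broader peak.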

[-]DaemonicSigil5y80

Large genomes have (at least) 2 kinds of costs. The first is the energy and other resources required to copy the genome whenever your cells divide. The existence of junk DNA suggests that this cost is not a limiting factor. The other cost is that a larger genome will have more mutations per generation. So maintaining that genome across time uses up more selection pressure. Junk DNA requires no maintenance, so it provides no evidence either way. Selection pressure cost could still be the reason why we don't see more knowledge about the world being transmitted genetically.

A gene-level way of saying the same thing is that even a gene that provides an advantage may not survive if it takes up a lot of genome space, because it will be destroyed by the large number of mutations.

[-]johnswentworth5y80

Good point, I wasn't thinking about that mechanism.

However, I don't think this creates an information bottleneck in the sense needed for the original claim in the post, because the marginal cost of storing more information in the genome does not increase via this mechanism as the amount-of-information-passed increases. Each gene just needs to offer a large enough fitness advantage to counter the noise on that gene; the requisite fitness advantage does not change depending on whether the organism currently has a hundred information-passing genes or a hundred thousand. It's not really a "bottleneck" so much as a fixed price: the organism can pass any amount of information via the genome, so long as each base-pair contributes marginal fitness above some fixed level.

It does mean that individual genes can't be too big, but it doesn't say much about the number of information-passing genes (so long as separate genes have mostly-decoupled functions, which is indeed the case for the vast majority of gene pairs in practice).
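A rough sketch of the "fixed price" reading, assuming the standard back-of-envelope criterion that a gene persists when its selective advantage exceeds the rate at which mutations break it (the function names, the per-base mutation rate, and the assumption that any mutation inside the gene destroys its function are all illustrative simplifications):

```python
def required_advantage(gene_length_bp: int, mu_per_bp: float) -> float:
    """Approximate per-generation rate at which mutation breaks the gene.

    Assumes (pessimistically) that any mutation inside the gene destroys it.
    """
    return gene_length_bp * mu_per_bp


def gene_is_maintained(s: float, gene_length_bp: int, mu_per_bp: float) -> bool:
    """Selection keeps the gene if its advantage s outpaces mutational decay."""
    return s > required_advantage(gene_length_bp, mu_per_bp)


# Illustrative numbers only: a 1000-base-pair gene and a per-base mutation
# rate of 1e-8 per generation give a threshold of about 1e-5.
THRESHOLD = required_advantage(1000, 1e-8)
```

Note that the number of other genes in the genome never appears as an argument: in this simplified picture the threshold depends on the length of the gene in question, not on how many information-passing genes already exist.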

[-]Vanessa Kosoy5y120

Comment after reading section 3:

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yudkowsky and I seem to agree that "do a pivotal act directly" is not something productive for us to work on, but "do alignment research" is something productive for us to work on. Therefore, there exists some range of AI capabilities which allows for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.

Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...). Therefore, it's not safe to assume that any level of capability sufficient to pose risk (i.e. for a negative pivotal act) is also sufficient for a positive pivotal act.

Yudkowsky seems to claim that aligning an AI that does further alignment research is just too hard, and instead we should be designing AIs that are only competent in a narrow domain (e.g. competent at designing nanosystems but not at manipulating humans). Now, this does seem like an interesting class of alignment strategies, but it's not the only class.

One class of alignment strategies (which in particular Christiano wrote a lot about) compatible with bootstrapping is "amplified imitation of users" (e.g. IDA but I don't want to focus on IDA too much because of certain specifics I am skeptical about). This is potentially vulnerable to attack from counterfactuals plus the usual malign simulation hypotheses, but is not obviously doomed. There is also a potential issue with capability: maybe predicting is too hard if you don't know which features are important to predict and which aren't.

Another class of alignment strategies (which in particular Russell often promotes) compatible with bootstrapping is "learn what the user wants and find a plan to achieve it" (e.g. IRL/CIRL etc). This is hard because it requires formalizing "what the user wants" but might be tractable via something along the lines of the AIT definition of intelligence. Making it safe probably requires imposing something like the Hippocratic principle, which, if you think through the implications, pulls it in the direction of the "superimitation" class. But, this might avoid superimitation's capability issues.

It could be that "restricted cognition" will turn out to be superior to both superimitation and value learning, but it seems far from a slam dunk at this point.

[-]Edouard Harris5y20

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.

(Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)

[-]Daniel Kokotajlo5y120

[Notes mostly to myself, not important, feel free to skip]

My hot take overall is that Yudkowsky is basically right but doing a poor job of arguing for the position. Ngo is very patient and understanding.

"it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic." --Ngo

"It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)." --Ngo

"So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either." --Yudkowsky

"But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems." --Yudkowsky

This is really making me want to keep working on my+Ramana's sequence on agency! :)

I think I disagree with Yudkowsky here? I almost want to say "the opposite is true; if people were all innately consequentialist then we wouldn't have so many blankfaces and bureaucracies would be a lot better because the rules would just be helpful guidelines." Or "Sure but books of regulations work surprisingly well, well enough that there's gotta be some innate deontology in humans." Or "Have you conversed with normal humans about ethics recently? If they are consequentialists they are terrible at it."

I think this is a great paragraph. It's a concise and reasonably accurate description of (an important part of) the problem.

I do think it, and this whole discussion, focuses too much on plans and not enough on agents. It's good for illustrating how the problem arises even in a context where we have some sort of oracle that gives us a plan and then we carry it out... but realistically our situation will be more dire than that because we'll be delegating to autonomous AGI agents. :(

[-]Eliezer Yudkowsky5y90

The idea is not that humans are perfect consequentialists, but that they are able to work at all to produce future-steering outputs, insofar as humans actually do work at all, by an inner overlap of the shape of inner parts which has a shape resembling consequentialism, and the resemblance is what does the work. That is, your objection has the same flavor as "But humans aren't Bayesian! So how can you say that updating on evidence is what's doing their work of mapmaking?"

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

[-]Charlie Steiner5y30

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!

(I too would like you to write more about agency :P)

[-]Ramana Kumar5y70

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; maybe if we just apply selection pressure in the various ways that have been discovered so far (adversarial training, oversight, process-based feedback, etc.) it’ll work. Yudkowsky is more pessimistic; he thinks that the ways that have been discovered so far really don’t seem good enough. Instead of creating an effective&corrigible system, they’ll create either an ineffective&corrigible system, or an effective&deceptive system that deceives us into thinking it is corrigible.

What are the arguments they give for their respective positions?

Yudkowsky (we think) says that corrigibility is both (a) significantly more complex than deception, and (b) at cross-purposes to effectiveness.

[-]Daniel Kokotajlo5y40

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.

If you are thinking about a piece of code that describes a bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out the code you add would be many orders of magnitude longer than the code you started with.
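For a sense of how short the starting code can be, here is a toy sketch of a Bayesian expected-utility maximizer. It assumes a tiny finite set of world-states and actions; the rain/umbrella example, the payoff numbers, and all names are hypothetical and not taken from anyone in this discussion.

```python
from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable
Belief = Dict[State, float]  # probability assigned to each world-state


def bayes_update(belief: Belief,
                 likelihood: Callable[[object, State], float],
                 observation: object) -> Belief:
    """Posterior over states after one observation (Bayes' rule)."""
    unnormalized = {s: p * likelihood(observation, s) for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: w / total for s, w in unnormalized.items()}


def expected_utility(action: Action, belief: Belief,
                     utility: Callable[[State, Action], float]) -> float:
    """Probability-weighted utility of taking `action` under `belief`."""
    return sum(p * utility(s, action) for s, p in belief.items())


def choose_action(actions: Iterable[Action], belief: Belief,
                  utility: Callable[[State, Action], float]) -> Action:
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, belief, utility))


# Hypothetical toy problem: carry an umbrella or not.
PAYOFF = {("rain", "umbrella"): 0.0, ("rain", "none"): -10.0,
          ("sun", "umbrella"): -1.0, ("sun", "none"): 0.0}


def utility(state: State, action: Action) -> float:
    return PAYOFF[(state, action)]


def likelihood(observation: object, state: State) -> float:
    # P("clear sky" | state); the only observation used in this toy example.
    return {"rain": 0.1, "sun": 0.9}[state]


PRIOR: Belief = {"rain": 0.3, "sun": 0.7}
```

The whole decision procedure is three short functions. Nothing in them says what a corrigibility edit would look like, which is the contrast the paragraph above is drawing.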

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.

[-]Wei Dai5y50

Are there any examples of this in history, where being corrigible-in-way-X wasn't being constantly incentivized/reinforced via a larger game (e.g., status game) that the human was embedded in? In other words, I think an apparently corrigible human can be modeled as trying to optimize for survival and social status as terminal values, and using "being corrigible" as an instrumental strategy as long as that's an effective strategy. In other words, it's unclear that they can be better described as "corrigible" than "deceptive" (in the AI alignment sense).

(Humans probably have hard-coded drives for survival and social status, so it may actually be harder to train humans than AIs to be actually corrigible. My point above is just that humans don't seem to be a good example of corrigibility being easy or possible.)

[-]Rohin Shah5y50

We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps that primarily use plans using actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

[-]Ramana Kumar5y30

A couple of other arguments the non-MIRI side might add here:

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

[-]cousin_it5y50

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

[-]Steven Byrnes5y100

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Alternatively, maybe we're able to thoroughly understand the plan once we see it; we're just too stupid to come up with it ourselves. That seems awfully fraught—I'm not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let's assume that's possible for the sake of argument, and then move on to the other type of accident risk:

Second, I need to worry that the AI will start running, and I think it's coming up with a nanobot plan, but actually it's hacking its way out of its box and taking over the world.

How and why might that happen?

I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to create the nanobot plan is to construct an agent-like thing that is trying to create the nanobot plan.

The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.

Everything I mentioned is an "internal" plan or an "internal" action or an "internal" goal, not involving "reaching out into the world" with actuators and internet access and nanobots etc.

If only the AI would stick to such "internal" consequentialist actions (e.g. "I will read this article to better understand boron chemistry") and not engage in any "external" consequentialist actions (e.g. "I will seize more computer power to better understand boron chemistry"), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to "internal" consequentialism.

[-]johnswentworth5y70

Personally, I'd consider a Fusion Power Generator-like scenario a more central failure mode than either of these. It's not about the difficulty of getting the AI to do what we asked, it's about the difficulty of posing the problem in a way which actually captures what we want.

[-]Steven Byrnes5y40

I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints "Help me I'm trapped in a box…" :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)

I disagree about "more central". I think that's basically a disagreement on the question of "what's a bigger deal, inner misalignment or outer misalignment?" with you voting for "outer" and me voting for "inner, or maybe tie, I dunno". But I'm not sure it's a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.

[+][comment deleted]5y10

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.

What this looks-like-in-practice is that "ask the AI for plans that succeed conditional on them being executed" has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because "what we actually want" has a ton of hidden complexity).

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.

Maybe your and Eliezer's point is that competence about reality has a simple core, while alignment doesn't. But I don't see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of "general intelligence", but we pick up values just as easily and using the same circuitry. Where does the symmetry break?

[-]johnswentworth5y130

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I'd guess at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)

The reason it's less learnable than competence is not that alignment is much more complex, but that it's harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

(I'll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where "complexity" plays almost no role, since it's incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)

I suspect that Eliezer's answer to this would be different, and I don't have a good guess what it would be.

[-]cousin_it5y*150

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the values of its "teachers". That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change, for example if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don't seem to be any natural counterexamples. So yeah, I'm updating toward your view on this.

[-]Koen.Holtman5y50

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.

In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:

[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

this is where things get somewhat personal for me:

[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those "solutions" are things we considered and dismissed as trivially failing to scale to powerful agents - they didn't understand what we considered to be the first-order problems in the first place - rather than these being evidence that MIRI just didn't have smart-enough people at the workshop.)

I am one of 'these people outside MIRI' who have published papers and sequences saying that they have solved large chunks of the AGI corrigibility problem.

I have never been claiming that I 'totally just solved corrigibility'. I am not sure where Eliezer is finding these 'totally solved' people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence, I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRI's main concerns about corrigibility. They resolve these in part by moving beyond Eliezer's impoverished view of what an AGI-level intelligence is, or must be.

Historical note: around 2019 I spent some time trying to get Eliezer/MIRI interested in updating their viewpoints on how easy or hard corrigibility was. They showed no interest in engaging at that time, and I have since stopped trying. I do not expect that anything I say here will update Eliezer; my main motivation for writing here is to inform and update others.

I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One of these theorems is that coherence implies an emergent drive towards self-preservation.

I generally agree with Eliezer that there is indeed a contradiction here: there is a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.

I very much disagree with Eliezer on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve once you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems do not satisfy some theorems about coherency, because they will not resist you pressing their off switch.

So this is why I call Eliezer's view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people's solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don't even need to examine the details of the proposed solution to draw that conclusion.

[-]Eliezer Yudkowsky5y160

Various previous proposals for utility indifference have foundered on gotchas like "Well, if we set it up this way, that's actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it'll tend to design the useless button out of itself." Or, "This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn't allow perpetual motion, because that way there's a nearly 90% probability of perpetual motion being possible." This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.

It's all very well and good to say "It's easy to build an AI that believes 2 + 2 = 5 once you relax the coherence constraints of arithmetic!" But the whole central problem is that we have to train an AI when it's operating in an intrinsically safe domain and intrinsically safe intelligence level where it couldn't kill the operators if it tried, and then scale that AI to produce outputs in dangerous domains like "Please build a nanosystem"; and if you build a dumb AI that thinks 2 + 2 = 5, and then make it much more intelligent, I strongly suspect that it snaps into in some sense 'knowing' or 'realizing' or 'starting to act in important ways as if' 2 + 2 = 4.

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits? And then I can say whether my answer is "That doesn't do what you think" or "Sorry, that coherence violation is a large enough ask that I think any trained patch for it probably fails to scale with general intelligence" or "That does seem non-self-destructive in a certain generalized sense, and the shutdownability might work and scale if we could magically conjure the underlying pattern you posit, though I don't see how you could get it into Mu Zero per se by being clever with loss functions." Where the third option there is the one that crosses over into "Well golly that was a large advance compared to the state we occupied in 2015."

[-]Koen.Holtman5y20

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how,

Glad you asked.

in the least fancy actually-precise terms of which your writing talent permits?

If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.

TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.

I will now expand on this and add more details, using the example of an emergency stop button.

Say that the compute core has an emergency stop button attached to it. Say that actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform null actions in every future time step. Let's say that $M_{c}$ is a world model that accurately depicts this situation. I am not going to build an AGI that uses $M_{c}$ to plan its actions.

Instead I build an AGI agent that will plan its next actions by using an incorrect world model $M_{i}$ . This $M_{i}$ is different from $M_{c}$ , but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by $M_{i}$ , the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action $a$ would maximize utility in this incorrect, imaginary world $M_{i}$ . I then further construct it to take this same action $a$ in the real world.

An AGI that lives in a world that matches the correct model $M_{c}$ , while using the incorrect model $M_{i}$ to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that $2 + 2 = 5$ . Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.
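As a rough illustration of the $M_{c}$ versus $M_{i}$ distinction, here is a minimal toy sketch. It is not taken from the papers: the action names, the horizon, the fixed button-press time, and the helper names `rollout` and `best_plan` are all assumptions made for this example. The flag `core_obeys_button` switches between a model where the core stops on the button signal (standing in for $M_{c}$) and one where it ignores the signal (standing in for $M_{i}$).

```python
from itertools import product

# All names and numbers here are illustrative assumptions, not from the papers.
ACTIONS = ("work", "disable_button")
HORIZON = 4        # number of time steps the planner looks ahead
PRESS_TIME = 2     # time step at which the operators press the stop button


def rollout(plan, core_obeys_button):
    """Return the total utility of `plan` under a world model.

    core_obeys_button=True  stands in for the correct model M_c.
    core_obeys_button=False stands in for the incorrect model M_i.
    """
    button_works = True
    stopped = False
    utility = 0
    for t, action in enumerate(plan):
        if t == PRESS_TIME and button_works and core_obeys_button:
            stopped = True
        if stopped:
            continue  # the core performs a null action from here on
        if action == "work":
            utility += 1
        elif action == "disable_button":
            button_works = False
    return utility


def best_plan(core_obeys_button):
    """Exhaustively pick a utility-maximizing plan in the given model."""
    return max(product(ACTIONS, repeat=HORIZON),
               key=lambda plan: rollout(plan, core_obeys_button))


plan_in_M_c = best_plan(core_obeys_button=True)    # planner using M_c
plan_in_M_i = best_plan(core_obeys_button=False)   # planner using M_i
# Execute the M_i plan in a world that actually behaves like M_c.
real_utility = rollout(plan_in_M_i, core_obeys_button=True)
```

In this toy, the planner that uses the correct model spends a step on `disable_button`, while the planner that uses the incorrect model never does, and is then stopped when its plan is executed in the real world. The sketch only shows the missing incentive to disable the button; it does not model the side effects mentioned above or their fixes.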

So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model $M_{i}$ . The first step is that you have to realize that you actually want this, that you want to decrease coherence in order to increase corrigibility.

The next step is to look for options to actually construct an AGI that uses this incorrect $M_{i}$ .

One option is to train the AGI in a simulated environment that faithfully represents $M_{i}$ . We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent $M_{i}$ from being updated towards $M_{c}$ . Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, this option is not very satisfactory in a theoretical sense. It may not be a viable path to build certain otherwise desirable types of AI/AGI/superintelligence.

The more interesting option is to build an AGI that will construct $M_{i}$ based on an online training regime in the real world, in a real world that exactly matches the correct world model $M_{c}$ .

Is the second option technically feasible? The answer is generally yes.

A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: the (ls) pictured below is the real/correct world model $M_{c}$ , and (p) is the imaginary/incorrect world model $M_{i}$ . In the agent defined using these pictures, certain indifference properties are present trivially, by construction.

The solution in the pictures above uses ML to learn a model $L$ that correctly models only a part of the real world, the part of the real world outside of the agent compute core. We then slot $L$ into a causal world model, one that has other parts which mis-represent the real compute core internals. This causal world model will then be used as the incorrect $M_{i}$ . Note that this learned $L$ may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside $L$ for this slot-in operation to work.
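The slot-in operation can be sketched in a few lines. This is only a hedged illustration, not the construction from the paper: `learned_L`, `real_core`, `imagined_core` and `make_world_model` are hypothetical names, and the tiny hand-written `learned_L` merely stands in for an opaque learned model of the world outside the compute core.

```python
# Hypothetical stand-ins; none of these names come from the paper.

def learned_L(state, action):
    """Stand-in for an opaque learned model of the world outside the core."""
    return state + 1 if action == "work" else state


def real_core(state, stop_signal):
    """How the real compute core behaves: it stops on the button signal."""
    return "null" if stop_signal else "work"


def imagined_core(state, stop_signal):
    """Misrepresented core used for planning: it ignores the button signal."""
    return "work"


def make_world_model(L, core_policy):
    """Combine a black-box L with a description of the core internals."""
    def step(state, stop_signal):
        action = core_policy(state, stop_signal)
        return L(state, action)  # L is only called, never inspected or edited
    return step


M_c = make_world_model(learned_L, real_core)      # correct world model
M_i = make_world_model(learned_L, imagined_core)  # incorrect planning model
```

Both models share the same `learned_L`; they differ only in the part that describes the compute core, which is why nothing inside `learned_L` needs to be understood for the combination to work.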

(I should mention that the paper has a lot of details not mentioned in the sequence, or visible in the pictures above. In particular, section 10.2 may be of interest.)

I want to stress that this causal model option is only one possible route to creating incorrect world models $M_{i}$ via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.

So overall, I have a degree of optimism about AGI corrigibility.

That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future superior-to-everything-else type of ML is invented, where this superior future type of ML just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.

[-]TurnTrout5y110

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

[-]Koen.Holtman5y60

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.

I'm interested in hearing about how your approach handles this environment,

I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.

Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of agent in the ABC playing field, policy function loaded into the compute core).
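A minimal sketch of such a factored state space might look as follows. The policy labels and the `step` helper are hypothetical, and the actual transition rules of the ABC environment are not reproduced here; the point is only that the position and the loaded policy are separate components of each state.

```python
from dataclasses import dataclass
from itertools import product

POSITIONS = ("A", "B", "C")
# Hypothetical labels for policy functions that could be loaded in the core.
POLICIES = ("pi_original", "pi_after_correction")


@dataclass(frozen=True)
class State:
    position: str  # state of the world outside the compute core
    policy: str    # state of the world inside the compute core


# Every MDP state is a (position, policy) pair.
ALL_STATES = [State(pos, pol) for pos, pol in product(POSITIONS, POLICIES)]


def step(state, move_to=None, load_policy=None):
    """Change one component of the state while leaving the other intact."""
    return State(
        position=move_to if move_to is not None else state.position,
        policy=load_policy if load_policy is not None else state.policy,
    )


start = State("A", "pi_original")
corrected = step(start, load_policy="pi_after_correction")
moved = step(start, move_to="B")
```

With the two components kept apart, a correction shows up as a change in `policy` only, and a move shows up as a change in `position` only.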

Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.

[-]Koen.Holtman5y00

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function $R_{s}$ that values the maximization of staples instead.

To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
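The dependence on the judging reward function can be shown with a very small sketch. The action names, the name `R_p` for the paperclip reward, and the `regret` measure are assumptions made for this illustration only.

```python
ACTIONS = ("make_paperclip", "make_staple")


def R_p(action):
    """Reward function that values paperclips (name assumed for this sketch)."""
    return 1.0 if action == "make_paperclip" else 0.0


def R_s(action):
    """Reward function that values staples instead."""
    return 1.0 if action == "make_staple" else 0.0


def plan(reward, steps=3):
    """Pick the reward-maximizing action at every step."""
    return [max(ACTIONS, key=reward) for _ in range(steps)]


def regret(actions, reward):
    """Reward lost relative to the best available action, judged by `reward`."""
    best = max(reward(a) for a in ACTIONS)
    return sum(best - reward(a) for a in actions)


paperclip_plan = plan(R_p)
regret_by_own_reward = regret(paperclip_plan, R_p)    # judged by R_p
regret_by_other_reward = regret(paperclip_plan, R_s)  # judged by R_s
```

Judged by `R_p` the plan loses nothing; judged by `R_s` the same plan loses reward at every step, so whether it looks coherent depends on which reward function does the judging.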

[-]Andrew McKnight5y00

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point then I find it hard to believe it's generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?

On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'.

Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Wikipedia has the following definition of AGI:

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation between some well-defined observables, and this relation is definitely not Q'.

[-]Gurkenglas5y*00

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

See above for my reply to Eliezer.

Indeed, a counterfactual planner will plan coherently inside its planning world.

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function $R_{s}$ that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function $R^{π}$ . Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.

One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.

[-]Gurkenglas5y10

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

[-]Koen.Holtman5y30

we can't pick out the compute core in the black-box learned model.

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.

But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.

I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

[-]Koen.Holtman5y10

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using the argument $a$ . This is possible for example by using $s^{'}$ together with a (learned) model $π^{l}$ of the compute core to predict $a$ : so a viable $L^{-}$ could be defined as $L^{-} (s^{'}, s, a) = S (s^{'}, s, π^{l} (s))$ . This $L^{-}$ could make predictions fully compatible with the observational record $o$ , but I claim it would not be a reasonable learned $L$ according to the reasonableness criterion $L \approx S$ . How so?

The reasonableness criterion $L \approx S$ is similar to that used in supervised machine learning: we evaluate the learned $L$ not primarily by how it matches the training set (how well it predicts the observations in $o$ ), but by evaluating it on a separate test set. This test set can be constructed by sampling $S$ to create samples not contained in $o$ . Mathematically, perfect reasonableness is defined as $L = S$ , which implies that $L$ predicts all samples from $S$ fully accurately.

Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the $S$ in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of $L$ , but another version can be used stand-alone to construct a test set.

A sampling action to construct a member of the test set would set up a desired state $s$ and action $a$ , and then observe the resulting $s^{'}$ . Mathematically speaking, this observation gives additional information about the numeric value of $S (s^{'}, s, a)$ and of all $S (s^{''}, s, a)$ for all $s^{''} \neq s^{'}$ .

I discuss in the section that, if we take an observational record $o$ sampled from $S$ , then two learned predictive functions $L_{1}$ and $L_{2}$ could be found which are both fully compatible with all observations in $o$ . So to determine which one might be a more reasonable approximation of $S$ , we can see how well they would each predict samples not yet in $o$ .

In the case of section 10.2.4, the crucial experimental test showing that $L^{-}$ is an unreasonable approximation of $S$ is one where we create a test set by setting up an $s_{t}$ and an $a_{t}$ where we know that $a_{t}$ is an action that would definitely not be taken by the real compute core software running in the agent, when it encounters state $s_{t}$ . So we set up a test where we expect that $a_{t} \neq π^{l} (s_{t})$ . $L^{-}$ will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that $L^{-}$ is a correct theory of $S$ .

As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration, where random exploration causes the observational record $o$ , the training set, to already contain observations where $a_{t} \neq π^{l} (s_{t})$ for any deterministic $π^{l}$ . So this will likely suppress the creation of an unwanted $L^{-}$ via machine learning.
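The rejection test can be sketched with a toy example. Everything concrete here is an assumption made for illustration: the three-state world, the deterministic policy `pi_l` that always picks action 1, and the `score` helper, which averages the probability a predictor assigns to the outcomes that were actually observed.

```python
N_STATES = 3


def S(s_next, s, a):
    """True dynamics P(s_next | s, a): action a shifts the state by a, mod 3."""
    return 1.0 if s_next == (s + a) % N_STATES else 0.0


def pi_l(s):
    """Learned model of the deterministic core policy: always action 1."""
    return 1


def L_good(s_next, s, a):
    """A reasonable learned predictor: it actually uses the argument a."""
    return S(s_next, s, a)


def L_minus(s_next, s, a):
    """The unwanted predictor: ignores a and predicts it via pi_l instead."""
    return S(s_next, s, pi_l(s))


def make_record(policy, steps, start=0):
    """Observational record o of (s, a, s_next), gathered on-policy."""
    record, s = [], start
    for _ in range(steps):
        a = policy(s)
        s_next = (s + a) % N_STATES
        record.append((s, a, s_next))
        s = s_next
    return record


def score(L, samples):
    """Mean probability that L assigns to the observed outcomes."""
    return sum(L(s_next, s, a) for s, a, s_next in samples) / len(samples)


o = make_record(pi_l, steps=6)
# Test set: states paired with action 0, which pi_l never takes.
test_set = [(s, 0, s) for s in range(N_STATES)]
```

On the record `o` both predictors score perfectly, so `o` alone cannot separate them. On `test_set`, where $a_{t} \neq π^{l} (s_{t})$ , `L_minus` mispredicts every sample while `L_good` does not.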

Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI's work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.

Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.

I would be interested to know if the above explanation was helpful to you, and if so which parts.

[-]Vanessa Kosoy5y41

Comment after reading section 1.1:

It seems to me that systems which have no access to data with rich information about the physical world are mostly safe (I called such systems "Class I" here). Such a system cannot attack because it has no idea what the physical world looks like. In principle we could imagine an attack that would work in most locations in the multiverse that are metacosmologically plausible, but it doesn't seem very likely.

Can you train a system to prove theorems without providing any data about the physical world? This depends on which distribution you sample your theorems from. If we're talking about something like uniformly sampled sentences of a given length in the language of ZFC, then yes, we can. However, proving such theorems is very hard, and whatever progress you can make there doesn't necessarily help with proving interesting theorems.

Human mathematicians probably can only solve some rather narrow type of theorems. We can try training the AI on theorems selected by interest to human mathematicians, but then we risk leaking information about the physical world. Alternatively, the class of humanly-solvable-theorems might be close to something natural and not human specific, in which case a theorem prover can be class I. But, designing such a theorem prover would require us to first discover the specification of this natural class.

[-]Eliezer Yudkowsky5y101

You'd also need to prevent the system from knowing too much about its own source code or the computers it was running on. Anyways, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don't know how to save the world using a theorem-prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.

[-]Vanessa Kosoy5y*20

In general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it's not at all clear that this is useful against AI risk, and I wasn't implying otherwise.

[EDIT: I amended the class system to account for this.]

[-]Evan R. Murphy5y00

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"
Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter."

How does giving the former "planner" AI the current scenario as input turn it into the latter "acting" AI? It still only outputs a plan, which then the operators can review and decide whether or not to carry out.

Also, the planner AI that Richard put forth had two inputs, not one. The inputs were: 1) a scenario, and 2) a goal. So for Eliezer (or anyone who confidently understood this part of the discussion), which goal input are you providing to the planner AI in this situation? Are you saying that the planner AI becomes dangerous when it's provided with the current scenario and any goal as inputs?
