This is the second post in a series of transcribed conversations about AGI forecasting and alignment. See the first post for prefaces and more information about the format.



 

5. September 14 conversation

 

5.1. Recursive self-improvement, abstractions, and miracles

 

[Yudkowsky][11:00] 

Good morning / good evening.

So it seems like the obvious thread to pull today is your sense that I'm wrong about recursive self-improvement and consequentialism in a related way?

[Ngo][11:04] 

Right. And then another potential thread (probably of secondary importance) is the question of what you mean by utility functions, and digging more into the intuitions surrounding those.

But let me start by fleshing out this RSI/consequentialism claim.

I claim that your early writings about RSI focused too much on a very powerful abstraction, of recursively applied optimisation; and too little on the ways in which even powerful abstractions like this one become a bit... let's say messier, when they interact with the real world.

In particular, I think that Paul's arguments that there will be substantial progress in AI in the leadup to a RSI-driven takeoff are pretty strong ones.

(Just so we're on the same page: to what extent did those arguments end up shifting your credences?)

[Yudkowsky][11:09] 

I don't remember being shifted by Paul on this at all. I sure shifted a lot over events like Alpha Zero and the entire deep learning revolution. What does Paul say that isn't encapsulated in that update - does he furthermore claim that we're going to get fully smarter-than-human in all regards AI which doesn't cognitively scale much further either through more compute or through RSI?

[Ngo][11:10] 

Ah, I see. In that case, let's just focus on the update from the deep learning revolution.

[Yudkowsky][11:12][11:13] 

I'll also remark that I see my foreseeable mistake there as having little to do with "abstractions becoming messier when they interact with the real world" - this truism tells you very little of itself, unless you can predict directional shifts in other variables just by contemplating the unknown messiness relative to the abstraction.

Rather, I'd see it as a neighboring error to what I've called the Law of Earlier Failure, where the Law of Earlier Failure says that, compared to the interesting part of the problem where it's fun to imagine yourself failing, you usually fail before then, because of the many earlier boring points where it's possible to fail.

The nearby reasoning error in my case is that I focused on an interesting way that AI capabilities could scale and the most powerful argument I had to overcome Robin's objections, while missing the way that Robin's objections could fail even earlier through rapid scaling and generalization in a more boring way.

It doesn't mean that my arguments about RSI were false about their domain of supposed application, but that other things were also true and those things happened first on our timeline. To be clear, I think this is an important and generalizable issue with the impossible task of trying to forecast the Future, and if I am wrong about other things it sure would be plausible if I was wrong in similar ways.

[Ngo][11:13] 

Then the analogy here is something like: there is a powerful abstraction, namely consequentialism; and we both agree that (like RSI) a large amount of consequentialism is a very dangerous thing. But we disagree on the question of how much the strategic landscape in the leadup to highly-consequentialist AIs is affected by other factors apart from this particular abstraction.

"this truism tells you very little of itself, unless you can predict directional shifts in other variables just by contemplating the unknown messiness relative to the abstraction"

I disagree with this claim. It seems to me that the predictable direction in which the messiness pushes is away from the applicability of the high-level abstraction.

[Yudkowsky][11:15] 

The real world is messy, but good abstractions still apply, just with some messiness around them. The Law of Earlier Failure is not a failure of the abstraction being messy, it's a failure of the subject matter ending up different such that the abstractions you used were about a different subject matter.

When a company fails before the exciting challenge where you try to scale your app across a million users, because you couldn't hire enough programmers to build your app at all, the problem is not that you had an unexpectedly messy abstraction about scaling to many users, but that the key determinants were a different subject matter than "scaling to many users".

Throwing 10,000 TPUs at something and actually getting progress - not very much of a famous technological idiom at the time I was originally arguing with Robin - is not a leak in the RSI abstraction, it's just a way of getting powerful capabilities without RSI.

[Ngo][11:18] 

To me the difference between these two things seems mainly semantic; does it seem otherwise to you?

[Yudkowsky][11:18] 

If I'd been arguing with somebody who kept arguing in favor of faster timescales, maybe I'd have focused on that different subject matter and gotten a chance to be explicitly wrong about it. I mainly see my ur-failure here as letting myself be influenced by the whole audience that was nodding along very seriously to Robin's arguments, at the expense of considering how reality might depart in either direction from my own beliefs, and not just how Robin might be right or how to persuade the audience.

[Ngo][11:19] 

Also, "throwing 10,000 TPUs at something and actually getting progress" doesn't seem like an example of the Law of Earlier Failure - if anything it seems like an Earlier Success

[Yudkowsky][11:19] 

it's an Earlier Failure of Robin's arguments about why AI wouldn't scale quickly, so my lack of awareness of this case of the Law of Earlier Failure is why I didn't consider why Robin's arguments could fail earlier

though, again, this is a bit harder to call if you're trying to call it in 2008 instead of 2018

but it's a valid lesson that the future is, in fact, hard to predict, if you're trying to do it in the past

and I would not consider it a merely "semantic" difference as to whether you made a wrong argument about the correct subject matter, or a correct argument about the wrong subject matter

these are like... very different failure modes that you learn different lessons from

but if you're not excited by these particular fine differences in failure modes or lessons to learn from them, we should perhaps not dwell upon that part of the meta-level Art

[Ngo][11:21] 

Okay, so let me see if I understand your position here.

Due to the deep learning revolution, it turned out that there were ways to get powerful capabilities without RSI. This isn't intrinsically a (strong) strike against the RSI abstraction; and so, unless we have reason to expect another similarly surprising revolution before reaching AGI, it's not a good reason to doubt the consequentialism abstraction.

[Yudkowsky][11:25] 

Consequentialism and RSI are very different notions in the first place. Consequentialism is, in my own books, significantly simpler. I don't see much of a conceptual connection between the two myself, except insofar as they both happen to be part of the connected fabric of a coherent worldview about cognition.

It is entirely reasonable to suspect that we may get another surprising revolution before reaching AGI. Expecting a particular revolution that gives you particular miraculous benefits is much more questionable and is an instance of conjuring expected good from nowhere, like hoping that you win the lottery because the first lottery ball comes up 37. (Also, if you sincerely believed you actually had info about what kind of revolution might lead to AGI, you should shut up about it and tell very few carefully selected people, not bake it into a public dialogue.)

[Ngo][11:28] 

[and I would not consider it a merely "semantic" difference as to whether you made a wrong argument about the correct subject matter, or a correct argument about the wrong subject matter]

On this point: the implicit premise of "and also nothing else will break this abstraction or render it much less relevant" turns a correct argument about the wrong subject matter into an incorrect argument.

[Yudkowsky][11:28] 

Sure.

Though I'd also note that there's an important lesson of technique where you learn to say things like that out loud instead of keeping them "implicit".

Learned lessons like that are one reason why I go through your summary documents of our conversation and ask for many careful differences of wording about words like "will happen" and so on.

[Ngo][11:30] 

Makes sense.

So I claim that:

1. A premise like this is necessary for us to believe that your claims about consequentialism lead to extinction.

2. A surprising revolution would make it harder to believe this premise, even if we don't know which particular revolution it is.

3. If we'd been told back in 2008 that a surprising revolution would occur in AI, then we should have been less confident in the importance of the RSI abstraction to understanding AGI and AGI risk.

[Yudkowsky][11:32][11:34] 

Suppose I put to you that this claim is merely subsumed by all of my previous careful qualifiers about how we might get a "miracle" and how we should be trying to prepare for an unknown miracle in any number of places. Why suspect that place particularly for a model-violation?

I also think that you are misinterpreting my old arguments about RSI, in a pattern that matches some other cases of your summarizing my beliefs as "X is the one big ultra-central thing" rather than "X is the point where the other person got stuck and Eliezer had to spend a lot of time arguing".

I was always claiming that RSI was a way for AGI capabilities to scale much further once they got far enough, not the way AI would scale to human-level generality.

This continues to be a key fact of relevance to my future model, in the form of the unfalsified original argument about the subject matter it previously applied to: if you lose control of a sufficiently smart AGI, it will FOOM, and this fact about what triggers the metaphorical equivalent of a full nuclear exchange and a total loss of the gameboard continues to be extremely relevant to what you have to do to obtain victory instead.

[Ngo][11:34][11:35] 

Perhaps we're interpreting the word "miracle" in quite different ways.

I think of it as an event with negligibly small probability.

[Yudkowsky][11:35] 

Events that actually have negligibly small probability are not much use in plans.

[Ngo][11:35] 

Which I guess doesn't fit with your claims that we should be trying to prepare for a miracle.

[Yudkowsky][11:35] 

Correct.

[Ngo][11:35] 

But I'm not recalling off the top of my head where you've claimed that.

I'll do a quick search of the transcript

"You need to hold your mind open for any miracle and a miracle you didn't expect or think of in advance, because at this point our last hope is that in fact the future is often quite surprising."

Okay, I see. The connotations of "miracle" seemed sufficiently strong to me that I didn't interpret "you need to hold your mind open" as practical advice.

What sort of probability, overall, do you assign to us being saved by what you call a miracle?

[Yudkowsky][11:40] 

It's not a place where I find quantitative probabilities to be especially helpful.

And if I had one, I suspect I would not publish it.

[Ngo][11:41] 

Can you leak a bit of information? Say, more or less than 10%?

[Yudkowsky][11:41] 

Less.

Though a lot of that is dominated, not by the probability of a positive miracle, but by the extent to which we seem unprepared to take advantage of it, and so would not be saved by one.

[Ngo][11:41] 

Yeah, I see.

 

5.2. The idea of expected utility

 

[Ngo][11:43] 

Okay, I'm now significantly less confident about how much we actually disagree.

At least about the issues of AI cognition.

[Yudkowsky][11:44] 

You seem to suspect we'll get a particular miracle having to do with "consequentialism", which means that although it might be a miracle to me, it wouldn't be a miracle to you.

There is something forbidden in my model that is not forbidden in yours.

[Ngo][11:45] 

I think that's partially correct, but I'd call it more a broad range of possibilities in the rough direction of you being wrong about consequentialism.

[Yudkowsky][11:46] 

Well, as much as it may be nicer to debate when the other person has a specific positive expectation that X will work, we can also debate when I know that X won't work and the other person remains ignorant of that. So say more!

[Ngo][11:47] 

That's why I've mostly been trying to clarify your models rather than trying to make specific claims of my own.

Which I think I'd prefer to continue doing, if you're amenable, by asking you about what entities a utility function is defined over - say, in the context of a human.

[Yudkowsky][11:51][11:53] 

I think that to contain the concept of Utility as it exists in me, you would have to do homework exercises I don't know how to prescribe. Maybe one set of homework exercises like that would be showing you an agent, including a human, making some set of choices that allegedly couldn't obey expected utility, and having you figure out how to pump money from that agent (or present it with money that it would pass up).

Like, just actually doing that a few dozen times.

Maybe it's not helpful for me to say this? If you say it to Eliezer, he immediately goes, "Ah, yes, I could see how I would update that way after doing the homework, so I will save myself some time and effort and just make that update now without the homework", but this kind of jumping-ahead-to-the-destination is something that seems to me to be... dramatically missing from many non-Eliezers. They insist on learning things the hard way and then act all surprised when they do. Oh my gosh, who would have thought that an AI breakthrough would suddenly make AI seem less than 100 years away the way it seemed yesterday? Oh my gosh, who would have thought that alignment would be difficult?

Utility can be seen as the origin of Probability within minds, even though Probability obeys its own, simpler coherence constraints.

that is, you will have money pumped out of you, unless you weigh in your mind paths through time according to some quantitative weight, which determines how much resources you're willing to spend on preparing for them

this is why sapients think of things as being more or less likely
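
As an illustration of the money-pump exercise described above, here is a minimal sketch, assuming a hypothetical agent whose preferences run in a cycle (it will pay a small fee to trade A for B, B for C, and C for A). The names `PREFERS` and `run_money_pump`, the fee, and the starting cash are all invented for the example; nothing here comes from the conversation beyond the idea that incoherent choices can have money pumped out of them.

```python
# Hypothetical agent with cyclic (intransitive) preferences.
# PREFERS[x] is the item the agent will pay a small fee to swap x for.
PREFERS = {"A": "B", "B": "C", "C": "A"}


def run_money_pump(start_item, start_cash_cents, fee_cents, rounds):
    """Offer the agent its preferred swap up to `rounds` times, charging a fee each time.

    Returns the item the agent ends up holding and its remaining cash in cents.
    """
    item, cash = start_item, start_cash_cents
    for _ in range(rounds):
        if cash < fee_cents:
            break  # the agent can no longer afford the swap
        item = PREFERS[item]  # the agent accepts, since it prefers the new item
        cash -= fee_cents     # and pays the fee for the privilege
    return item, cash


if __name__ == "__main__":
    item, cash = run_money_pump("A", 1000, 1, 300)
    print(item, cash)
```

After 300 swaps at one cent each, the agent holds the same item it started with and has 300 cents less. An agent with a consistent ranking of A, B, and C would stop accepting swaps once it held its favourite item, so the loop could extract at most two fees from it.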

[Ngo][11:53] 

Suppose that this agent has some high-level concept - say, honour - which leads it to pass up on offers of money.

[Yudkowsky][11:55] 

[Suppose that this agent has some high-level concept - say, honour - which leads it to pass up on offers of money.]

then there's two possibilities:

  • this concept of honor is something that you can see as helping to navigate a path through time to a destination
  • honor isn't something that would be optimized into existence by optimization pressure for other final outcomes

[Ngo][11:55] 

Right, I see.

Hmm, but it seems like humans often don't see concepts as helping to navigate a path in time to a destination. (E.g. the deontological instinct not to kill.)

And yet those concepts were in fact optimised into existence by evolution.

[Yudkowsky][11:59] 

You're describing a defect of human reflectivity about their consequentialist structure, not a departure from consequentialist structure. 🙂

[Ngo][12:01] 

(Sorry, internet was slightly buggy; switched to a better connection now.)

[Yudkowsky][12:01] 

But yes, from my perspective, it creates a very large conceptual gap that I can stare at something for a few seconds and figure out how to parse it as navigating paths through time, while others think that "consequentialism" only happens when their minds are explicitly thinking about "well, what would have this consequence" using language.

Similarly, when it comes to Expected Utility, I see that any time something is attaching relative-planning-weights to paths through time, not when a human is thinking out loud about putting spoken numbers on outcomes

[Ngo][12:02] 

Human consequentialist structure was optimised by evolution for a different environment. Insofar as we are consequentialists in a new environment, it's only because we're able to be reflective about our consequentialist structure (or because there are strong similarities between the environments).

[Yudkowsky][12:02] 

False.

It just generalized out-of-distribution because the underlying coherence of the coherent behaviors was simple.

When you have a very simple pattern, it can generalize across weak similarities, not "strong similarities".

The human brain is large but the coherence in it is simple.

The idea, the structure, that explains why the big thing works, is much smaller than the big thing.

So it can generalize very widely.

[Ngo][12:04] 

Taking this example of the instinct not to kill people - is this one of the "very simple patterns" that you're talking about?

[Yudkowsky][12:05] 

"Reflectivity" doesn't help per se unless on some core level a pattern already generalizes, I mean, either a truth can generalize across the data or it can't? So I'm a bit puzzled about why you're bringing up "reflectivity" in this context.

And, no.

An instinct not to kill doesn't even seem to me like a plausible cross-cultural universal. 40% of deaths among Yanomami men are in intratribal fights, iirc.

[Ngo][12:07] 

Ah, I think we were talking past each other. When you said "this concept of honor is something that you can see as helping to navigate a path through time to a destination" I thought you meant "you" as in the agent in question (as you used it in some previous messages) not "you" as in a hypothetical reader.

[Yudkowsky][12:07] 

ah.

it would not have occurred to me to ascribe that much competence to an agent that wasn't a superintelligence.

even I don't have time to think about why more than 0.01% of my thoughts do anything, but thankfully, you don't have to think about why 2 + 2 = 4 for it to be the correct answer for counting sheep.

[Ngo][12:10] 

Got it.

I might now try to throw a high-level (but still inchoate) disagreement at you and see how that goes. But while I'm formulating that, I'm curious what your thoughts are on where to take the discussion.

Actually, let's spend a few minutes deciding where to go next, and then take a break

I'm thinking that, at this point, there might be more value in moving onto geopolitics

[Yudkowsky][12:19] 

Some of my current thoughts are a reiteration of old despair: It feels to me like the typical Other within EA has no experience with discovering unexpected order, with operating a generalization that you can expect will cover new cases even when that isn't immediately obvious, with operating that generalization to cover those new cases correctly, with seeing simple structures that generalize a lot and having that be a real and useful and technical experience; instead of somebody blathering in a non-expectation-constraining way about how "capitalism is responsible for everything wrong with the world", and being able to extend that to lots of cases.

I could try to use much simpler language in hopes that people actually look-at-the-water Feynman-style, like "navigating a path through time" instead of Consequentialism which is itself a step down from Expected Utility.

But you actually do lose something when you throw away the more technical concept. And then people still think that either you instantly see in the first second how something is a case of "navigating a path through time", or that this is something that people only do explicitly when visualizing paths through time using that mental terminology; or, if Eliezer says that it's "navigating time" anyways, this must be an instance of Eliezer doing that thing other people do when they talk about how "Capitalism is responsible for all the problems of the world". They have no experience operating genuinely useful, genuinely deep generalizations that extend to nonobvious things.

And in fact, being able to operate some generalizations like that is a lot of how I know what I know, in reality and in terms of the original knowledge that came before trying to argue that knowledge with people. So trying to convey the real source of the knowledge feels doomed. It's a kind of idea that our civilization has lost, like that college class Feynman ran into.

[Soares][12:19] 

My own sense (having been back for about 20min) is that one of the key cruxes is in "is it possible that non-scary cognition will be able to end the acute risk period", or perhaps "should we expect a longish regime of pre-scary cognition, that we can study and learn to align in such a way that by the time we get scary cognition we can readily align it".

[Ngo][12:19] 

Some potential prompts for that:

  • what are some scary things which might make governments take AI more seriously than they took covid, and which might happen before AGI
  • how much of a bottleneck in your model is governmental competence? and how much of a difference do you see in this between, say, the US and China?

[Soares][12:20] 

I also have a bit of a sense that there's a bit more driving to do on the "perhaps EY is just wrong about the applicability of the consequentialism arguments" (in a similar domain), and would be happy to try articulating a bit of what I think are the not-quite-articulated-to-my-satisfaction arguments on that side.

[Yudkowsky][12:21] 

I also had a sense - maybe mistaken - that RN did have some specific ideas about how "consequentialism" might be inapplicable. though maybe I accidentally refuted that in passing because the idea was "well, what if it didn't know what consequentialism was?" and then I explained that reflectivity was not required to make consequentialism generalize. but if so, I'd like RN to say explicitly what specific idea got refuted that way. or failing that, talk about the specific idea that didn't get refuted.

[Ngo][12:23] 

That wasn't my objection, but I do have some more specific ideas, which I could talk about.

And I'd also be happy for Nate to try articulating some of the arguments he mentioned above.

[Yudkowsky][12:23] 

I have a general worry that this conversation has gotten too general, and that it would be more productive, even of general understanding, to start from specific ideas and shoot those down specifically.

[Ngo: 👍]

[Ngo][12:26] 

The other thing is that, for pedagogical purposes, I think it'd be useful for you to express some of your beliefs about how governments will respond to AI

I think I have a rough guess about what those beliefs are, but even if I'm right, not everyone who reads this transcript will be

[Yudkowsky][12:28] 

Why would I be expected to know that? I could talk about weak defaults and iterate through an unending list of possibilities.

Thinking that Eliezer thinks he knows that to any degree of specificity feels like I'm being weakmanned!

[Ngo][12:28] 

I'm not claiming you have any specific beliefs

[Yudkowsky][12:29] 

I suppose I have skepticism when other people dream up elaborately positive and beneficial reactions apparently drawn from some alternate nicer political universe that had an absolutely different response to Covid-19, and so on.

[Ngo][12:29] 

But I'd guess that your models rule out, for instance, the US and China deeply cooperating on AI before it's caused any disasters

[Yudkowsky][12:30] 

"Deeply"? Sure. That sounds like something that has never happened, and I'm generically skeptical about political things that go better than any political thing has ever gone before.

I guess we could talk about that? It doesn't seem like the most productive area, but maybe it lies upstream of more technical disagreements because we disagree about what AGI would actually have to do to have the world not end.

[Soares][12:31] 

Cool. I claim it's time for a break, and then I nominate a little Eliezer gov't-response-overview followed by specific maybe-consequentialism-based-worries-aren't-a-problem-in-practice ideas from Richard.

[Ngo: 👍]

[Yudkowsky][12:32] 

See you in 28mins

 

5.3. Epistemology, and assessing the idea of expected utility

 

[Ngo][13:03] 

[Some of my current thoughts are a reiteration of old despair: It feels to me like the typical Other within EA has no experience with discovering unexpected order, with operating a generalization that you can expect will cover new cases even when that isn't immediately obvious, with operating that generalization to cover those new cases correctly, with seeing simple structures that generalize a lot and having that be a real and useful and technical experience; instead of somebody blathering in a non-expectation-constraining way about how "capitalism is responsible for everything wrong with the world", and being able to extend that to lots of cases.]

[I could try to use much simpler language in hopes that people actually look-at-the-water Feynman-style, like "navigating a path through time" instead of Consequentialism which is itself a step down from Expected Utility.]

[But you actually do lose something when you throw away the more technical concept. And then people still think that either you instantly see in the first second how something is a case of "navigating a path through time", or that this is something that people only do explicitly when visualizing paths through time using that mental terminology; or, if Eliezer says that it's "navigating time" anyways, this must be an instance of Eliezer doing that thing other people do when they talk about how "Capitalism is responsible for all the problems of the world". They have no experience operating genuinely useful, genuinely deep generalizations that extend to nonobvious things.]

[And in fact, being able to operate some generalizations like that is a lot of how I know what I know, in reality and in terms of the original knowledge that came before trying to argue that knowledge with people. So trying to convey the real source of the knowledge feels doomed. It's a kind of idea that our civilization has lost, like that college class Feynman ran into.]

Ooops, didn't see this comment earlier. With respect to discovering unexpected order, one point that seems relevant is the extent to which that order provides predictive power. To what extent do you think that predictive successes in economics are important evidence for expected utility theory being a powerful formalism? (Or are there other ways in which it's predictively powerful that provide significant evidence?)

I'd be happy with a quick response to that, and then on geopolitics, here's a prompt to kick us off:

  • If the only two actors involved in AGI development were the US and the UK governments, how much safer (or less safe) would you think we were compared with a world in which the two actors are the US and Chinese governments? How about a world in which the US government was a decade ahead of everyone else in reaching AGI?

[Yudkowsky][13:06] 

I think that the Apollo space program is much deeper evidence for Utility. Observe, if you train protein blobs to run around the savanna, they also go to the moon!

If you think of "utility" as having something to do with the human discipline called "economics" then you are still thinking of it in a much much much more narrow way than I do.

[Ngo][13:07] 

I'm not asking about evidence for utility as an abstraction in general, I'm asking for evidence based on successful predictions that have been made using it.

[Yudkowsky][13:10] 

That doesn't tend to happen a lot, because all of the deep predictions that it makes are covered by shallow predictions that people made earlier.

Consider the following prediction of evolutionary psychology: Humans will enjoy activities associated with reproduction!

"What," says Simplicio, "you mean like dressing up for dates? I don't enjoy that part."

"No, you're overthinking it, we meant orgasms," says the evolutionary psychologist.

"But I already knew that, that's just common sense!" replies Simplicio.

"And yet it is very specifically a prediction of evolutionary psychology which is not made specifically by any other theory of human minds," replies the evolutionary psychologist.

"Not an advance prediction, just-so story, too obvious," replies Simplicio.

[Ngo][13:11] 

Yepp, I agree that most of its predictions won't be new. Yet evolution is a sufficiently powerful theory that people have still come up with a range of novel predictions that derive from it.

Insofar as you're claiming that expected utility theory is also very powerful, then we should expect that it also provides some significant predictions.

[Yudkowsky][13:12] 

An advance prediction of the notion of Utility, I suppose, is that if you train an AI which is otherwise a large blob of layers - though this may be inadvisable for other reasons - to the point where it starts solving lots of novel problems, that AI will tend to value aspects of outcomes with weights, and weight possible paths through time (the dynamic progress of the environment), and use (by default, usually, roughly) the multiplication of these weights to allocate limited resources between mutually conflicting plans.
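
The weighting scheme described in the message above can be sketched in a few lines, assuming a toy setting in which each plan is a list of (probability, value) pairs. The plan names, numbers, and the helper functions `expected_utility` and `choose_plan` are hypothetical, chosen only to show what "the multiplication of these weights" looks like when used to pick between mutually conflicting plans; this is not a claim about how any trained system represents such weights internally.

```python
def expected_utility(plan):
    """Sum of probability * value over the plan's possible outcomes."""
    return sum(prob * value for prob, value in plan["outcomes"])


def choose_plan(plans):
    """Pick the plan with the highest expected utility (only one can be pursued)."""
    return max(plans, key=expected_utility)


# Hypothetical plans competing for the same limited resources.
# Each outcome is (weight on that path through time, value placed on where it leads).
PLANS = [
    {"name": "safe", "outcomes": [(1.0, 10.0)]},
    {"name": "risky", "outcomes": [(0.1, 200.0), (0.9, 0.0)]},
    {"name": "long_shot", "outcomes": [(0.01, 500.0), (0.99, 0.0)]},
]

if __name__ == "__main__":
    best = choose_plan(PLANS)
    print(best["name"], expected_utility(best))
```

With these made-up numbers the "risky" plan wins, because 0.1 × 200 exceeds both the certain 10 and the long shot's 0.01 × 500. Changing either the path weights or the outcome values changes the choice, which is the sense in which both kinds of weight jointly drive the allocation.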

[Ngo][13:13] 

Again, I'm asking for evidence in the form of successful predictions.

[Yudkowsky][13:14] 

I predict that people will want some things more than others, think some possibilities are more likely than others, and prefer to do things that lead to stuff they want a lot through possibilities they think are very likely!

[Ngo][13:15] 

It would be very strange to me if a theory which makes such strong claims about things we can't yet verify can't shed light on anything which we are in a position to verify.

[Yudkowsky][13:15] 

If you think I'm deriving my predictions of catastrophic alignment failure through something more exotic than that, you're missing the reason why I'm so worried. It doesn't take intricate complicated exotic assumptions.

It makes the same kind of claims about things we can't verify yet as it makes about things we can verify right now.

[Ngo][13:16] 

But that's very easy to do! Any theory can do that.

[Yudkowsky][13:17] 

For example, if somebody wants money, and you set up a regulation which prevents them from making money, it predicts that the person will look for a new way to make money that bypasses the regulation.

[Ngo][13:17] 

And yes, of course fitting previous data is important evidence in favour of a theory

[Yudkowsky][13:17] 

[But that's very easy to do! Any theory can do that.]

False! Any theory can do that in the hands of a fallible agent which invalidly, incorrectly derives predictions from the theory.

[Ngo][13:18] 

Well, indeed. But the very point at hand is whether the predictions you base on this theory are correctly or incorrectly derived.

[Yudkowsky][13:18] 

It is not the case that every theory does an equally good job of predicting the past, given valid derivations of predictions.

Well, hence the analogy to evolutionary psychology. If somebody doesn't see the blatant obviousness of how sexual orgasms are a prediction specifically of evolutionary theory, because it's "common sense" and "not an advance prediction", what are you going to do? We can, in this case, with a lot more work, derive more detailed advance predictions about degrees of wanting that correlate in detail with detailed fitness benefits. But that's not going to convince anybody who overlooked the really blatant and obvious primary evidence.

What they're missing there is a sense of counterfactuals, of how the universe could just as easily have looked if the evolutionary origins of psychology were false: why should organisms want things associated with reproduction, why not instead have organisms running around that want things associated with rolling down hills?

Similarly, if optimizing complicated processes for outcomes hard enough, didn't produce cognitive processes that internally mapped paths through time and chose actions conditional on predicted outcomes, human beings would... not think like that? What am I supposed to say here?

[Ngo][13:24] 

Let me put it this way. There are certain traps that, historically, humans have been very liable to fall into. For example, seeing a theory, which seems to match so beautifully and elegantly the data which we've collected so far, it's very easy to dramatically overestimate how much that data favours that theory. Fortunately, science has a very powerful social technology for avoiding this (i.e. making falsifiable predictions) which seems like approximately the only reliable way to avoid it - and yet you don't seem concerned at all about the lack of application of this technology to expected utility theory.

[Yudkowsky][13:25] 

This is territory I covered in the Sequences, exactly because "well it didn't make a good enough advance prediction yet!" is an excuse that people use to reject evolutionary psychology, some other stuff I covered in the Sequences, and some very predictable lethalities of AGI.

[Ngo][13:26] 

With regards to evolutionary psychology: yes, there are some blatantly obvious ways in which it helps explain the data available to us. But there are also many people who have misapplied or overapplied evolutionary psychology, and it's very difficult to judge whether they have or have not done so, without asking them to make advance predictions.

[Yudkowsky][13:26] 

I talked about the downsides of allowing humans to reason like that, the upsides, the underlying theoretical laws of epistemology (which are clear about why agents that reason validly or just unbiasedly would do that without the slightest hiccup), etc etc.

In the case of the theory "people want stuff relatively strongly, predict stuff relatively strongly, and combine the strengths to choose", what kind of advance prediction that no other theory could possibly make, do you expect that theory to make?

In the worlds where that theory is true, how should it be able to prove itself to you?

[Ngo][13:28] 

I expect deeper theories to make more and stronger predictions.

I'm currently pretty uncertain if expected utility theory is a deep or shallow theory.

But deep theories tend to shed light in all sorts of unexpected places.

[Yudkowsky][13:30] 

The fact is, when it comes to AGI (general optimization processes), we have only two major datapoints in our dataset, natural selection and humans. So you can either try to reason validly about what theories predict about natural selection and humans, even though we've already seen the effects of those; or you can claim to give up in great humble modesty while actually using other implicit theories instead to make all your predictions and be confident in them.

[Ngo][13:30] 

[I talked about the downsides of allowing humans to reason like that, the upsides, the underlying theoretical laws of epistemology (which are clear about why agents that reason validly or just unbiasedly would do that without the slightest hiccup), etc etc.]

I'm familiar with your writings on this, which is why I find myself surprised here. I could understand a perspective of "yes, it's unfortunate that there are no advance predictions, it's a significant weakness, I wish more people were doing this so we could better understand this vitally important theory". But that seems very different from your perspective here.

[Yudkowsky][13:32] 

Oh, I'd love to be making predictions using a theory that made super detailed advance predictions made by no other theory which had all been borne out by detailed experimental observations! I'd also like ten billion dollars, a national government that believed everything I honestly told them about AGI, and a drug that raises IQ by 20 points.

[Ngo][13:32] 

The very fact that we have only two major datapoints is exactly why it seems like such a major omission that a theory which purports to describe intelligent agency has not been used to make any successful predictions about the datapoints we do have.

[Yudkowsky][13:32][13:33] 

This is making me think that you imagine the theory as something much more complicated and narrow than it is.

Just look at the water.

Not very special water with an index.

Just regular water.

People want stuff. They want some things more than others. When they do stuff they expect stuff to happen.

These are predictions of the theory. Not advance predictions, but predictions nonetheless.

[Ngo][13:33][13:33] 

I'm accepting your premise that it's something deep and fundamental, and making the claim that deep, fundamental theories are likely to have a wide range of applications, including ones we hadn't previously thought of.

Do you disagree with that premise, in general?

[Yudkowsky][13:36] 

I don't know what you really mean by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", especially when it comes to structures that are this simple. It sounds like you're still imagining something I mean by Expected Utility which is some narrow specific theory like a particular collection of gears that are appearing in lots of places.

Are numbers a deep fundamental theory?

Is addition a deep fundamental theory?

Is probability a deep fundamental theory?

Is the notion of the syntax-semantics correspondence in logic and the notion of a generally semantically valid reasoning step, a deep fundamental theory?

[Ngo][13:38] 

Yes to the first three, all of which led to very successful novel predictions.

[Yudkowsky][13:38] 

What's an example of a novel prediction made by the notion of probability?

[Ngo][13:38] 

Most applications of the central limit theorem.

[Yudkowsky][13:39] 

Then I should get to claim every kind of optimization algorithm which used expected utility, as a successful advance prediction of expected utility? Optimal stopping and all the rest? Seems cheap and indeed invalid to me, and not particularly germane to whether these things appear inside AGIs, but if that's what you want, then sure.

[Ngo][13:39] 

[These are predictions of the theory. Not advance predictions, but predictions nonetheless.]

I agree that it is a prediction of the theory. And yet it's also the case that smarter people than either of us have been dramatically mistaken about how well theories fit previously-collected data. (Admittedly we have advantages which they didn't, like a better understanding of cognitive biases - but it seems like you're ignoring the possibility of those cognitive biases applying to us, which largely negates those advantages.)

[Yudkowsky][13:42] 

I'm not ignoring it, just adjusting my confidence levels and proceeding, instead of getting stuck in an infinite epistemic trap of self-doubt.

I don't live in a world where you either have the kind of detailed advance experimental predictions that should convince the most skeptical scientist and render you immune to all criticism, or, alternatively, you are suddenly in a realm beyond the reach of all epistemic authority, and you ought to cuddle up into a ball and rely only on wordless intuitions and trying to put equal weight on good things happening and bad things happening.

I live in a world where I proceed with very strong confidence if I have a detailed formal theory that made detailed correct advance predictions, and otherwise go around saying, "well, it sure looks like X, but we can be on the lookout for a miracle too".

If this was a matter of thermodynamics, I wouldn't even be talking like this, and we wouldn't even be having this debate.

I'd just be saying, "Oh, that's a perpetual motion machine. You can't build one of those. Sorry." And that would be the end.

Meanwhile, political superforecasters go on making well-calibrated predictions about matters much murkier and more complicated than these, often without anything resembling a clearly articulated theory laid forth at length, let alone one that had made specific predictions even retrospectively. They just go do it instead of feeling helpless about it.

[Ngo][13:45] 

[Then I should get to claim every kind of optimization algorithm which used expected utility, as a successful advance prediction of expected utility? Optimal stopping and all the rest? Seems cheap and indeed invalid to me, and not particularly germane to whether these things appear inside AGIs, but if that's what you want, then sure.]

These seem better than nothing, but still fairly unsatisfying, insofar as I think they are related to more shallow properties of the theory.

Hmm, I think you're mischaracterising my position. I nowhere advocated for feeling helpless or curling up in a ball. I was just noting that this is a particularly large warning sign which has often been valuable in the past, and it seemed like you were not only speeding past it blithely, but also denying the existence of this category of warning signs.

[Yudkowsky][13:48] 

I think you're looking for some particular kind of public obeisance that I don't bother to perform internally because I'd consider it a wasted motion. If I'm lost in a forest I don't bother going around loudly talking about how I need a forest theory that makes detailed advance experimental predictions in controlled experiments, but, alas, I don't have one, so now I should be very humble. I try to figure out which way is north.

When I have a guess at a northerly direction, it would then be an error to proceed with as much confidence as if I'd had a detailed map and had located myself upon it.

[Ngo][13:49] 

Insofar as I think we're less lost than you do, then the weaknesses of whichever forest theory implies that we're lost are relevant for this discussion.

[Yudkowsky][13:49] 

The obeisance I make in that direction is visible in such statements as, "But this, of course, is a prediction about the future, which is well-known to be quite difficult to predict, in fact."

If my statements had been matters of thermodynamics and particle masses, I would not be adding that disclaimer.

But most of life is not a statement about particle masses. I have some idea of how to handle that. I do not need to constantly recite disclaimers to myself about it.

I know how to proceed when I have only a handful of data points which have already been observed and my theories of them are retrospective theories. This happens to me on a daily basis, eg when dealing with human beings.

[Soares][13:50] 

(I have a bit of a sense that we're going in a circle. It also seems to me like there's some talking-past happening.)

(I suggest a 5min break, followed by EY attempting to paraphrase RN to his satisfaction and vice versa.)

[Yudkowsky][13:51] 

I'd have more trouble than usual paraphrasing RN because epistemic helplessness is something I find painful to type out.

[Soares][13:51] 

(I'm also happy to attempt to paraphrase each point as I see it; it may be that this smooths over some conversational wrinkle.)

[Ngo][13:52] 

Seems like a good suggestion. I'm also happy to move on to the next topic. This was meant to be a quick clarification.

[Soares][13:52] 

nod. It does seem to me like it possibly contains a decently sized meta-crux, about what sorts of conclusions one is licensed to draw from what sorts of observations

that, eg, might be causing Eliezer's probabilities to concentrate but not Richard's.

[Yudkowsky][13:52] 

Yeah, this is in the opposite direction of "more specificity".

[Soares: 😝][Ngo: 😆]

I frankly think that most EAs suck at explicit epistemology, OpenPhil and FHI affiliated EAs are not much of an exception to this, and I expect I will have more luck talking people out of specific errors than talking them out of the infinite pit of humble ignorance considered abstractly.

[Soares][13:54] 

Ok, that seems to me like a light bid to move to the next topic from both of you. My new proposal is that we take a 5min break and then move to the next topic, and perhaps I'll attempt to paraphrase each point here in my notes, and if there's any movement in the comments there we can maybe come back to it later.

[Ngo: 👍]

[Ngo][13:54] 

Broadly speaking I am also strongly against humble ignorance (albeit to a lesser extent than you are).

[Yudkowsky][13:55] 

I'm off to take a 5-minute break, then!

 

5.4. Government response and economic impact

 

[Ngo][14:02] 

A meta-level note: I suspect we're around the point of hitting significant diminishing marginal returns from this format. I'm open to putting more time into the debate (broadly construed) going forward, but would probably want to think a bit about potential changes in format.

[Soares][14:04, moved two up in log] 

[A meta-level note: I suspect we're around the point of hitting significant diminishing marginal returns from this format. I'm open to putting more time into the debate (broadly construed) going forward, but would probably want to think a bit about potential changes in format.]

(Noted, thanks!)

[Yudkowsky][14:03] 

I actually think that may just be a matter of at least one of us, including Nate, having to take on the thankless job of shutting down all digressions into abstractions and the meta-level.

[Ngo][14:05] 

[I actually think that may just be a matter of at least one of us, including Nate, having to take on the thankless job of shutting down all digressions into abstractions and the meta-level.]

I'm not so sure about this, because it seems like some of the abstractions are doing a lot of work.

[Yudkowsky][14:03][14:04] 

Anyways, government reactions?

It seems to me like the best observed case for government reactions - which I suspect is no longer available in the present era as a possibility - was the degree of cooperation between the USA and Soviet Union about avoiding nuclear exchanges.

This included such incredibly extravagant acts of cooperation as installing a direct line between the President and Premier!

which is not what I would really characterize as very "deep" cooperation, but it's more than a lot of cooperation you see nowadays.

More to the point, both the USA and Soviet Union proactively avoided doing anything that might lead towards starting down a path that led to a full nuclear exchange.

[Ngo][14:04] 

The question I asked earlier:

  • If the only two actors involved in AGI development were the US and the UK governments, how much safer (or less safe) would you think we were compared with a world in which the two actors are the US and Chinese governments? How about a world in which the US government was a decade ahead of everyone else in reaching AGI?

[Yudkowsky][14:05] 

They still provoked one another a lot, but, whenever they did so, tried to do so in a way that wouldn't lead to a full nuclear exchange.

It was mutually understood to be a strategic priority and lots of people on both sides thought a lot about how to avoid it.

I don't know if that degree of cooperation ever got to the fantastic point of having people from both sides in the same room brainstorming together about how to avoid a full nuclear exchange, because that is, like, more cooperation than you would normally expect from two governments, but it wouldn't shock me to learn that this had ever happened.

It seems obvious to me that if some situation developed nowadays which raised the profile of a possible nuclear exchange between the USA and Russia, we would not currently be able to do anything like installing a Hot Line between the US and Russian offices if such a Hot Line had not already been installed. This is lost social technology from a lost golden age. But still, it's not unreasonable to take this as the upper bound of attainable cooperation; it's been observed within the last 100 years.

Another guess for how governments react is a very simple and robust one backed up by a huge number of observations:

They don't.

They have the same kind of advance preparation and coordination around AGI, in advance of anybody getting killed, as governments had around the mortgage crisis of 2007 in advance of any mortgages defaulting.

I am not sure I'd put this probability over 50% but it's certainly by far the largest probability over any competitor possibility specified to an equally low amount of detail.

I would expect anyone whose primary experience was with government, who was just approaching this matter and hadn't been talked around to weird exotic views, to tell you the same thing as a matter of course.

[Ngo][14:10] 

[But still, it's not unreasonable to take this as the upper bound of attainable cooperation; it's been observed within the last 100 years.]

Is this also your upper bound conditional on a world that has experienced a century's worth of changes within a decade, and in which people are an order of magnitude wealthier than they currently are?

[I am not sure I'd put this probability over 50% but it's certainly by far the largest probability over any competitor possibility specified to an equally low amount of detail.]

which one was this? US/UK?

[Yudkowsky][14:12][14:14] 

Assuming governments do react, we have the problem of "What kind of heuristic could have correctly led us to forecast that the US's reaction to a major pandemic would be for the FDA to ban hospitals from doing in-house Covid tests? What kind of mental process could have led us to make that call?" And we couldn't have gotten it exactly right, because the future is hard to predict; the best heuristic I've come up with, that feels like it at least would not have been surprised by what actually happened, is, "The government will react with a flabbergasting level of incompetence, doing exactly the wrong thing, in some unpredictable specific way."

[which one was this? US/UK?]

I think if we're talking about any single specific government like the US or UK then the probability is over 50% that they don't react in any advance coordinated way to the AGI crisis, to a greater and more effective degree than they "reacted in an advance coordinated way" to pandemics before 2020 or mortgage defaults before 2007.

Maybe some two governments somewhere on Earth will have a high-level discussion between two cabinet officials.

[Ngo][14:14] 

That's one lesson you could take away. Another might be: governments will be very willing to restrict the use of novel technologies, even at colossal expense, in the face of even a small risk of large harms.

[Yudkowsky][14:15] 

[That's one lesson you could take away. Another might be: governments will be very willing to restrict the use of novel technologies, even at colossal expense, in the face of even a small risk of large harms.]

I just... don't know what to do when people talk like this.

It's so absurdly, absurdly optimistic.

It's taking a massive massive failure and trying to find exactly the right abstract gloss to put on it that makes it sound like exactly the right perfect thing will be done next time.

This just - isn't how to understand reality.

This isn't how superforecasters think.

This isn't sane.

[Soares][14:16] 

(be careful about ad hominem)

(Richard might not be doing the insane thing you're imagining, to generate that sentence, etc)

[Ngo][14:17] 

Right, I'm not endorsing this as my mainline prediction about what happens. Mainly what I'm doing here is highlighting that your view seems like one which cherrypicks pessimistic interpretations.

[Yudkowsky][14:18] 

That abstract description "governments will be very willing to restrict the use of novel technologies, even at colossal expense, in the face of even a small risk of large harms" does not in fact apply very well to the FDA banning hospitals from using their well-established in-house virus tests, at risk of the alleged harm of some tests giving bad results, when in fact the CDC's tests were giving bad results and much larger harms were on the way because of bottlenecked testing; and that abstract description should have applied to an effective and globally coordinated ban against gain-of-function research, which didn't happen.

[Ngo][14:19] 

Alternatively: what could have led us to forecast that many countries will impose unprecedentedly severe lockdowns.

[Yudkowsky][14:19][14:21][14:21] 

Well, I didn't! I didn't even realize that was an option! I thought Covid was just going to rip through everything.

(Which, to be clear, it still may, and Delta arguably is in the more primitive tribal areas of the USA, as well as many other countries around the world that can't afford vaccines financially rather than epistemically.)

But there's a really really basic lesson here about the different style of "sentences found in political history books" rather than "sentences produced by people imagining ways future politics could handle an issue successfully".

Reality is so much worse than people imagining what might happen to handle an issue successfully.

[Ngo][14:21][14:21][14:22] 

I might nudge us away from covid here, and towards the questions I asked before.

The question I asked earlier:

  • If the only two actors involved in AGI development were the US and the UK governments, how much safer (or less safe) would you think we were compared with a world in which the two actors are the US and Chinese governments? How about a world in which the US government was a decade ahead of everyone else in reaching AGI?

This being one.

"But still, it's not unreasonable to take this as the upper bound of attainable cooperation; it's been observed within the last 100 years." Is this also your upper bound conditional on a world that has experienced a century's worth of changes within a decade, and in which people are an order of magnitude wealthier than they currently are?

And this being the other.

[Yudkowsky][14:22] 

[Is this also your upper bound conditional on a world that has experienced a century's worth of changes within a decade, and in which people are an order of magnitude wealthier than they currently are?]

I don't expect this to happen at all, or even come remotely close to happening; I expect AGI to kill everyone before self-driving cars are commercialized.

[Yudkowsky][16:29]  (Nov. 14 follow-up comment) 

(This was incautiously put; maybe strike "expect" and put in "would not be the least bit surprised if" or "would very tentatively guess that".)

[Ngo][14:23] 

ah, I see

Okay, maybe here's a different angle which I should have been using. What's the most impressive technology you expect to be commercialised before AGI kills everyone?

[Yudkowsky][14:24] 

[If the only two actors involved in AGI development were the US and the UK governments, how much safer (or less safe) would you think we were compared with a world in which the two actors are the US and Chinese governments?]

Very hard to say; the UK is friendlier but less grown-up. We would obviously be VASTLY safer in any world where only two centralized actors (two effective decision processes) could ever possibly build AGI, though not safe / out of the woods / at over 50% survival probability.

[How about a world in which the US government was a decade ahead of everyone else in reaching AGI?]

Vastly safer and likewise impossibly miraculous, though again, not out of the woods at all / not close to 50% survival probability.

[What's the most impressive technology you expect to be commercialised before AGI kills everyone?]

This is incredibly hard to predict. If I actually had to predict this for some reason I would probably talk to Gwern and Carl Shulman. In principle, there's nothing preventing me from knowing something about Go which lets me predict in 2014 that Go will probably fall in two years, but in practice I did not do that and I don't recall anybody else doing it either. It's really quite hard to figure out how much cognitive work a domain requires and how much work known AI technologies can scale to with more compute, let alone predict AI breakthroughs.

[Ngo][14:27] 

I'd be happy with some very rough guesses

[Yudkowsky][14:27] 

If you want me to spin a scifi scenario, I would not be surprised to find online anime companions carrying on impressively humanlike conversations, because this is a kind of technology that can be deployed without major corporations signing on or regulatory approval.

[Ngo][14:28] 

Okay, this is surprising; I expected something more advanced.

[Yudkowsky][14:29] 

Arguably AlphaFold 2 is already more advanced than that, along certain dimensions, but it's no coincidence that afaik people haven't really done much with AlphaFold 2 and it's made no visible impact on GDP.

I expect GDP not to depart from previous trendlines before the world ends, would be a more general way of putting it.

[Ngo][14:29] 

What's the most least impressive technology that your model strongly rules out happening before AGI kills us all?

[Yudkowsky][14:30] 

you mean least impressive?

[Ngo][14:30] 

oops, yes

That seems like a structurally easier question to answer

[Yudkowsky][14:30] 

"Most impressive" is trivial. "Dyson Spheres" answers it.

Or, for that matter, "perpetual motion machines".

[Ngo][14:31] 

Ah yes, I was thinking that Dyson spheres were a bit too prosaic

[Yudkowsky][14:32] 

My model mainly rules out that we get to certain points and then hang around there for 10 years while the technology gets perfected, commercialized, approved, adopted, ubiquitized enough to produce a visible trendline departure on the GDP graph; not so much various technologies themselves being initially demonstrated in a lab.

I expect that the people who build AGI can build a self-driving car if they want to. Getting it approved and deployed before the world ends is quite another matter.

[Ngo][14:33] 

OpenAI has commercialised GPT-3

[Yudkowsky][14:33] 

Hasn't produced much of a bump in GDP as yet.

[Ngo][14:33] 

I wasn't asking about that, though

I'm more interested in judging how hard you think it is for AIs to take over the world

[Yudkowsky][14:34] 

I note that it seems to me like there is definitely a kind of thinking here, which, if told about GPT-3 five years ago, would talk in very serious tones about how much this technology ought to be predicted to shift GDP, and whether we could bet on that.

By "take over the world" do you mean "turn the world into paperclips" or "produce 10% excess of world GDP over predicted trendlines"?

[Ngo][14:35] 

Turn world into paperclips

[Yudkowsky][14:36] 

I expect this mainly happens as a result of superintelligence, which is way up in the stratosphere far above the minimum required cognitive capacities to get the job done?

The interesting question is about humans trying to deploy a corrigible AGI thinking in a restricted domain, trying to flip the gameboard / "take over the world" without full superintelligence?

I'm actually not sure what you're trying to get at here.

[Soares][14:37] 

(my guess, for the record, is that the crux Richard is attempting to drive for here, is centered more around something like "will humanity spend a bunch of time in the regime where there are systems capable of dramatically increasing world GDP, and if not how can you be confident of that from here")

[Yudkowsky][14:38] 

This is not the sort of thing I feel Confident about.

[Yudkowsky][16:31]  (Nov. 14 follow-up comment) 

(My confidence here seems understated.  I am very pleasantly surprised if we spend 5 years hanging around with systems that can dramatically increase world GDP and those systems are actually being used for that.  There isn't one dramatic principle which prohibits that, so I'm not Confident, but it requires multiple nondramatic events to go not as I expect.)

[Ngo][14:38] 

Yeah, that's roughly what I'm going for. Or another way of putting it: we have some disagreements about the likelihood of humans being able to get an AI to do a pivotal act which saves the world. So I'm trying to get some estimates for what the hardest act you think humans can get an AI to do is.

[Soares][14:39] 

(and that a difference here causes, eg, Richard to suspect the relevant geopolitics happen after a century of progress in 10y, everyone being suddenly much richer in real terms, and a couple of warning shots, whereas Eliezer expects the relevant geopolitics to happen the day after tomorrow, with "realistic human-esque convos" being the sort of thing we get instead of warning shots)

[Ngo: 👍]

[Yudkowsky][14:40] 

I mostly do not expect pseudo-powerful but non-scalable AI powerful enough to increase GDP, hanging around for a while. But if it happens then I don't feel I get to yell "what happened?" at reality, because there's an obvious avenue for it to happen: something GDP-increasing proved tractable to non-deeply-general AI systems.

where GPT-3 is "not deeply general"

[Ngo][14:40] 

Again, I didn't ask about GDP increases, I asked about impressive acts (in order to separate out the effects of AI capabilities from regulatory effects, people-having-AI-but-not-using-it, etc).

Where you can use whatever metric of impressiveness you think is reasonable.

[Yudkowsky][14:42] 

so there's two questions here, one of which is something like, "what is the most impressive thing you can do while still being able to align stuff and make it corrigible", and one of which is "if there's an incorrigible AI whose deeds are being exhibited by fools, what impressive things might it do short of ending the world".

and these are both problems that are hard for the same reason I did not predict in 2014 that Go would fall in 2016; it can in fact be quite hard - even with a domain as fully lawful and known as Go - to figure out which problems will fall to which level of cognitive capacity.

[Soares][14:43] 

Nate's attempted rephrasing: EY's model might not be confident that there's not big GDP boosts, but it does seem pretty confident that there isn't some "half-capable" window between the shallow-pattern-memorizer stuff and the scary-laserlike-consequentialist stuff, and in particular Eliezer seems confident humanity won't slowly traverse that capability regime

[Yudkowsky][14:43] 

that's... allowed? I don't get to yell at reality if that happens?

[Soares][14:44] 

and (shakier extrapolation), that regime is where a bunch of Richard's hope lies (eg, in the beginning of that regime we get to learn how to do practical alignment, and also the world can perhaps be saved midway through that regime using non-laserlike-systems)

[Ngo: 👍]

[Yudkowsky][14:45] 

so here's an example of a thing I don't think you can do without the world ending: get an AI to build a nanosystem or biosystem which can synthesize two strawberries identical down to the cellular but not molecular level, and put them on a plate

this is why I use this capability as the definition of a "powerful AI" when I talk about "powerful AIs" being hard to align, if I don't want to start by explicitly arguing about pivotal acts

this, I think, is going to end up being first doable using a laserlike world-ending system

so even if there's a way to do it with no lasers, that happens later and the world ends before then

[Ngo][14:47] 

Okay, that's useful.

[Yudkowsky][14:48] 

it feels like the critical bar there is something like "invent a whole engineering discipline over a domain where you can't run lots of cheap simulations in full detail"

[Ngo][14:49] 

(Meta note: let's wrap up in 10 mins? I'm starting to feel a bit sleepy.)

[Yudkowsky: 👍][Soares: 👍]

This seems like a pretty reasonable bar

Let me think a bit about where to go from that

While I'm doing so, since this question of takeoff speeds seems like an important one, I'm wondering if you could gesture at your biggest disagreement with this post: https://sideways-view.com/2018/02/24/takeoff-speeds/

[Yudkowsky][14:51] 

Oh, also in terms of scifi possibilities, I can imagine seeing 5% GDP loss because text transformers successfully scaled to automatically filing lawsuits and environmental impact objections.

My read on the entire modern world is that GDP is primarily constrained by bureaucratic sclerosis rather than by where the technological frontiers lie, so AI ends up impacting GDP mainly insofar as it allows new ways to bypass regulatory constraints, rather than insofar as it allows new technological capabilities. I expect a sudden transition to paperclips, not just because of how fast I expect cognitive capacities to scale over time, but because nanomachines eating the biosphere bypass regulatory constraints, whereas earlier phases of AI will not be advantaged relative to all the other things we have the technological capacity to do but which aren't legal to do.

[Shah][12:13]  (Sep. 21 follow-up comment) 

My read on the entire modern world is that GDP is primarily constrained by bureaucratic sclerosis rather than by where the technological frontiers lie

This is a fair point and updates me somewhat towards fast takeoff as operationalized by Paul, though I'm not sure how much it updates me on p(doom).

Er, wait, really fast takeoff as operationalized by Paul makes less sense as a thing to be looking for -- presumably we die before any 1 year doubling. Whatever, it updates me somewhat towards "less deployed stuff before scary stuff is around"

[Ngo][14:56] 

Ah, interesting. What are the two or three main things in that category?

[Yudkowsky][14:57] 

mRNA vaccines, building houses, building cities? Not sure what you mean there.

[Ngo][14:57] 

"things we have the technological capacity to do but which aren't legal to do"

[Yudkowsky][14:58][15:00] 

Eg, you might imagine, "What if AIs were smart enough to build houses, wouldn't that raise GDP?" and the answer is that we already have the pure technology to manufacture homes cheaply, but the upright-stick-construction industry already successfully lobbied to get it banned as it was starting to develop, by adding on various constraints; so the question is not "Is AI advantaged in doing this?" but "Is AI advantaged at bypassing regulatory constraints on doing this?" Not to mention all the other ways that building a house in an existing city is illegal, or that it's been made difficult to start a new city, etcetera.

"What if AIs could design a new vaccine in a day?" We can already do that. It's no longer the relevant constraint. Bureaucracy is the process-limiting constraint.

I would - looking in again at the Sideways View essay on takeoff speeds - wonder whether it occurred to you, Richard, to ask about what detailed predictions all the theories there had made.

After all, a lot of it is spending time explaining why the theories there shouldn't be expected to retrodict even the data points we have about progress rates over hominid evolution.

Surely you, being the evenhanded judge that you are, must have been reading through that document saying, "My goodness, this is even worse than retrodicting a few data points!"

A lot of why I have a bad taste in my mouth about certain classes of epistemological criticism is my sense that certain sentences tend to be uttered on incredibly selective occasions.

[Ngo][14:59][15:06] 

Some meta thoughts: I now feel like I have a pretty reasonable broad outline of Eliezer's views. I haven't yet changed my mind much, but plausibly mostly because I haven't taken the time to internalise those views; once I ruminate on them a bunch, I expect my opinions will shift (uncertain how far; unlikely to be most of the way).

Meta thoughts (continued): Insofar as a strong disagreement remains after that (which it probably will) I feel pretty uncertain about what would resolve it. Best guess is that I should write up some longer essays that try to tie a bunch of disparate strands together.

Near the end it seemed like the crux, to a surprising extent, hinged on this question of takeoff speeds. So the other thing which seems like it'd plausibly help a lot is Eliezer writing up a longer version of his response to Paul's Takeoff Speeds post.

(Just as a brief comment, I don't find the "bureaucratic sclerosis" explanation very compelling. I do agree that regulatory barriers are a huge problem, but they still don't seem nearly severe enough to cause a fast takeoff. I don't have strong arguments for that position right now though.)

[Soares][15:12] 

This seems like a fine point to call it!

Some wrap-up notes

  • I had the impression this round was a bit more frustrating than last rounds. Thanks all for sticking with things 🙂
  • I have a sense that Richard was making a couple points that didn't quite land. I plan to attempt to articulate versions of them myself in the interim.
  • Richard noted he had a sense we're in decreasing return territory. My own sense is that it's worth having at least one more discussion in this format about specific non-consequentialist plans Richard may have hope in, but I also think we shouldn't plow forward in spite of things feeling less useful, and I'm open to various alternative proposals.

In particular, it seems maybe plausible to me we should have a pause for some offline write-ups, such as Richard digesting a bit and then writing up some of his current state, and/or Eliezer writing up some object-level response to the takeoff speed post above?

[Ngo: 👍]

(I also could plausibly give that a go myself, either from my own models or from my model of Eliezer's model which he could then correct)

[Ngo][15:15] 

Thanks Nate!

I endorse the idea of offline writeups

[Soares][15:17] 

Cool. Then I claim we are adjourned for the day, and Richard has the ball on digesting & doing a write-up from his end, and I have the ball on both writing up my attempts to articulate some points, and on either Eliezer or I writing some takes on timelines or something.

(And we can coordinate our next discussion, if any, via email, once the write-ups are in shape.)

[Yudkowsky][15:18] 

I also have a sense that there's more to be said about specifics of govt stuff or specifics of "ways to bypass consequentialism" and that I wish we could spend at least one session trying to stick to concrete details only

Even if it's not where cruxes ultimately lie, often you learn more about the abstract by talking about the concrete than by talking about the abstract.

[Soares][15:22] 

(I, too, would be enthusiastic to see such a discussion, and Richard, if you find yourself feeling enthusiastic or at least not-despairing about it, I'd happily moderate.)

[Yudkowsky][15:37] 

(I'm a little surprised about how poorly I did at staying concrete after saying that aloud, and would nominate Nate to take on the stern duty of blowing the whistle at myself or at both of us.)

44 comments

It feels to me like the typical Other within EA has no experience with discovering unexpected order, with operating a generalization that you can expect will cover new cases even when that isn't immediately obvious, with operating that generalization to cover those new cases correctly, with seeing simple structures that generalize a lot and having that be a real and useful and technical experience; instead of somebody blathering in a non-expectation-constraining way about how "capitalism is responsible for everything wrong with the world", and being able to extend that to lots of cases.

So, there's been a bunch of places in the previous discussion and this one where Yudkowsky is like "there's this difficult-to-explain concept which you don't see and I don't know how to make it visible to you". And this is an extremely frustrating thing to be told repeatedly, kudos to Richard for patience in dealing with it.

I want to highlight this particular difficult-to-explain concept as one which resonates especially strongly with me. It feels to me like an unusually fundamental, crucial aesthetic-heuristic-of-good-reasoning which I rely on very heavily and which unusually many people (including on LessWrong) are completely missing. It's a generalization of the sort of thing I was trying to explain in What's So Bad About Ad-Hoc Mathematical Definitions, or trying to train in specific cases using the Framing Practica. Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.

... but I think an awful lot of people haven't noticed this enough to internalize the pattern, and to such people I expect the world looks like it's just full of lots of narrow theories and there isn't any recognizable-in-advance property which makes some theories way more powerful.

... and now I'm also going to say "sorry, I don't know a fast and effective way to make this visible to you, and I know this is extremely frustrating but it's an important concept and I honestly hope you figure it out". But I can at least give some pointers: try the links above (especially the Framing Practica), and take some physics and economics classes if you haven't done that before, and go do Framing Practicum-like exercises with whatever you learn in physics/economics - i.e. go look for applications of the ideas which don't resemble those you've seen before. And then try to notice the patterns in the kinds-of-models which generalize well in this way, the kinds-of-models for which you'd expect to be able to find lots of novel different-looking applications, and contrast how such models compare to more ad-hoc things.

I second the kudos to Richard, by the way.  In a lot of ways he's an innocent bystander while I say things that aren't aimed mainly at him.

Not a problem. I share many of your frustrations about modesty epistemology and about most alignment research missing the point, so I sympathise with your wanting to express them.

On consequentialism: I imagine that it's pretty frustrating to keep having people misunderstand such an important concept, so thanks for trying to convey it. I currently feel like I have a reasonable outline of what you mean (e.g. to the level where I could generate an analogy about as good as Nate's laser analogy), but I still don't know whether the reason you find it much more compelling than I do is because you understand the details and implications better, or because you have different intuitions about how to treat high-level abstractions (compared with the intuitions I describe here).

At some point when I have a few spare days, I might try to write down my own best understanding of the concept, and try to generate some of those useful analogies and intuition pumps, in the hope that explaining it from a different angle will prove fruitful. Until then, other people who try to do so should feel free to bounce drafts off me.

Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.

Could I ask you to say more about what you mean by "nontrivial predictions" in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory--as a specific candidate for a "strongly generalizing theory"--produces "nontrivial predictions", where it seemed like Eliezer leaned "yes" and Richard leaned "no"), so I'd be interested in hearing more takes on what constitutes "nontrivial predictions", and what role said (nontrivial) predictions play in making a theory more convincing (as compared to other factors such as e.g. elegance/parsimony/[the pattern John talks about which is recognizable]).

Of course, I'd be interested in hearing what Richard thinks of the above as well.

Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.

So, a bacterium needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. Now, there's the obvious prediction that the bacterium won't waste energy turning B into A and then back into B again - i.e. it will suppress one of those two pathways (assuming the cycle is energy-burning), depending on which metabolite is more abundant. Utility generalizes this idea to arbitrarily many reactions and products, and predicts that at any given time we can assign some (non-unique) "values" to each metabolite (including energy carriers), such that any reaction whose reactants have more total "value" than its products is suppressed (or at least not catalyzed; the cell doesn't really have good ways to suppress spontaneous reactions other than putting things in separate compartments).

Of course in practice this will be an approximation, and there may be occasional exceptions where the cell is doing something the model doesn't capture. If we were to do this sort of analysis in a signalling network rather than a metabolic network, for instance, there would likely be many exceptions: cells sometimes burn energy to maintain a concentration at a specific level, or to respond quickly to changes, and this particular model doesn't capture the "value" of information-content in signals; we'd have to extend our value-function in order for the utility framework to capture that. But for metabolic networks, I expect that to mostly not be an issue.

That's really just utility theory; expected utility theory would involve an organism storing some resources over time (like e.g. fat). Then we'd expect to be able to assign "values" such that the relative "value" assigned to stored resources which are not currently used is a weighted sum of the "values" assigned to those resources in different possible future environments (of the sort the organism might find itself in after something like its current environment, in the ancestral world), and the weights in the sums should be consistent. (This is a less-fleshed-out prediction than the other one, but hopefully it's enough of a sketch to give the idea.)
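One way to write down the stored-resource prediction, as a rough sketch rather than a worked-out model: the symbols below are assumptions introduced here for illustration, not notation from the thread. Let $s$ range over stored resources, $e$ over possible future environments, $v_e(s)$ be the "value" of $s$ in environment $e$, and $w_e$ be the weight on environment $e$. Reading "the weights should be consistent" as "the same weights apply to every stored resource", the claim is roughly:

```latex
v(s) \;\propto\; \sum_{e} w_e \, v_e(s),
\qquad w_e \ge 0, \qquad \sum_{e} w_e = 1,
\qquad \text{with the same } w_e \text{ for every stored resource } s .
```

The proportionality sign reflects that the "values" are non-unique: only the relative values of stored resources are pinned down.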

Of course, if we understand expected utility theory deeply, then these predictions are quite trivial; they're just saying that organisms won't make Pareto-suboptimal use of their resources! It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet. [EDIT-TO-ADD: This is itself intended as a falsifiable prediction - if we go look at an anomaly and don't find any unaccounted-for phenomenon, then that's a very big strike against expected utility theory.] And that's the really cool prediction here: it gives us a tool to uncover unknown-unknowns in our understanding of a cell's behavior.
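The metabolic prediction above can be illustrated with a toy consistency check. This is a minimal sketch under strong simplifying assumptions, not a model of any real cell: each active reaction has one reactant and one product, and the "value" of the energy it burns is fixed in advance as a positive number (in the full statement the energy carriers' values would be free variables too). The function name `consistent_values` and the `(reactant, product, gain)` encoding are hypothetical. The check asks whether values can be assigned so that no active reaction loses value; if an energy-burning cycle such as A → B → A is running, no such assignment exists.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of an active reaction: (reactant, product, gain).
# "gain" is the assumed value of the energy the reaction burns, so the product
# must be worth at least that much more than the reactant.
Reaction = Tuple[str, str, float]


def consistent_values(active: List[Reaction]) -> Optional[Dict[str, float]]:
    """Return values v with v[product] - v[reactant] >= gain for every active
    reaction, or None if no such assignment exists."""
    metabolites = sorted({m for r, p, _ in active for m in (r, p)})
    value = {m: 0.0 for m in metabolites}
    # Bellman-Ford-style relaxation over the difference constraints.
    # If the values are still changing after len(metabolites) + 1 passes,
    # the active reactions contain a value-losing (energy-wasting) cycle.
    for _ in range(len(metabolites) + 1):
        changed = False
        for reactant, product, gain in active:
            if value[product] < value[reactant] + gain:
                value[product] = value[reactant] + gain
                changed = True
        if not changed:
            return value
    return None


# Only A -> B is active: a consistent assignment exists.
one_way = consistent_values([("A", "B", 1.0)])
# Both directions are active and both burn energy: no assignment exists.
futile_cycle = consistent_values([("A", "B", 1.0), ("B", "A", 1.0)])
```

In this toy setting, a `None` result plays the role of the "anomaly" in the comment above: either the organism is wasting resources, or the model is missing something useful that the cycle is doing.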

Thanks John for this whole thread!

(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying stuff that is repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that's the case. ;) )

Einstein's arrogance sounds to me like an early pointer in the Sequences for that kind of thing, with a specific claim about General Relativity being that kind of theory.

That being said, I still understand Richard's position and difficulty with this whole part (or at least what I read of Richard's difficulty). He's coming from the perspective of philosophy of science, which has focused mostly on ideas related to advance predictions and taking into account the mental machinery of humans to catch biases and mistakes that we systematically make. The Sequences also spend a massive amount of words on exactly this, and yet in this discussion (and in select points in the Sequences like the aforementioned post), Yudkowsky sounds a bit like he considers that his fundamental theory/observation doesn't need any of these to be accepted as obvious (I don't think he is thinking that way, but that's hard to extract out of the text).

It's even more frustrating because Yudkowsky focuses on "showing epistemic modesty" as his answer/rebuttal to Richard's inquiry, when Richard just sounds like he's asking the completely relevant question "why should we take your word on it?" And the confusion IMO is because the last sentence sounds very status-y (How dare you claim such crazy stuff?), but I'm pretty convinced Richard actually means it in a very methodological/philosophy of science/epistemic strategies way of "What are the ways of thinking that you're using here that you expect to be particularly good at aiming at the truth?"

Furthermore, I agree with (my model of) Richard that the main issue with the way Yudkowsky (and you, John) are presenting your deep idea is that you don't give a way of showing it wrong. For example, you (John) write:

It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet.

And even if I feel what you're gesturing at, this sounds/looks like you're saying "even if my prediction is false, that doesn't mean that my theory would be invalidated". Whereas I feel you want to convey something like "this is not a prediction/part of the theory that has the ability to falsify the theory" or "it's part of the obvious wiggle room of the theory". What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that's what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)

An analogy that comes to my mind is with the barriers for proving P vs NP. These make explicit ways in which you can't solve the P vs NP question, such that it becomes far easier to weed proof attempts out. My impression is that you (Yudkowsky and John) have models/generators that help you see at a glance that a given alignment proposal will fail. Which is awesome! I want to be able to find and extract and use those. But what Richard is pointing out IMO is that having the generators explicit would give us a way to stress test them, which is a super important step to start believing in them further. Just like we want people to actually try to go beyond GR, and for that they need to understand it deeply.

(Obviously, maybe the problem is that, as you two are pointing out, making the models/generators explicit and understandable is just really hard and you don't know how to do that. That's fair.)

To be clear, this part:

It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet.

... is also intended as a falsifiable prediction. Like, if we go look at the anomaly and there's no new thing going on there, then that's a very big strike against expected utility theory.

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory" is a major way the predictions can fail. But then we expect to be able to go look and find the violation of that noncentral modelling assumption, which would itself yield some interesting information. If we don't find such a violation, it's a big strike against the theory.

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory" is a major way the predictions can fail.

That's a great way of framing it! And a great way of thinking about why these are not failures that are "worrisome" at first/in most cases.

And even if I feel what you're gesturing at, this sounds/looks like you're saying "even if my prediction is false, that doesn't mean that my theory would be invalidated". 

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

[Of course, you can show me enough such engines that I end up convinced, or show me the different theoretical edifice that explains both the old observations and these new engines.]

What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that's what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)

So, later Eliezer gives "addition" as an example of a deep fundamental theory. And... I'm not sure I can imagine a universe where addition is wrong? Like, I can say "you would add 2 and 2 and get 5" but that sentence doesn't actually correspond to any universes.

Like, similarly, I can imagine universes where evolution doesn't describe the historical origin of species in that universe. But I can't imagine universes where the elements of evolution are present and evolution doesn't happen.

[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I'm not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is "geometry except for the parallel postulate" is the deep fundamental theory.]

"you can't make an engine more efficient than a Carnot engine."

That's not what it predicts. It predicts you can't make a heat engine more efficient than a Carnot engine.
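For reference, the standard statement of that bound, written in conventional textbook symbols rather than anything used in the thread: for a heat engine drawing heat $Q_h$ from a hot reservoir at absolute temperature $T_h$, rejecting heat to a cold reservoir at $T_c$, and producing work $W$,

```latex
\eta \;=\; \frac{W}{Q_h} \;\le\; \eta_{\text{Carnot}} \;=\; 1 - \frac{T_c}{T_h},
\qquad 0 < T_c < T_h .
```

The bound applies only to devices that convert heat flowing between two reservoirs into work, which is why the restriction to heat engines matters: a device that converts work into work, such as an electric motor, is not limited by it.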

Thanks for the thoughtful answer!

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

My gut reaction here is that "you can't make an engine more efficient than a Carnot engine" is not the right kind of prediction to try to break thermodynamics, because even if you could break it in principle, staying at that level without going into the detailed mechanisms of thermodynamics will only make you try the same thing as everyone else does. Do you think that's an adequate response to your point, or am I missing what you're trying to say?

So, later Eliezer gives "addition" as an example of a deep fundamental theory. And... I'm not sure I can imagine a universe where addition is wrong? Like, I can say "you would add 2 and 2 and get 5" but that sentence doesn't actually correspond to any universes.

Like, similarly, I can imagine universes where evolution doesn't describe the historical origin of species in that universe. But I can't imagine universes where the elements of evolution are present and evolution doesn't happen.

[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I'm not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is "geometry except for the parallel postulate" is the deep fundamental theory.]

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. 

It sounds to me like you (and your internal-Yudkowsky) are using "deep fundamental theory" to mean "powerful abstraction that is useful in a lot of domains". Which addition and evolution fundamentally are. But claiming that the abstraction is useful in some new domain requires some justification IMO. And even if you think the burden of proof is on the critics, the difficulty of formulating the generators makes that really hard.

Once again, do you think that answers your point adequately?

From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.

To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether's theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.

The second law of thermodynamics is similarly deep: it's actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing--and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.

In other words, thermodynamics as a "deep fundamental theory" is not merely [what you characterized as] a "powerful abstraction that is useful in a lot of domains". Thermodynamics is a logically necessary consequence of existing, more primitive notions--and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see "at a glance" why the laws of thermodynamics inevitably follow.

Of course, this doesn't imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a "powerful abstraction that is useful in a lot of domains". So the relevant question would seem to be: given my above described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don't have the properties I think they do.

Both of these are hard to do, however--and for good reason! And absent arguments along those lines, I don't think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about "advance predictions", "systematic biases", and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.

Which is to say:

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. (emphasis mine)

I think you could argue this, yes--but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the "hard way"--no shortcuts.

...and the sense I got from Richard's questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that

I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the one linking recursive self-improvement with cultural evolution, etc etc etc.

Now, I'm happy to agree that all of the links I just mentioned are useful lenses which help you understand AGI. But for utility theory to do the type of work Eliezer tries to make it do, it can't just be a useful lens - it has to be something much more fundamental. And that's what I don't think Eliezer's established.

It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you'll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics' counterexamples fall relevantly short. 

---

The reason I remain a bit skeptical about Eliezer's general picture: I'm not sure if his thinking about AGI makes implicit questionable predictions about humans:

  • I don't understand his thinking well enough to be confident that it doesn't
  • It seems to me that Eliezer_2011 placed weirdly strong emphasis on presenting humans in ways that matched the pattern "(scary) consequentialism always generalizes as you scale capabilities." I consider some of these claims false or at least would want to make the counterexamples more salient

For instance: 

  • Eliezer seemed to think that "extremely few things are worse than death" is something all philosophically sophisticated humans would agree with
  • Early writings on CEV seemed to emphasize things like the "psychological unity of humankind" and talk as though humans would mostly have the same motivational drives, also with respect to how it relates to "enjoying being agenty" as opposed to "grudgingly doing agenty things but wishing you could be done with your obligations faster"
  • In HPMOR all the characters are either not philosophically sophisticated or they were amped up into scary consequentialists plotting all the time

All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren't necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn't spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms of outcome-related ethics (what they want the world to look like), but in terms of relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it's really true that when you try to train an AI to be helpful and corrigible, the "consequentialism-wants-to-become-agenty-with-its-own-goals part" will be stronger than the "helping this person feels meaningful" part. (Leading to an agent that's consequentialist about following proper cognition rather than about other world-outcomes.) 

FWIW I think I mostly share Eliezer's intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all. 

the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).

I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or they might say (as Marx did) that capitalism contains the seeds of its own destruction. Or they might just deny that AGI will play out the way you claim, because their analogy to capitalism is more persuasive than your analogy to humans (or whatever other reasoning you're using). How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?

My broader point is that these types of theories are usually sufficiently flexible that they can "predict" most outcomes, which is why it's so important to pin them down by forcing them to make advance predictions.

On the rest of your comment, +1. I think that one of the weakest parts of Eliezer's argument was when he appealed to the difference between von Neumann and the village idiot in trying to explain why the next step above humans will be much more consequentialist than most humans (although unfortunately I failed to pursue this point much in the dialogue).

How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?


My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.) 

Specifically, the situation I mean is the following:

  • You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
  • You find yourself having strong intuitions about how to apply powerful principles like "consequentialism" to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 

Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 
 

Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.

Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!

(I especially like that you explained the physical points and put links that actually explain the specific implication)

And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.

I think "deep fundamental theory" is deeper than just "powerful abstraction that is useful in a lot of domains".

Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:

Before the correction of the law of polarization, we have thought in vain about the usefulness of the referred facts. Thus, the early emergence of the axon, or the displacement of the soma, appeared to us as unfavorable arrangements acting against the conduction velocity, or the convenient separation of cellulipetal and cellulifugal impulses in each neuron. But as soon as we ruled out the requirement of the passage of the nerve impulse through the soma, everything became clear; because we realized that the referred displacements were morphologic adaptations ruled by the laws of economy of time, space and matter. These laws of economy must be considered as the teleological causes that preceded the variations in the position of the soma and the emergence of the axon. They are so general and evident that, if carefully considered, they impose themselves with great force on the intellect, and once becoming accepted, they are firm bases for the theory of axipetal polarization.

At first, I was surprised to see that the structure of physical space gave the fundamental principles in neuroscience too! But then I realized I shouldn't have been: neurons exist in physical spacetime. It's not a coincidence that neurons look like lightning: they're satisfying similar constraints in the same spatial universe. And once observed, it's easy to guess that what Ramón y Cajal might call "economy of metabolic energy" is also a fundamental principle of neuroscience, which of course is attested by modern neuroscientists. That's when I understood that spatial structure is a Deep Fundamental Theory.

And it doesn't stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.

(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with "geometry except for the parallel postulate".)

That's when I understood that spatial structure is a Deep Fundamental Theory.

And it doesn't stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.

(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with "geometry except for the parallel postulate".)

Can you go into more detail here? I have done a decent amount of maths but always had trouble in physics due to my lack of physical intuition, so it might be completely obvious but I'm not clear about what is "that same thing" or how it explains all your examples? Is it about shortest path? What aspect of differential topology (a really large field) captures it?

(Maybe you literally can't explain it to me without me seeing the deep theory, which would be frustrating, but I'd want to know if that was the case. )

There's more than just differential topology going on, but it's the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This turns conserved quantities into resources whose use is convergently minimized. Minimal structures with certain constraints are thus led to form the same network-like shapes, obeying the same sorts of laws. (See chapter 3 of Grady's Discrete Calculus for details of how this works in the electric circuit case.)
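As a rough illustration of the electric circuit case, here is a minimal sketch of a conservation law written as "net flow at each cell is zero". The 4-node resistor network, its conductances, and the boundary voltages are made-up values chosen only for illustration, and the approach (a signed incidence matrix and weighted graph Laplacian) is one standard discrete formulation, not necessarily the one the book uses.

```python
import numpy as np

# Hypothetical network: (tail node, head node, conductance). Values are illustrative only.
edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 2, 1.0), (1, 3, 2.0), (2, 3, 1.0)]
n_nodes = 4

# Signed incidence matrix: one row per edge, -1 at the tail, +1 at the head.
A = np.zeros((len(edges), n_nodes))
for k, (tail, head, _) in enumerate(edges):
    A[k, tail] = -1.0
    A[k, head] = 1.0
C = np.diag([g for _, _, g in edges])

# Weighted graph Laplacian built from the incidence matrix.
L = A.T @ C @ A

# Assumed boundary condition: node 0 held at 1 V, node 3 at 0 V.
boundary, interior = [0, 3], [1, 2]
v = np.zeros(n_nodes)
v[0] = 1.0

# Interior potentials are whatever makes the net flow at interior nodes vanish.
v[interior] = np.linalg.solve(
    L[np.ix_(interior, interior)],
    -L[np.ix_(interior, boundary)] @ v[boundary],
)

# Ohm's law: current along each edge from tail to head.
currents = -C @ A @ v

# Net inflow at each node (inflow on edges ending here minus outflow on edges starting here).
net_inflow = A.T @ currents
print("interior net inflow:", net_inflow[interior])
print("boundary net inflow:", net_inflow[boundary])
```

In this sketch the interior nodes come out with zero net inflow (Kirchhoff's current law), while the two boundary nodes carry equal and opposite flows, which is the discrete version of "what goes in must come out".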

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. 

Yeah, this seems reasonable to me. I think "how could you tell that theory is relevant to this domain?" seems like a reasonable question in a way that "what predictions does that theory make?" seems like it's somehow coming at things from the wrong angle.

Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising ways to make progress on bridging the gap between Eliezer's and most other people's understandings of consequentialism.

Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling "laser-like" or "coherence") and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in-principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in-principle not be very coherent (e.g. if it just has tons of resources and doesn't have to be very efficient to achieve a goal-state).
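One way to make the conservative-vector-field example concrete is the sketch below. It rests on an assumed reading of the analogy: treat the "value of moving from outcome i to outcome j" as a field on the edges of a graph, and ask whether it is the difference of some potential `u` (playing the role of a utility function). The function name `fit_potential` and the toy numbers are hypothetical.

```python
import numpy as np

def fit_potential(n, edge_values):
    """Least-squares fit of a potential u with u[j] - u[i] ~= edge_values[(i, j)].

    Returns (u, is_conservative). The field is conservative exactly when the
    fit is exact, which is equivalent to every cycle summing to zero.
    """
    pairs = list(edge_values)
    A = np.zeros((len(pairs), n))
    for k, (i, j) in enumerate(pairs):
        A[k, i] = -1.0
        A[k, j] = 1.0
    f = np.array([edge_values[p] for p in pairs], dtype=float)
    u, *_ = np.linalg.lstsq(A, f, rcond=None)
    return u, bool(np.allclose(A @ u, f))

# Path-independent: going 0 -> 1 -> 2 -> 0 gains nothing overall.
coherent = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): -2.0}
# Path-dependent: every step of the loop looks like a gain (a money-pump-like cycle).
cyclic = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0}

u_coherent, ok_coherent = fit_potential(3, coherent)
u_cyclic, ok_cyclic = fit_potential(3, cyclic)
print("coherent field has a potential:", ok_coherent)
print("cyclic field has a potential:", ok_cyclic)
```

Nothing in the first field is "trying" to achieve anything, yet it is still representable by a single scalar function; that is the sense in which coherence and consequentialism can come apart under this reading.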

(I'm not sure whether Eliezer would agree with this. The thing-I-think-Eliezer-means-by-consequentialism does not yet have a good mathematical formulation which I know of, which makes it harder to check that two people even mean the same thing when pointing to the concept.)

My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the "not very coherent consequentialism" is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word "consequentialism" instead of "expected utility maximisation" mainly because people kept misunderstanding what he meant by the latter.)

I don't know enough about conservative vector fields to comment, but on priors I'm pretty skeptical of this being a good example of coherent utilities; I also don't have a good guess about what Eliezer would say here.

Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard's questions were somewhat... skewed? ... relative to the way I normally think about the topic. I'm having some difficulty articulating the source of that skewness, so I'll start by talking about how I think the skewness relates to the conversation itself:

I interpreted Eliezer's remarks as basically attempting to engage with Richard's questions on the same level they were being asked--but I think his lack of ability to come up with compelling examples (to be clear: by "compelling" here I mean "compelling to Richard") likely points at a deeper source of disagreement (which may or may not be the same generator as the "skewness" I noticed). And if I were forced to articulate the thing I think the generator might be...

I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

Probing more at the sense of skewness, I'm getting the sense that this exchange here is deeply relevant:

Richard: I'm accepting your premise that it's something deep and fundamental, and making the claim that deep, fundamental theories are likely to have a wide range of applications, including ones we hadn't previously thought of.

Do you disagree with that premise, in general?

Eliezer: I don't know what you really mean by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", especially when it comes to structures that are this simple. It sounds like you're still imagining something I mean by Expected Utility which is some narrow specific theory like a particular collection of gears that are appearing in lots of places.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would have clarified this for me is for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

But the reason I'm calling the thing "skewness", rather than something more prosaic like "disagreement", is because I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc). I suspect that the frame Richard is operating in would dismiss these questions as largely inconsequential, even though I'm not sure why or what that frame actually looks like; this is a large part of the reason why I have this flagged as a place to look for a deep hidden crux.

(One [somewhat uncharitable] part of me wants to point out that the crux in question may actually just be the "usual culprit" in discussions like this: outside-view/modest-epistemology-style reasoning. This does seem to rhyme a lot with what I wrote above, e.g. it would explain why Richard didn't seem particularly concerned with gears-level failure modes or competing models or the like, and why his line of questioning seemed mostly insensitive to the object-level details of what "advance predictions" look like, why that matters, etc. I do note that Richard actively denied being motivated by this style of reasoning later on in the dialogue, however, which is why I still have substantial uncertainty about his position.)

Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.

I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).

Yes, this is correct. In my frame, getting to a theory that's wrong is actually the hardest part - most theories aimed at unifying phenomena from a range of different domains (aka attempted "deep fundamental theories") are not even wrong (e.g. incoherent, underspecified, ambiguous). Perhaps they can better be understood as evocative metaphors, or intuitions pointing in a given direction, than "theories" in the standard scientific sense.

Expected utility is a well-defined theory in very limited artificial domains. When applied to the rest of the world, the big question is whether it's actually a theory in any meaningful sense, as opposed to just a set of vague intuitions about how a formalism from a particular artificial domain generalises. (As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb's problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.)

So my default reaction to being asked how expected utility theory is wrong about AI feels like the same way I'd react if asked how the theory of fluid dynamics is wrong about the economy. I mean, money flows, right? And the economy can be more or less turbulent... Now, this is an exaggerated analogy, because I do think that there's something very important about consequentialism as an abstraction. But I'd like Yudkowsky to tell me what that is in a way which someone couldn't do if they were trying to sell me on an evocative metaphor about how a technical theory should be applied outside its usual domain - and advance predictions are one of the best ways to verify that.

A more realistic example: cultural evolution. Clearly there's a real phenomenon there, one which is crucial to human history. But calling cultural evolution a type of "evolution" is more like an evocative metaphor than a fundamental truth which we should expect to hold up in very novel circumstances (like worlds where AIs are shaping culture).

I also wrote about this intuition (using the example of the "health points" abstraction) in this comment.

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.  You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem.  That theorem is widely used and praised, so it's "powerful", and it wasn't invented before probability theory, so it's "advance", right?  So we can go on putting probability theory in the same bucket as Newtonian gravity?

They're actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones.  It seems like the sort of thing that would take a subsequence I don't have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which "How do we trust this, when" differs between "I have the following new empirical theory about the underlying model of gravity" and "I think that the logical notion of 'arithmetic' is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions..."  But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

In particular it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"  Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves - but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy.  I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you'd been previously raised to believe in as universal standards about all ideas.

it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the abstraction even make sense?" Because you have the abstraction in your head, and I don't, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.

Perhaps another way of saying it is that they're not crisp/robust/coherent concepts (although I'm open to other terms, I don't think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you're using is a crisper concept than Soros' theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that'd be the most reliable way - but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don't think Soros could come up with.

I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you're still ascribing too much modest epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it's the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.

Of course there's a social component to both, but that's not what I'm primarily interested in. And of course there's a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I'd thank you to assume I'm at least making a more interesting error than that.

Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn't question how much sense it makes to use calculus in the way that you described, but that's because the application of calculus to mechanics is so clearly-defined that it'd be very hard for the type of confusion I talked about above to sneak in. I'd put evolutionary theory halfway between them: it's partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it's easy to do so in a confused way.

That's a really helpful comment (at least for me)!

But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • Does the bucket become simpler and more elegant with each new idea that fits in it?

Is there some truth in this, or am I completely off the mark?

It seems like the sort of thing that would take a subsequence I don't have time to write

You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

So I'm left wondering:

  • Do you disagree with my impression of the value of such a subsequence?
  • Do you think it would have this value but are spending your time doing something more valuable?
  • Do you think it would be valuable but really don't want to write it?
  • Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
  • Something else I'm failing to imagine?

Once again, you do what you want, but I feel like this would be super valuable if there was any way of making that possible. That's also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don't have access to empirical evidence or trial and error at all for AGI-type problems.

(I'm also quite curious if you think this comment by dxu points at the same thing you are pointing at)

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • Does the bucket become simpler and more elegant with each new idea that fits in it?

Sounds like you should try writing it.

You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

I'ma guess that Eliezer thinks there's a long list of sequences he could write meeting these conditions, each on a different topic.

Good point, I hadn't thought about that one.

Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^

(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable though.)

I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.

I'm honestly confused by this answer.

Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.

There is no dichotomy of "theory or practice", we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.) I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

Thanks for giving more details about your perspective.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the sequences that touch on that (Einstein's Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said that there was a new subsequence to write leads me to believe that he doesn't think the information is clearly stated already.

So I don't think you can really treat the current confusion as evidence that explaining how that kind of theory would work doesn't help, given that this explanation isn't readily available in a form I or anyone reading this can access, AFAIK.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating them, all the more so because Yudkowsky has said that he could write a sequence about the process.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

  • First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truthfinding method.
  • Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teaching. Instead I'm pushing for a researcher who has built a coherent collection of intuitions and has thought about the epistemology of this process to share the latter to help us understand the former.
  • A bit similar to my last point, I think the correct comparison here is not "philosophers of science outside the field helping the field", which happens but is rare as you say, but "scientists thinking about epistemology for very practical reasons". And given that the latter is, from my understanding, what started the scientific revolution and was a common activity of all scientists until the big paradigms were established (in physics and biology at least) in the early 20th century, I would say there is a good track record here.
    (Note that this is more your specialty, so I would appreciate evidence that I'm wrong in my historical interpretation here)

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.)

Hum, I certainly like a lot of epistemic stuff, but I would say my tendencies to use epistemology are almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we're studying.

I also have to admit that I'm kind of confused, because I feel like you're consistently using the sort of epistemic discussion that I'm advocating for when discussing predictions and what gives us confidence in a theory, and yet you don't think it would be useful to have a similar-level model of the epistemology used by Yudkowsky to make the sort of judgment you're investigating?

I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

As I wrote above, I don't think this is a good prompt, because what we're talking about here is scientists using epistemology to make sense of their own work.

Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you're asking questions about his model, there's a type mismatch. And when he's answering, having noticed the type mismatch, he's trying to find what to ascribe it to (his answer has been quite consistently modest epistemology, which I think is clearly incorrect). Tracking the confusion does tell you some information about the type mismatch, and is probably part of the process to resolve it. But having his best description of his type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.

As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb's problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.

FDT was made rigorous by infra-Bayesianism, at least in the pseudocausal case.

One of my updates from reading this is that Rapid vs. Gradual takeoff seems like an even more important variable for many people's models than I had assumed. Making this debate less one-sided might thus be super valuable even if writing up arguments is costly.

Comment after reading section 5.3:

Eliezer: What's an example of a novel prediction made by the notion of probability?

Richard: Most applications of the central limit theorem.

Eliezer: Then I should get to claim every kind of optimization algorithm which used expected utility, as a successful advance prediction of expected utility?

...

Richard: These seem better than nothing, but still fairly unsatisfying, insofar as I think they are related to more shallow properties of the theory.

This exchange makes me wonder whether Richard would accept the successes of reinforcement learning as "predictions" of the kind he is looking for? Because RL is essentially the straightforward engineering implementation of "expected utility theory".

I'm still trying to understand the scope of expected utility theory, so examples like this are very helpful! I'd need to think much more about it before I had a strong opinion about how much they support Eliezer's applications of the theory, though.

It's taking a massive massive failure and trying to find exactly the right abstract gloss to put on it that makes it sound like exactly the right perfect thing will be done next time.

I feel like Ngo didn't really respond to this?

Like, later he says: 

Right, I'm not endorsing this as my mainline prediction about what happens. Mainly what I'm doing here is highlighting that your view seems like one which cherrypicks pessimistic interpretations.

But... Richard, are you endorsing it as 'at all in line with the evidence?' Like, when I imagine living in that world, it doesn't have gain-of-function research, which our world clearly does. [And somehow this seems connected to Eliezer's earlier complaints, where it's not obvious to me that when you wrote the explanation, your next step was to figure out what that would actually imply and check if it were true or not.]

I think we live in a world where there are very strong forces opposed to technological progress, which actively impede a lot of impactful work, including technologies which have the potential to be very economically and strategically important (e.g. nuclear power, vaccines, genetic engineering, geoengineering).

This observation doesn't lead me to a strong prediction that all such technologies will be banned; nor even that the most costly technologies will be banned - if the forces opposed to technological progress were even approximately rational, then gain of function research would be one of their main priorities (although I note that they did manage to ban it, the ban just didn't stick).

But when Eliezer points to covid as an example of generalised government failure, and I point to covid as also being an example of the specific phenomenon of people being very wary of new technology, I don't think that my gloss is clearly absurd. I'm open to arguments that say that serious opposition to AI progress won't be an important factor in how the future plays out; and I'm also open to arguments that covid doesn't provide much evidence that there will be serious opposition to AI progress. But I do think that those arguments need to be made.

Parts of this remind me of flaming my team in a cooperative game.

A key rule to remember about team chat in videogames is that chat actions are moves in the game. It might feel satisfying to verbally dunk on my teammate for a̶s̶k̶i̶n̶g̶ ̶b̶i̶a̶s̶e̶d̶ ̶̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ not ganking my lane, and I definitely do it sometimes, but I do it less if I occasionally think "what chat actions can help me win the game from this state?"

This is less than maximally helpful advice in a conversation where you're not sure what "winning" looks like. And some of the more obvious implications might look like the dreaded social obeisance.