# 37


This is the final discussion log in the Late 2021 MIRI Conversations sequence, featuring Rohin Shah and Eliezer Yudkowsky, with additional comments from Rob Bensinger, Nate Soares, Richard Ngo, and Jaan Tallinn.

The discussion begins with summaries and comments on Richard and Eliezer's debate. Rohin's summary has since been revised and published in the Alignment Newsletter.

After this log, we'll be concluding this sequence with an AMA, where we invite you to comment with questions about AI alignment, cognition, forecasting, etc. Eliezer, Richard, Paul Christiano, Nate, and Rohin will all be participating.



is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?

In case it helps anyone to hear different people talking about the same thing, I think Eliezer in this quote is describing something similar to my discussion here (search for the phrase “RL-on-thoughts”).

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal. [EDIT TO ADD: in retrospect, my phrasing here is a false dichotomy, see my follow-up comment].

I can generate lots of arguments for why it would be aimed towards achievement of a misaligned goal, such as (1) only a tiny fraction of goals are aligned; the rest are misaligned, (2) the feedback we provide is unlikely to be the right goal and even small errors are fatal, (3) lots of misaligned goals are compatible with the feedback we provide even if the feedback is good, since the AGI might behave well until it can execute a treacherous turn, (4) the one example of strategically aware intelligence (i.e. humans) is misaligned relative to its creator. (I'm not saying I agree with these arguments, but I do understand them.)

That seems like a pretty good list to me.

If I'm reading Rohin correctly, he was gearing up to argue that the claim “We don't know how to ensure that the AGI's eventual (inner) goal is something-in-particular that we want” is different from the claim “If we have a bad process that entails some randomness in the AGI's eventual (inner) goal, then it's (e.g.) 99% likely that the AGI's eventual (inner) goal will wind up being one that's incompatible with human life,” and that the latter claim was not justified by Eliezer here. If so, I'd tentatively agree with Rohin on that. I just put in the number 99% as an example. The real percentage is not obvious to me. I think it depends on the details of the “bad process”, such that it's not very useful to discuss in the abstract. (I do think >99% is a reasonable guess for at least some approaches.)

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

I didn't really get into this with Eliezer but like Richard I'm pretty unclear on why "not trying to win the debate" (with the strong sense of trying) implies "insufficiently capable to be pivotal". I don't think humans are "trying" in the strong sense, but we sure seem very capable; it doesn't seem crazy to imagine that this continues.

If I'm reading Rohin correctly, he was gearing up to argue that the claim

I wasn't really gearing up to argue anything. For most of this conversation I was in the mode of "what is the argument that convinces Eliezer of near-certain doom (rather than just suggesting it is plausible), because I don't see it".

The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.

And yes I agree, that was bad of me to have listed those two things as if they're the only two options.

I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do something, then that “something” is most likely to be vaguely like “win the debate” or something else with similarly-destructive consequences.

A different issue is whether that “most likely” is 99.9% vs 80% or whatever—that part is not immediately obvious to me.

And yet another question is whether we can push that probability much lower, even towards zero, by not using the most straightforward debate setup, but rather adding things to the setup that are directly targeted at sculpting the AGI's motivations.

I am not in fact convinced of near-certain doom there—that would be my Consequentialism & Corrigibility post. (I am convinced that we don't have a good plan right now.)

I agree that we don't have a plan that we can be justifiably confident in right now.

I don't see why the "destructive consequences" version is most likely to arise, especially since it doesn't seem to arise for humans. (In terms of Rob's continuum, humans seem much closer to #2-style trying.)

Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.

Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human case, given that we're a social animal, we can't be surprised to find that the human brainstem reward function inserts lots of socially-related motivations into us, including things like caring about other humans (which sometimes generalizes to caring about other living creatures) and generally wanting to fit in and follow norms under most circumstances, etc. Whereas other things in the world have no relationship to the innate human brainstem reward function, and predictably, basically no one cares about them, except insofar as they become instrumentally useful for something else we do care about. (There are interesting rare exceptions, like human superstitions.) An example in humans would be the question of whether pebbles on the sidewalk are more often an even number of centimeters apart versus an odd number of centimeters apart.

In the straightforward debate setup, I can't see any positive reason for the reward function to directly paint a valence, either positive or negative, onto the idea of the AGI taking over the world. So I revert to the default expectation that the AGI will view “I take over the world” in a way that's analogous to how humans view “the pebbles on the sidewalk are an even number of centimeters apart”—i.e., totally neutral, except insofar as it becomes instrumentally relevant for something else. Meanwhile, the reward signal is directly painting positive valence onto some aspect(s) of winning the debate. It's hard to say exactly what that aspect will be—in fact I think it will be at least somewhat random. But whatever it is, it seems to me to be >50% likely that the AGI can get more of it by taking over the world. I might get as high as “>80%” or “>90%” before I start shrugging and saying “I don't really know”.

(Then we can start talking about capability windows etc., but I don't think that was your objection here.)

But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that.

Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here.

Then we can start talking about capability windows etc., but I don't think that was your objection here.

Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don't want something uncomputable) and die immediately.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

Suppose I'm debating someone about gun control, and they say 'guns don't kill people; people kill people'. Here are four different scenarios for how I might respond:

• 1. Almost as a pure reflex, before I can stop myself, I blurt out 'That's bullshit!' in response. It's not the best way to win the debate, but heck, I've heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)

• 2. I remember that there's a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn't super sympathetic to my political views; so I'll have to come up with some argument that's convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: 'Guns and people both kill people!' Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing? A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, 'People will kill people regardless of whether guns are present?' Ugh, wait, that's exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever... And now my mind is wandering, thinking about gun suicide, and... come on, focus. 'Guns don't kill people. People kill people.' How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it's just obvious common sense that giving someone more firepower will increase their ability to kill others; and, for example, it will make it likelier that someone kills someone else in a fit of passion, where they might not have committed murder if they'd been delayed a few minutes. Oh, hey, I can use that! I like how matter-of-fact that response is. And it will be more persuasive to the judge, because it's not making any strong or outrageous-sounding claims, or building a big edifice of argument; it's just making a simple challenge, which then puts the ball in the other side's court and makes it seem like the burden of proof lies with them now. Anyway, I'm feeling tired after thinking this hard, and I'm running out of time, so let's just go with that idea...

Or, instead:

• 3. Wait, why am I focusing so much on the $1000 prize for this TV show? Being on this show is an amazing opportunity: I could make way more than $1000 if I hijack the live broadcast to start promoting my business to the televised audience. Actually, what if I just tried to negotiate a deal with my debate opponent. Or, heck, with the producers...

Or, instead:

• 4. Sorry, I don't have time to think about that debate question, I'm busy building a Dyson swarm to harvest the Sun's energy so that I can make the future awesome. I... really don't care about the $1000, no, relative to the larger stakes here.

If "trying" is a very strong word that literally implies you have to be trying to take over the world, then only scenario #4 involves me "trying" to win the debate. But I think it makes more sense to say that I'm trying in all four cases (or at least in cases #2, #3, and #4, where I'm displaying some strategy in deciding what to say).

You might then respond that we should try to build AI systems that are "trying" in the weak sense of #2, rather than in sense #3 or #4. But I think Eliezer and Steven's point is that #2, #3, and #4 are on a continuum, rather than being qualitatively different.

(Even #1 is on the continuum in some respects, since my brain needs to be engaging in smart creative search processes somewhere in order to even generate strategies like 'get mad in response to X' or 'find an angry-sounding thing to say in response when I get mad'.)

#2, #3, and #4 are all cases where I'm performing a search for strategies that will get me what I want, and where I evaluate various candidate responses to see how helpful they look. The difference between these options is in how wide a space of strategies I'm considering, and in how efficiently and intelligently I'm zeroing in on the highest-rated strategies in that space. (Where 'highest-rated' is relative to what I want.)
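To make the "same search, different space" point concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the discussion: the `search` helper, the `budget` knob standing in for search effort, and the invented dollar payoffs. It only shows that one scoring procedure can return a #2-style or a #3-style answer depending on which candidates it is allowed to consider and how many of them it evaluates.

```python
def search(strategy_space, score, budget):
    """Score up to `budget` candidates (in order) and return the best one found."""
    considered = strategy_space[:budget]
    return max(considered, key=score)


# Hypothetical payoffs in dollars for the debate example; the numbers are invented.
payoff = {
    "blurt 'that's bullshit'": 0,
    "matter-of-fact argument": 1000,
    "hijack the broadcast": 50000,
}

# Narrow space: only things that count as debate moves (roughly #1 and #2).
narrow_space = ["blurt 'that's bullshit'", "matter-of-fact argument"]
# Wide space: the same moves plus one #3-style option.
wide_space = narrow_space + ["hijack the broadcast"]

narrow_choice = search(narrow_space, payoff.get, budget=10)
wide_choice = search(wide_space, payoff.get, budget=10)
# Same wide space, but the search stops before reaching the third option.
limited_choice = search(wide_space, payoff.get, budget=2)
```

Nothing about `search` changes between the three calls; only the space and the budget do, which is one way of reading the claim that #2, #3, and #4 sit on a continuum.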

I totally agree those are on a continuum. I don't think this changes my point? It seems like Eliezer is confident that "reduce x-risk to EDIT: sub-50%" requires being all the way on the far side of that continuum, and I don't see why that's required.

("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)

(Done)

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act — no smart search for strategies at all. But surely there has to be smart search going on somewhere in the system, or how is it doing a bunch of useful novel scientific work?

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

It sounds like you think my position is "here is my plan to save the world and I have a story for how it will work", whereas my actual view is "here is a story in which humanity is stupid and covers itself in shame by taking on huge amounts of x-risk (e.g. 5%), where we have no strong justification for being confident that we'll survive, but the empirical situation ends up being such that we survive anyway".

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.

My sense is that Eliezer would say that this story is completely implausible, i.e. this hypothesized empirical situation is ruled out by knowledge that Eliezer has. But I don't know what knowledge rules this out. (I'm pretty sure it has to do with his intuitions about a Core of General Intelligence, and/or why capabilities generalize faster than alignment, but I don't know where those intuitions come from, nor do I share them.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.

You (a human) already exhibit #2-style trying. Despite this, you are not capable of "establishing a stable governance regime that regulates AI development" or "doing alignment research better than any existing human alignment researchers" (the latter is tautologically true, even).

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as "pivotal"). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in "#2-style trying". And if you buy Eliezer's/Nate's argument that it's the search itself that's dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call "oracle AI" (or something), then it's not a large jump from there to "maybe the search decides 'killing all humans' rates highly according to its search criteria".

But perhaps you're conceptualizing this whole "trying" thing differently, because you go on to say:

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

which actually just does not parse in my native ontology. Like, in my ontology "sufficiently scaled-up reflex-like things" stop behaving reflexively. It's not that you have this abstract label "reflex-like", that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim "these things exist on a continuum" (which claim it seems like you agreed with in a parallel comment thread, which simply furthers my confusion).

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described

Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.

in my model reflexiveness is a property of actions,

Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I've seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don't know if it is actually true.)

See also the first few sections of Argument, intuition, and recursion; in the language of that post I'm thinking of "explicit argument" as "trying", and "intuition" as "reflex-like", even though they output the same thing.

Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I'm saying that it's plausible that there's a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.

Or, to try and say things a bit more concretely, I find it plausible that there is more scaling from improving the efficiency of the search (e.g. by having better tuned heuristics and intuitions), than from expanding the domain of possible plans considered by the search. The 4 styles of trying that Rob mentioned exist on a continuum like "domain of possible plans", but instead we mostly walk up the continuum of "efficiency / competence of search within the domain".

(The resulting world looks more like CAIS than like a singular superintelligence with a DSA.)

(And I'll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.)

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

I suggested doing this using quantilization.
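For readers who haven't seen the idea, here is a minimal sketch of a quantilizer in Python, assuming the usual formulation: draw candidate actions from some trusted base distribution, keep the top `q` fraction by utility, and pick uniformly among those instead of taking the argmax. The function name, the sampled-list approximation of the base distribution, and the toy utility are all assumptions made for illustration; this is not a claim about how it would be applied to debate.

```python
import random


def quantilize(base_samples, utility, q, rng):
    """Return a uniformly random element from the top-q fraction of base_samples.

    base_samples: actions drawn from a trusted base distribution (assumed given).
    utility:      function scoring each action; higher is better.
    q:            fraction in (0, 1]; q=1 ignores utility, small q approaches argmax.
    rng:          a random.Random instance, passed in for reproducibility.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    keep = max(1, int(len(ranked) * q))  # always keep at least one action
    return rng.choice(ranked[:keep])


# Toy usage: actions are the integers 0..99 and utility is the identity.
rng = random.Random(0)
actions = list(range(100))
picks = [quantilize(actions, lambda a: a, 0.25, rng) for _ in range(200)]
```

The knob `q` is what limits search quality here: the system still prefers better-scoring actions, but it never concentrates all its probability on the single highest-rated (and possibly most extreme) one.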

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

I think these kinds of comments update readers' beliefs in a bad, invalid way. The bad event (AGI ruin) is argued for by... a request for me to condition on testimony of a survivor of that bad event. Yes, I know the whole thing is tongue-in-cheek. I know that EY is not literally claiming to be a time-traveller.

But in TurnTrout-culture, "I experienced X" is something to be said when X has actually been experienced. "The fact that X" is to be said when X is actually supported by a heap of accepted evidence.[1] "Have you met dath ilani?" is to be said when such entities actually exist and are not outputs of the model of intelligence which is being argued for. (Yes, that last one was flagged as a "bad argument", but still.)

This paragraph of EY self-fic didn't update me at all. But it almost did. When these statements are made, I am inclined to update my beliefs in the predictable way -- to gullibly update on claims -- unless I take special effort to not update on (checks dialogue) fictional evidence. Which effort I do take (as a matter of reflex, at this point), but that effort is a cost imposed on me.

[1] This particular kind of misleading statement wasn't made in this dialogue, but I've seen it made erroneously-according-to-me in private correspondence with smart researchers.

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.

I've previously noticed this weakness in myself. What lineage did Eliezer learn this from? I would appreciate any suggestions or advice on how to become stronger at this.

This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.

Other good sources on which to try this exercise:

Early reports from Aysajan are that integration of this exercise into standard reading habits has resulted in a significant step-change improvement in understanding what's going on in nontrivial technical papers/posts, and also seems to spur a lot more independent thoughts/understanding in response to reading. Don't know yet how robust/reproducible this is, so if you practice the exercise a bit, please let me know how it goes.

(Fun side note: you can think of this technique as an application of very basic model theory to human rationality.)

CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:

• Rationalist taboo
• A group version of rationalist taboo where an instructor holds an everyday object and asks the class to describe it in concrete terms.
• The Monday-Tuesday game
• A role-playing game where the instructor plays a management consultant whose advice is impressive-sounding but contentless bullshit, and where the class has to force the consultant to be specific and concrete enough to be either wrong or trivial.
• People were encouraged to make a habit of saying "can you give an example?" in everyday conversation. I practiced it a lot.

IIRC, Eliezer taught the class in May 2012? He talks about the relevant skills here and here. And then I ran it a few times, and then CFAR dropped it; I don't remember why.

Cryptography was mentioned in this post in a relevant manner, though I don't have enough experience with it to advocate it with certainty. Some lineages of physics (EY points to Feynman) try to evoke this, though its pervasiveness has decreased. You may have some luck with Zen. Generally speaking, I think if you look at the Sequences, the themes of physics, security mindset, and Zen are invoked for a reason.

If being versed in cryptography were enough, then I wouldn't expect Eliezer to claim to be one of the last living descendants of this lineage.

Why would Zen help (and why do you think that)?