With the release of Rohin Shah and Eliezer Yudkowsky's conversation, the Late 2021 MIRI Conversations sequence is now complete.
This post is intended as a generalized comment section for discussing the whole sequence, now that it's finished. Feel free to:
- raise any topics that seem relevant
- signal-boost particular excerpts or comments that deserve more attention
- direct questions to participants
In particular, Eliezer Yudkowsky, Richard Ngo, Paul Christiano, Nate Soares, and Rohin Shah expressed active interest in receiving follow-up questions here. The Schelling time when they're likeliest to be answering questions is Wednesday March 2, though they may participate on other days too.
Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)
I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.
(I haven't brought it up before because it seems to me like the disagreement is much more in the "mechanisms underlying intelligence", which that doc barely talks about, and the stuff it does say feels pretty outdated; I'd say different things now.)
This is mostly in response to stuff written by Richard, but I'm interested in everyone's read of the situation.
I'm not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he's actively trying to be coherent / explain everything. [This is what I think you're pointing to with his 'attitude toward... (read more)
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, and that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict"). At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and that neat stories that fit everything usually don't work well (unless, or often even if, generated in hindsight).
In answer to your "why is this," I think it's a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine and I'm somewhat skeptical of and feel like Eliezer is overselling (which is what got me into this discussion), but that's probably smaller than a large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).
I think I may not understand well the basic lesson / broader point, so will probably be more helpful on object level points and will mostly go answer those in the time I have.
Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I'm moderately confident that every 'hypothesis' corresponds to a 'real world', constrained by how well I can get my imagination to correspond to reality. [A 'finite number' depends on the situation, but I think it's normally something like 2-5, unless it's an area where I've built up a lot of cached knowledge.]
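As a loose illustration of the particle-filtering analogy (not anyone's actual mental algorithm — the hypotheses, likelihoods, and cutoff below are all invented for the example), the "gain or lose weight, get ruled out" dynamic looks something like:

```python
# Each "particle" is a fully fleshed-out hypothesis carrying a weight.
particles = [
    {"hypothesis": "slow takeoff", "weight": 1.0},
    {"hypothesis": "fast takeoff", "weight": 1.0},
    {"hypothesis": "no takeoff", "weight": 1.0},
]

def update(particles, likelihood):
    """Reweight each hypothesis by how well it predicts new evidence,
    then renormalize and drop hypotheses that are effectively ruled out."""
    for p in particles:
        p["weight"] *= likelihood(p["hypothesis"])
    total = sum(p["weight"] for p in particles)
    for p in particles:
        p["weight"] /= total
    return [p for p in particles if p["weight"] > 0.01]

# Toy evidence: an observation that fits "slow takeoff" best.
particles = update(particles, lambda h: {"slow takeoff": 0.6,
                                         "fast takeoff": 0.3,
                                         "no takeoff": 0.1}[h])
```

In a real filter the hypotheses would also occasionally split into more refined variants; here only the reweighting step is sketched.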
Sometimes I'll be tracking a bunch of "surface-level features", where the distributions on the features don't always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability th... (read more)
I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
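A crude sketch of the "draw N samples, and on noticing an inconsistency backtrack and redraw having updated on the logical fact" pattern described above — every feature, fact, and number here is invented for illustration:

```python
import random

random.seed(0)  # for reproducibility of the toy run

def draw_scenario():
    """Draw one 'sample world' as a bundle of surface-level features."""
    return {
        "takeoff": random.choice(["slow", "fast"]),
        "coordination": random.choice(["strong", "weak"]),
    }

def consistent(scenario, learned_facts):
    """Check a sample against logical facts learned so far."""
    return all(fact(scenario) for fact in learned_facts)

def sample_worlds(n, learned_facts):
    """Keep redrawing whole batches until every sample is consistent,
    mirroring 'backtrack and start over with N fresh samples'."""
    while True:
        batch = [draw_scenario() for _ in range(n)]
        if all(consistent(s, learned_facts) for s in batch):
            return batch
        # Inconsistency noticed: discard the whole batch and redraw.

# Invented logical fact: fast takeoff rules out strong coordination.
facts = [lambda s: not (s["takeoff"] == "fast" and s["coordination"] == "strong")]
worlds = sample_worlds(5, facts)
```

The expensive part, as the comment notes, is that `consistent` in reality is a slow, effortful check, so one mostly relies on intuitions trained from past batches.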
I feel like a distinctive feature of Eliezer's dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like "is this property of the described world likely?" which is tough because obviously everyone agrees that ev... (read more)
The most recent post has a related exchange between Eliezer and Rohin:
If I'm being locally nitpicky, I argue that Eliezer's thing is a very mild overstatement (it should be "≤" instead of "<") but given that we're talking about forecasts, we're talking about uncertainty, and so we should expect "less" optimism instead of just "not more" optimism, and so I think Eliezer's statement stands as a general principle about engineering design.
This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there's some deep confusion of mine that I'd like cleared up.
Sorry, I probably should have been more clear about the "this is a quote from a longer dialogue, the missing context is important." I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you're selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of ... (read more)
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They'd go around telling me "Ray, you're exhibiting that bias right now. Whatever rationalization you're coming up with right now, it's not the real reason you're arguing X." And I was like "c'mon man. I have a ton of introspective access to myself and I can tell that this 'rationalization' is actually a pretty good reason to believe X and I trust that my reasoning process is real."
But... eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on "is Ray displaying rational thought?". When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on "does Ray seem biased in this particular way?".
And both checks totally returned 'true', and that was an accurate assessment.
The partic... (read more)
EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe", I'm obviously not claiming that this is a realistic thing that I expect to happen, so it's not coming from my "complete mental universe"; I'm just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say "maybe X happens", or "X is not absurd", I'm saying that my probability distribution assign... (read more)
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
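The even/odd toy model above can be phrased as a check one could actually run: a claimed combination of properties only gets credit once at least one concrete witness is exhibited. A minimal sketch (the properties and search range are invented for the example):

```python
def even(n):
    return n % 2 == 0

def odd(n):
    return n % 2 == 1

def find_witness(properties, candidates):
    """Return one concrete example satisfying every claimed property, or None."""
    for c in candidates:
        if all(p(c) for p in properties):
            return c
    return None

# Each property is individually easy to witness...
even_witness = find_witness([even], range(1, 100))
odd_witness = find_witness([odd], range(1, 100))
# ...but the conjunction "even and odd" has no witness, so the
# combined claim should not be believed.
impossible = find_witness([even, odd], range(1, 100))
```

The analogue for AGI scenarios is exhibiting one fully concrete hopeful world that simultaneously has all the properties one's hopeful argument relies on.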
Relevant Feynman quote:
I'll try to explain the technique and why it's useful. I'll start with a non-probabilistic version of the idea, since it's a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I'm building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy - think energy conservation, or Newton's Laws, or market efficiency, depending on what kind of systems we're talking about. My hope/plan is to derive (i.e. prove) some predictions from these... (read more)
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then not update that maybe future proposals will have problems.
Whereas in reality, I try to break my proposals, don't agree with Eliezer's diagnoses of the problems, and usually don't ask Eliezer because I don't expect his answer to be useful to me (and previously didn't expect him to respond). I expect this is true of others (like Paul and Richard) as well.
I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).
I do note that there are some situations where rushing to tell a 'mainline story' might be the wrong move:
... (read more)
- Maybe your beliefs feel wildly unstable day-to-day -- because you're learning a lot quickly, or because it's just hard to know how to assign weight to the dozens of different considerations that bear on these questions. Then trying to take a quick snapshot of your current view might feel beside the point.
- It might even feel actively counterproductive, like rushing too quickly to impose meaning/structure on data when step one is to make sure you have the data properly loaded up in your head.
- Maybe there are many scen
These conversations are great and I really admire the transparency. It's really nice to see discussions that normally happen in private happen instead in public, where everyone can reflect, give feedback, and improve their own thoughts. On the other hand, the conversations add up to a decent-sized novel - LW says 198,846 words! Is anyone considering investing heavily in summarizing the content, so people can get involved without having to read all of it?
Echoing that I loved these conversations and I'm super grateful to everyone who participated — especially Richard, Paul, Eliezer, Nate, Ajeya, Carl, Rohin, and Jaan, who contributed a lot.
I don't plan to try to summarize the discussions or distill key take-aways myself (other than the extremely cursory job I did on https://intelligence.org/late-2021-miri-conversations/), but I'm very keen on seeing others attempt that, especially as part of a process to figure out their own models and do some evaluative work.
I think I'd rather see partial summaries/responses that go deep, instead of a more exhaustive but shallow summary; and I'd rather see summaries that center the author's own view (what's your personal take-away? what are your objections? which things were small versus large updates? etc.) over something that tries to be maximally objective and impersonal. But all the options seem good to me.
One thing in the posts I found surprising was Eliezer's assertion that you need a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent.
However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any; it can't run a quantum computer factorizing large numbers better than factoring algorithms, but it can do enough to design a nanomachine. (It has been trained to approximate the ground truth of quantum mechanics equations, and it does this very well.)
For example, you could use IDA, start training to imitate a simulation of a handful of particles, then compose several smaller nets into one large one.
Add a nice user interface and we can drag and drop atoms.
You can add optimization, gradient descent trying to maximize the efficiency of a motor, or minimize the size of a logic gate. All of this is optimised to fit a simple equation, so assuming you don't have smart general mesaoptimizers forming, and deducing how to manipulate humans based on very little info about humans, you shoul... (read more)
I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)
Eliezer and Nate, my guess is that most of your perspective on the alignment problem for the past several years has come from the thinking and explorations you've personally done, rather than reading work done by others.
But, if you have read interesting work by others that's changed your mind or given you helpful insights, what has it been? Some old CS textbook? Random Gwern articles? An economics textbook? Playing around yourself with ML systems?
A question for Eliezer: If you were superintelligent, would you destroy the world? If not, why not?
If your answer is "yes" and the same would be true for me and everyone else for some reason I don't understand, then we're probably doomed. If it is "no" (or even just "maybe"), then there must be something about the way we humans think that would prevent world destruction even if one of us were ultra-powerful. If we can understand that and transfer it to an AGI, we should be able to prevent destruction, right?
I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.
From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.
Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.
This may not be what evolution had "in mind" when it created us. But couldn't we copy something like this into a machine so that it "thinks" of us (and our descendants) as its "fellow humans" who should "get nice stuff"? I understand that we don't know how to do that yet. But the fact that Eliezer has some kind of "don't destroy the world from a fellow human perspective" goal function inside his brain seems to mean a) that such a function exists and b) that it can be coded into a neuronal network, right?
I was also thinking about the specific way we humans weigh competing goals and values against each other. So while for instance we do destroy much of the biosphere by blindly pursuing our misaligned goals, some of us still care about nature and animal welfare and rain forests, and we may even be able to prevent total destruction of them.
I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning". (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don't contain any specific ideas on how to implement this. But maybe Muraven has a point here.
To what extent do you think pivotal-acts-in-particular are strategically important (i.e. "successfully do a pivotal act, and if necessary build an AGI to do it" is the primary driving goal), vs "pivotal acts are useful shorthand to refer to the kind of intelligence level where it matters that an AGI be 'really safe'"?
I'm interested in particular in responses from Eliezer, Rohin, and perhaps Richard Ngo. (I've had private chats with Rohin that I thought were useful to share and this comment is sort of creating a framing device for sharing them, but I've bee... (read more)
The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".
If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not get before we all die". In other words, the crux is in whether you can get an alignment solution with the properties I mentioned (and maybe also in whether people will be sensible enough to use the method + do the right governance). So with Eliezer I end up talking about those cruxes, rather than talking about "pivotal acts" per se, but I'm always imagining the "get an alignment solution, have everyone use it" plan.
When I talk to people who are attempting to model Eliezer, or defer to Eliezer, or speaking out of their own model that's heavily Eliezer-based, and I present this plan to them, and then they start thinking about pivotal... (read more)
My Eliezer-model thinks pivotal acts are genuinely, for-real, actually important. Like, he's not being metaphorical or making a pedagogical point when he says (paraphrasing) 'we need to use the first AGI systems to execute a huge, disruptive, game-board-flipping action, or we're all dead'.
When my Eliezer-model says that the most plausible pivotal acts he's aware of involve capabilities roughly at the level of 'develop nanotech' or 'put two cellular-identical strawberries on a plate', he's being completely literal. If some significantly weaker capability level realistically suffices for a pivotal act, then my Eliezer-model wants us to switch to focusing on that (far safer) capability level instead.
If we can save the world before we get anywhere near AGI, then we don't necessarily have to sort out how consequentialist, dangerous, hardware-overhang-y, etc. the first AGI systems will be. We can just push the 'End The Acute Existential Risk Period' button, and punt most other questions to the non-time-pressured Reflection that follows.
Curated. I found the entire sequence of conversations quite valuable, and it seemed good both to let people know it had wrapped up, and curate it while the AMA was still going on.
Question from evelynciara on the EA Forum:
For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there's a large, competent, state-funded successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as they can, pouring billions of dollars into it per year, and by default what happens when they succeed is that they accidentally unleash an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it's optimizing).
It would be pretty easy and cheap for something much smarter than a human to kill all humans. The classic scenario is:... (read more)
There's something I had interpreted the original CEV paper to be implying, but wasn't sure if it was still part of the strategic landscape, which was "have the alignment project be working towards a goal that is highly visibly fair, to disincentivize races." Was that an intentional part of the goal, or was it just that CEV seemed something like "the right thing to do" (independent of its impact on races)?
How does Eliezer think about it now?
Yes, it was an intentional part of the goal.
If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV. I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to prevent, and they would not understand why CEV was a solution to the problem while 'Apple pie and democracy for everyone forever!' was not a solution to that problem. If at most one AGI project can understand the problem to which CEV is a solution, then it's not a solution to races between AGI projects. I suppose it could still be a solution to letting one AGI project scale even when incorporating highly intelligent people with some object-level moral disagreements.
Questions about the standard-university-textbook from the future that tells us how to build an AGI. I'll take answers on any of these!
I'm going to try and write a table of contents for the textbook, just because it seems like a fun exercise.
Epistemic status: unbridled speculation
Volume I: Foundation
Part I: Statistical Learning Theory
Part II: Computational Learning Theory
Part III: Universal Priors
I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:
It seems to me that a major crux about AI strategy routes through "is civilization generally adequate or not?". It seems like people have pretty different intuitions and ontologies here. Here's an attempt at some questions of varying levels of concreteness, to tease out some worldview implications.
(I normally use the phrase "civilizational adequacy", but I think that's kinda a technical term that means a specific thing and I think maybe I'm pointing at a broader concept.)
"Does civilization generally behave sensibly?" This is a vague question, some possible subquestions:
... (read more)
- Do you think major AI orgs will realize that AI is potentially world-endingly dangerous, and have any kind of process at all to handle that? [edit: followup: how sufficient are those processes?]
- Do you think government intervention on AI regulations or policies will be net-positive or net-negative, for purposes of preventing x-risk?
- How quickly do you think the AI ecosystem will update on new "promising" advances (either in the realm of capabilities or the realm of safety)?
- How many intelligent, sensible people do there seem to be in the world who are thinking about AGI? (order of magnitude. like is there 1, 10, 100
I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:
Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.
(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)
(I know less about Chinese orgs but I still think "probably yes" if they become major AGI orgs.)
Net positive, though mostly because it seems kinda hard to be net negative relative to "no regulation at all", not because I think the regulations will be well thought out. The main tradeoff that companies face seems to be speed / capabilities vs safety; it seems unlikely that even "random" regulations increase the speed and capabilities that companies can achieve. (Though it's certainly possible, e.g. a regulation fo... (read more)
It was all very interesting, but what was the goal of these discussions? I mean, I had the impression that pretty much everyone assigned >5% probability to "if we scale we all die", which is already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "at the time they think their solution reduced the probability of failure from 5% to 0.1%, it would still be much higher"? That seems only possible if people don't understand arguments about inner optimizers or whatnot, as opposed to disagreeing with them.
What specific actions do you have in mind when you say "global coordination on safety", and how much of the problem do you think these actions solve?
My own view is that 'caring about AI x-risk at all' is a pretty small (albeit indispensable) step. There are lots of decisions that hinge on things other than 'is AGI risky at all'.
I agree with Rohin that the useful thing is trying to understand each other's overall models of the world and try to converge on them, not p(doom) per se. I gave some examples here of some important implications of having more Paul-ish models versus more Eliezer-ish models.
More broadly, examples of important questions people in the field seem to disagree a lot about:
... (read more)
- How hard is alignment? What are the central obstacles? What kind of difficulty is it? (Is it hard like 'building a secure OS that works on the first try'? Hard like 'the engineering/logistics/implementation portion of the Manhattan Project'? Both? Some other option? Etc.)
- What alignment research directions are p
Eliezer, when you told Richard that your probability of a successful miracle is very low, you added the following note:
I don't mean to ask for positive fairy tales when I ask: could you list some things you could see in the world that would cause you to feel that we were well-prepared to take advantage of one if we got one?
My obvious quick guess would be "I know of an ML project that made a breakthrough as impressive as GPT-3, which is secret from the outside world, and the organization is keenly interested in alignment". But I am also interested in broader and less obvious ones. For example, if the folks around here had successfully made a covid vaccine, I think that would likely require us to be in a much more competent and responsive situation. Alternatively, if folks made other historic scientific breakthroughs guided by some model of how it helps prevent AI doom, I'd feel more like this power could be turned to relevant directions.
Anyway, these are some of the things I quickly generate, but I'm interested in what comes to your mind?
I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:
Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?
This question is not directed at anyone in particular, but I'd want to hear some alignment researchers answer it. As a rough guess, how much would it affect your research—in the sense of changing your priorities, or altering your strategy of impact, and method of attack on the problem—if you made any of the following epistemic updates?
(Feel free to disambiguate anything here that's ambiguous or poorly worded.)
... (read more)
- You update to think that AI takeoff will happen twice as slowly as your current best-estimate. e.g. instead of the peak-rate of yearly GWP growth bei
Will MIRI want to hire programmers once the pandemic is over? What kind of programmers? What other kinds of people do you seek to hire?