Philosopher David Chalmers asked:
[I]s there a canonical source for "the argument for AGI ruin" somewhere, preferably laid out as an explicit argument with premises and a conclusion?
Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight for someone will vary for each individual, and can be hard to accurately draw out.
That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on.
In The Basic Reasons I Expect AGI Ruin, I wrote:
When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems".
It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too.
STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences", though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as "lack of familiarity with the field".
A simple way of stating the argument in terms of STEM-level AGI is:
I'll say that the "invention of STEM-level AGI" is the first moment when an AI developer (correctly) recognizes that it can build a working STEM-level AGI system within a year. I usually operationalize "early STEM-level AGI" as "STEM-level AGI that is built within five years of the invention of STEM-level AGI".
I think humanity is very likely to destroy itself within five years of the invention of STEM-level AGI. And plausibly far sooner — e.g., within three months or a year of the technology's invention. A lot of the technical and political difficulty of the situation stems from this high level of time pressure: if we had decades to work with STEM-level AGI before catastrophe, rather than months or years, we would have far more time to act, learn, try and fail at various approaches, build political will, craft and implement policy, etc.
This argument focuses on "human survival", but from my perspective the more important claim is that STEM-level AGI systems very likely won't value awesome cosmopolitan outcomes at all. It's not just that we'll die; it's that there probably won't be anything else of significant value that the AGI creates in our place.
Elaborating on the five premises:
1. Substantial Difficulty of Averting Instrumental Pressures
In Superintelligence, Nick Bostrom defines an "Instrumental Convergence Thesis":
[A]s long as they possess a sufficient level of intelligence, agents having any of a wide range of final goals will pursue similar intermediary goals because they have instrumental reasons to do so.
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.
Bostrom distinguishes between "instrumental goals" and "final goals" ("terminal goals" in Yudkowsky's writing). I call the former "instrumental strategies" instead, to make it clearer that instrumental "goals" are just strategies for achieving ends.
For the argument to carry, it isn't sufficient to argue that STEM-level AGI systems exhibit instrumental convergence at all; they need to exhibit catastrophic instrumental convergence, i.e., a wide variety of ends need to imply strategies that kill all humans (given the opportunity).
One way of arguing for 1 is via these three subclaims:
1b. Goal-Oriented Systems Exhibit Catastrophic Instrumental Convergence. E.g., considering the instrumental strategies Superintelligence focuses on. For most states of the world you could ultimately be pushing toward (i.e., most "goals"), once you understand your situation well enough, you'll tend to want there to exist optimizers that share your goal ("self-preservation", "goal-content integrity") and you'll tend to want more power ("cognitive enhancement", "technological perfection"), and resources ("resource acquisition").
Humans are potential threats, and we consume (and are made out of) resources that can be put to other ends, so most goals that don't specifically value human welfare as an end will endorse the conditional strategy "if you see a sufficiently cheap and reliable way to kill all humans, take that opportunity".
1a and 1b suggest that if STEM-level AGI technology proliferates widely, we're dead (conditional on 2+3+4). If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a "pivotal act"). But:
1c. Averting Instrumental Pressures in Pivotal-Act-Enabling AGI is Substantially Difficult. It looks very difficult to safely perform a pivotal act with an AGI system that doesn't value human survival and flourishing as an end, because there's no obvious way to avoid dangerous instrumental strategies in systems that capable.
Substantial alignment breakthroughs are very likely required here (and in value loading, interpretability, etc.). We likely won't get such breakthroughs in time, though we should certainly put a huge effort into trying.
1a and 1b are in effect saying that the least informed and safety-conscious people in the world are likely to build AI systems with dangerous conditional incentives. If you don't try at all to instill the right goals into your STEM-level AGI systems, and don't otherwise try to avert these default instrumental pressures, then your systems will be catastrophically dangerous (if they become capable enough).
1c makes the much stronger claim that the most safety-conscious people will fail to avert these instrumental pressures, as a strong default. (Assuming they build AGI that's powerful enough to possibly be useful for a pivotal act or any similarly ambitious feat.)
Chalmers asked for "canonical (or at least MIRI-canonical) cases for the premises (esp 1, 2, and 5)", so I'll collect some sources for supporting arguments here, though I don't think there's a single "canonical" source. Many of the arguments support multiple premises or sub-premises, so there's some arbitrariness in where I mention these below.
I'm not aware of a good resource that fully captures the MIRI-ish perspective on 1a ("STEM-Level AGIs Exhibit Goal-Oriented Behavior by Default"), but from my perspective some of the key supporting arguments are:
Some sources discussing arguments for 1b ("Goal-Oriented Systems Exhibit Catastrophic Instrumental Convergence"):
AGI Ruin emphasizes that there's no impossibility in producing AGI minds with basically whatever properties you want; it just looks too difficult for humanity to do, under time pressure, given anything remotely like our current technical understanding, before AGI causes an existential catastrophe.
To a large extent the reason we think this is just the reason Nate Soares gives in Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome: "Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems." But we can say more than that about the shape of some of the difficulties. (Keeping in mind that we think many of the difficulties will turn out to be things that aren't on our radar today.)
Sources arguing for 1c ("Averting Instrumental Pressures in Pivotal-Act-Enabling AGI is Substantially Difficult"):
2. Substantial Difficulty of Value Loading
When I say that "value loading is difficult", I tend to distinguish four different claims:
If full value loading is going to be out of reach initially, then we can instead try to load enough goals into the first powerful AGI systems to at least cause them to not want to cause catastrophes (e.g., human extinction) while they're performing various powerful tasks for us. But:
2a gives us a reason to care about 2b: if AGI won't have our values by default, then the obvious response is to try to instill these values into the system. And 2b gives us a reason to care about 2c: if we can't have everything right off the bat, we can shoot for "enough to prevent disasters".
2b, in combination with 1+3+4+5 (and 2c), again gives us a reason to care about pivotal acts and thereby motivates 2d. If it's difficult to cause AGI systems to share our values, then (given 1, 3, etc.) we face an enormous danger from the first STEM-level AGI systems. This would hold even if 1c were false, since AGI tech will proliferate by default and, given wide access to AGI, sooner or later someone will run a powerful AGI without the safeties.
If we can use AGI to perform some pivotal act (or find some other way to pause AGI development and proliferation for as long as the research community needs), then we can take as much time as needed to nail down full value loading.
So the urgent priority is to find some way to be able to hit the brakes, either before humanity reaches STEM-level AGI, or before STEM-level AGI technology proliferates.
Some sources discussing arguments for 2a ("Values Aren't Shared By Default"):
Sources arguing for 2b ("Full Value Loading is Extremely Difficult"):
Sources arguing for 2c ("Sufficient-for-Safety Goal Loading is Substantially Difficult"):
Other arguments for 2d ("Pivotal Act Loading is Substantially Difficult"):
The argument for 2d heavily overlaps with the arguments for 2b and 2c. It matters for 2d what the range of plausible pivotal acts looks like, and we haven't published a detailed write-up on pivotal acts, though we discuss them a decent amount in the (lengthy) Late 2021 MIRI Conversations.
3. High Early Capabilities
I'll distinguish three subclaims:
"Early developers" again means "within five years of the invention of STEM-level AGI". In fact this needs to happen faster than that in order to support 3b and 3c:
3b. If Some Early Developers Can Do So, Many Early Developers Will Be Able To Do So. (Assuming the very first developers don't kill us first; and absent defeaters like an AGI-enabled pivotal act or a sufficiently heavy-duty globally enforced ban.)
As a strong default, AGI tech will spread widely quite quickly. So even if the first developers are cautious enough to avoid disaster, we'll face the issue that not everybody is cautious enough. And we'll likely face this issue within only a few months or years of STEM-level AGI's invention, which makes government responses and AGI-mediated pivotal acts far more difficult.
Another important claim I'd endorse is "early STEM-level AGIs will be capable enough to perform pivotal acts", but this is cause for hope rather than a distinct reason to worry (if you already accept 3a), so it isn't a supporting premise for this particular argument.
MIRI has never written a canonical "here are all the reasons we expect STEM-level AGI to be very powerful" argument. Some relevant sources for 3a ("Some Early Developers Will Be Able to Make Dangerously Capable STEM-Level AGIs") are:
2. The bottleneck on decisive strategic advantages is very likely cognition (of a deep and high-quality variety).
The challenge of building the aforementioned nanomachines is very likely bottlenecked on cognition alone. (Ribosomes exist, and look sufficiently general to open the whole domain to any mind with sufficient mastery of protein folding, and are abundant.)
In the modern world, significant amounts of infrastructure can be deployed with just an internet connection -- currency can be attained anonymously, humans can be hired to carry out various physical tasks (such as RNA synthesis) without needing to meet in person first, etc.
The laws of physics have shown themselves to be "full of exploitable hacks" (such as the harnessing of electricity to power lights in every home at night, or nuclear fission to release large amounts of energy from matter, or great feats of molecular-precision engineering for which trees and viruses provide a lower-bound).
3. The abilities of a cognitive system likely scale non-continuously with the depth and quality of the cognitions.
For instance, if you can understand protein folding well enough to get 90% through the reasoning of how your nanomachines will operate in the real world, that doesn't let you build nanomachines that have 90% of the impact of ones that are successfully built to carry out a particular purpose.
I expect I could do a lot with 100,000 trained-software-engineer-hours, that I cannot do with 1,000,000 six-year-old hours.
Some defeaters for 3a could include "STEM-level AGI is impossible (e.g., because there's something magical and special about human minds that lets us do science)", "there's no way to leverage (absolute or relative) intelligence to take over the world", and "early STEM-level AGIs won't be (absolutely or relatively) smart enough to access any of those ways".
I'd tentatively guess that "there will be lots of different STEM-level AGIs before any AGI can destroy the world" is false, but if it's true, I think to a first approximation this doesn't lower the probability of AGI ruin. This is because:
I view 3b ("If Some Early Developers Can Do So, Many Early Developers Will Be Able To Do So") and 3c ("If Many Early Developers Can Do So, Some Will Do So") as following from the normal way AI tech has proliferated over time: it didn't take 10 years for other groups to match GPT-3 or ChatGPT once they were deployed, and there are plenty of incautious people who think alignment is silly, so it seems inevitable that someone will deploy powerful misaligned AGI if no major coordination effort or pivotal-act-via-AGI blocks this.
4. Conditional Ruin
Premises 1–3 each begin with "As a strong default...", so one way to object to this premise is just to concede these are three "strong defaults", but say they aren't jointly strong enough to carry an "X is very likely" conclusion.
Depending on the conversational goal, I could respond by switching to a probabilistic argument, or by stipulating that "strong default" here means "strong enough to make premise 4 true".
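To make the "jointly strong enough" objection concrete, here is a minimal Python sketch of how confidence in three premises combines. The probabilities are made-up placeholders rather than estimates from this post, the function names are hypothetical, and treating the premises as independent is itself an assumption (the second function drops it and gives only a worst-case lower bound).

```python
from math import prod


def joint_if_independent(premise_probs):
    """Probability that every premise holds, ASSUMING the premises are independent."""
    return prod(premise_probs)


def joint_lower_bound(premise_probs):
    """Worst-case probability that every premise holds, with no independence assumption.

    This is the union bound: P(all hold) >= 1 - sum of P(each one fails).
    """
    return max(0.0, 1.0 - sum(1.0 - p for p in premise_probs))


# Placeholder numbers only (an assumption for illustration):
# three "strong defaults" held at 90% confidence each.
illustrative_probs = [0.9, 0.9, 0.9]

print(round(joint_if_independent(illustrative_probs), 3))
print(round(joint_lower_bound(illustrative_probs), 3))
```

With these placeholder inputs, the conjunction comes out at roughly 0.73 under independence and no lower than roughly 0.70 in the worst case. That is why it matters how strong each "strong default" is when asking whether the premises can jointly carry a "very likely" conclusion.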
Beyond that, I think this claim is fairly obvious at a glance.
"[T]here will be no alignment breakthroughs or global coordination breakthroughs before we invent STEM-level AGI" is obviously a lot stronger than the conclusion requires: seeing breakthroughs in either domain doesn't mean that the breakthroughs were sufficient to avert catastrophe. But I weakly predict that there in fact won't be any breakthroughs in either domain, so this unnecessarily strong premise seems like a fine starting point.
When stronger claims are justifiable but weaker claims are sufficient, bad outcomes look more overdetermined, which strengthens the case for thinking we're in a dire situation calling for an extraordinary response.
I don't think MIRI has written a centralized argument regarding 5. We're much more interested in intervening on it than in describing it, and if things are going well, it should look like a moving target.
We've written at least a little about why AGI timelines don't look super long to us, and we've written at greater length about why alignment seems to us to be moving too slowly — e.g., in On How Various Plans Miss the Hard Bits of the Alignment Challenge and AGI Ruin. Posts like Security Mindset and Ordinary Paranoia, Security Mindset and the Logistic Success Curve, and Brainstorm of Things That Could Force an AI Team to Burn Their Lead help paint a qualitative picture of how hard we think it would be to actually succeed in STEM-level AGI alignment, and therefore how overdetermined failure looks.
The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society. The bar for a sufficient response seems high, and the responses required are unusual and extreme, with a high need for proactive rather than reactive action in the world.
Our arguments for discontinuous and rapid AI capability gains are possibly the main reason we're more pessimistic than others about governments responding well. We also have unusually high baseline pessimism about government sanity by EA standards, but I don't think this is the main source of model disagreement.
Other options include Joe Carlsmith's Is Power-Seeking AI an Existential Risk? (which Nate Soares replied to here) and Katja Grace's Argument for AI X-Risk from Competent Malign Agents.
Note that I'm releasing this post without waiting on other MIRI staff to endorse it or make changes, so this can be treated as my own attempt to build a structured argument, rather than as something Eliezer, Nate, Benya, or others would necessarily endorse.
Like "AGI", "STEM-level AGI" lacks a formal definition. (If we did have a deep formal understanding of reasoning about the physical world, we would presumably be able to do many feats with AI that we cannot do today.)
Absent such a definition, however, we shouldn't ignore the observed phenomenon that there's a certain kind of problem-solving ability (observed in humans) that generalizes to inventing steam engines and landing on the Moon, even though our brains didn't evolve under direct selection pressure to start industrial revolutions or visit other planets, and even though birds and nematodes can't invent steam engines or land on the Moon.
We can then ask what happens when we find a way to automate this kind of problem-solving ability.
"The basic mental machinery" is vague, and maybe some would argue that GPT-4 already has all of the right "mental machinery" in some sense, in spite of its extremely limited ability to do novel STEM work in practice. (I disagree with this claim myself.)
E.g., some might analogize GPT-4 to a human child: a sufficiently young John von Neumann will lack some "basic mental machinery" required for STEM reasoning, but will at least have meta-machinery that will predictably unfold into the required machinery via normal brain development and learning.
(And, indeed, the difference between "having the basic mental machinery for STEM" and "having meta-machinery that will predictably unfold into the basic mental machinery" may not be a crisp one. Even the adult von Neumann presumably continued to upgrade his own general problem-solving software via adopting new and better heuristics.)
I don't think that GPT-4 in fact has all of the basic mental machinery or meta-machinery for STEM, and I don't personally think that comparing GPT-4 to a human child is very illuminating. I'm also not confident one way or the other about whether GPTs will scale to "as good at science as smart humans".
That said, since people can disagree about the nature of general intelligence and about what's actually going on in humans or AI systems when we do scientific work, it might be helpful to instead define "STEM-level" AI as AI technology that can (e.g.) match smart human performance in a specific hard science field, across all the scientific work humans do in that field.
As a strong default, I expect AI with that level of capability to be able to generalize to all the sciences, and to reasoning about any other topic humans can reason about; and that level of generality and capability seems to me to be the level where we face AI-mediated extinction risks.
The Arbital articles I link in this post, and most of the AI alignment content on Arbital, were written by Eliezer Yudkowsky in 2015–2017. I consider this one of the best online resources regarding AI alignment, though a lot of it is relatively unedited or incomplete.
If human whole-brain emulation is built before (or shortly after) STEM-level AGI, and this allows us to run human minds at faster speeds, then this opens up a lot more possibility for things to occur "early" (as measured in sidereal time).
It might even be possible to solve coherent extrapolated volition within five sidereal years of the invention of STEM-level AGI. (Though if so, I'm imagining this happening via ems and AI systems achieving feats that might have otherwise taken thousands of years of work, including enormous amounts of work gaining a mature understanding of the human mind, iteratively improving the ems' speed and reasoning abilities, and very carefully and conservatively ratcheting up the capabilities of AI systems — and widening the set of tasks we can safely use them for — as we gain more mastery of alignment.)
To be clear: I'd consider it an obviously terrible idea, bordering on suicidal, to gamble the future on a pivotal act that does no monitoring or intervening in the wider world for five entire years after the invention of STEM-level AGI. I'd say that one year is already taking on a lot of risk, and three years is clearly too long.
But at the point where safety-conscious AGI developers are being cheaply run at 1000x speed relative to all the non-safety-conscious AGI developers, monitoring the world for planet-endangering threats (and intervening if necessary) is probably reasonably trivial. The hard part is getting to whole-brain emulation (and powerful hardware for running the ems) in the first place.
This is not, of course, to say that "AGI can achieve decisive strategic advantage within five years" is necessary for the AGI situation to be dire.
Also, "human survival" is a phrase some transhumanists (myself included) will object to as ambiguous. I think involuntary human death is bad, but I think it's probably good if we voluntarily upload ourselves and develop into cool posthumans, regardless of whether that counts as biological "death" or "extinction" in some purely technical sense.
I use the phrase "human survival" in spite of all these issues because I (perhaps wrongly) imagine that Chalmers is looking for an argument that a wide variety of non-transhumanists will immediately see the importance of. Ordinary people can clearly see that it's bad for AI to kill them and their loved ones (and can see why this is bad), without any need to wade into deep philosophical debates or utopia-crafting.
Focusing on something more abstract risks misleading people about the severity of the risk ("surely if you had something that scary in mind, you'd blurt it out rather than burying the lede"), and also about its nature ("surely if you thought AI would literally just kill everyone, you'd say that"). If I instead mostly worried about AI disaster scenarios where AI doesn't literally kill everyone, I'd talk about those instead.
In principle one could make a simpler argument for pivotal acts by just saying "World-destroyingly-powerful AGI technology will proliferate by default, and if everyone has the ability to destroy the world then someone will inevitably do it on purpose".
But in reality the situation is far worse than that, because even if we could limit AGI access to people who would never deliberately use AGI to try to do evil, AGI systems' own default incentives make them extremely dangerous. Moreover, this issue blocks our ability to safely use AGI for pivotal acts as well.
It does matter that the system be able to generate hypotheses and instrumental strategies concerning the physical world; but the system's terminal goal doesn't need to concern the physical world in order for the system to care about steering the physical world. E.g., a system that just wants its mind to be in a certain state will care about its hardware (since changes in hardware state will affect its mind), which means caring about everything in the larger world that could potentially affect its hardware.
Cf. Microscope AI in Hubinger's An Overview of 11 Proposals for Building Safe Advanced AI.
Microscope AI also involves "using transparency tools to verify that the model isn’t performing any optimization", but part of my argument here is that it's extremely unlikely we'll be able to get major new scientific/predictive insights from AI without it doing any "optimizing". However, we might in principle be able to verify that the AI isn't doing too much optimizing, or optimizing in the wrong directions, or optimizing over relatively risky domains, etc. In any case, we can consider the wider space of strategies that involve inspecting the AI's mind as an alternative to using conventional outputs of the system.
If operators have enough visibility into the AGI's mind, and enough deep understanding and useful tools for making sense of all important information in that mind, then in principle "do useful science by looking at the AGI's mind rather than by giving it an output channel" can prevent any catastrophes that result from the AGI deliberately optimizing against human interests.
(Though we would still need to find ways to get the AGI to do specific useful cognitions and not just harmful ones. And also, if you have that much insight into the AGI's mind and can get it to think useful and relevant thoughts at all, then you may be able to avoid Microscope-AI approaches, by trusting the AI's outputs so long as it hasn't had any dangerous thoughts anywhere causally upstream of the outputs.)
In real life, however, it's very unlikely that we'll have that level of mastery of the first STEM-level AGI systems. If we only have partial visibility and understanding of the AGI's mind, then Microscope AI can in principle just be used by the AGI as another output channel, particularly if it learns or deduces things about which parts of its mind we're inspecting, how we tend to interpret different states of its brain, etc. This is a more constrained problem from the AI's perspective, but it still seems to demand some very difficult alignment breakthroughs for humanity to perform a pivotal act by this method.
Note that "sphexish" isn't an all-or-nothing property, and if you zoom in on any agentic brain in enough detail, you should expect the parts to eventually start looking more sphexish. This is because "agency" isn't a primitive property, but rather arises from the interaction of many gears, and sufficiently small gears will do things more automatically, without checking first to take into account context, etc.
The important question is: "To what extent do these sphex-like gears assemble into something that's steering toward outcomes at the macro-level, versus assembling into something that's more sphex-like at the macro-level?"
Quoting Yudkowsky in Ngo and Yudkowsky on Alignment Difficulty: "[A]n earlier part of the path [to building AGI systems that exhibit dangerous means-ends reasoning, etc.] is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together".
Quoting Yudkowsky in Ngo and Yudkowsky on Scientific Reasoning and Pivotal Acts: "[...] Despite the inevitable fact that some surprises of this kind now exist, and that more such surprises will exist in the future, it continues to seem to me that science-and-engineering on the level of 'invent nanotech' still seems pretty unlikely to be easy to do with shallow thought, by means that humanity discovers before AGI tech manages to learn deep thought?
"What actual cognitive steps? Outside-the-box thinking, throwing away generalizations that governed your previous answers and even your previous questions, inventing new ways to represent your questions, figuring out which questions you need to ask and developing plans to answer them; these are some answers that I hope will be sufficiently useless to AI developers that it is safe to give them, while still pointing in the direction of things that have an un-GPT-3-like quality of depth about them.
"Doing this across unfamiliar domains that couldn't be directly trained in by gradient descent because they were too expensive to simulate a billion examples of[.]
"If you have something this powerful, why is it not also noticing that the world contains humans? Why is it not noticing itself?"
Issues that are visible today probably won't spontaneously solve themselves without a serious technical effort, but new obstacles can certainly crop up. (See the discussion of software development hell and robust-software-in-particular hell in The Basic Reasons I Expect AGI Ruin, and the "rocket-accelerating cryptographic Neptune probe" analogy in So Far: Unfriendly AI Edition.)
Note that "share all of our core values" is imprecise: what makes a value "core" in the relevant sense? How do we enable moral progress, and avoid locking in our current flawed values? It's an extremely thorny problem. I endorse coherent extrapolated volition as a good (very high-level and abstract) description of desiderata for a solution. On LessWrong and Arbital, the phrase "humane values" is often used to specifically point at "the sort of values we ought to want to converge on eventually", as opposed to our current incomplete and flawed conceptions of what's morally valuable, aesthetically valuable, etc.
Note also that the challenge here is causing AGI systems to consistently optimize for humane values; it's not merely to cause AGI systems to understand our values. The latter is far easier, because it doesn't depend on the AGI's goals; a sufficiently capable paperclip maximizer would also want to understand human goals, if its environment contained humans.
"Safely" doesn't necessarily require that the AGI terminally values human survival. I'd put more probability on AGI systems being safe if they aren't internally representing humans at all, with safety coming from this fact in combination with other alignment measures.
This doesn't rule out that some responses to arguments are more common than others; and indeed, we should expect sufficiently capable minds to converge on similar responses to things like "valid logical arguments", since accepting such arguments is very useful for being "sufficiently capable".
The problem is that sufficiently capable reasoners don't converge on accepting human morality. "Accept valid logical arguments" is useful for nearly all ambitious real-world ends, so we should expect it to arise relatively often as an instrumental strategy and/or as a terminal goal. "Care for humans" is useful for a far smaller range of ends.
Some relevant passages, discussing evolved aliens and then artificial minds:
"[...] I think my point estimate there is 'most aliens are not happy to see us', but I’m highly uncertain. Among other things, this question turns on how often the mixture of 'sociality (such that personal success relies on more than just the kin-group), stupidity (such that calculating the exact fitness-advantage of each interaction is infeasible), and speed (such that natural selection lacks the time to gnaw the large circle of concern back down)' occurs in intelligent races’ evolutionary histories.
"These are the sorts of features of human evolutionary history that resulted in us caring (at least upon reflection) about a much more diverse range of minds than 'my family', 'my coalitional allies', or even 'minds I could potentially trade with' or 'minds that share roughly the same values and faculties as me'.
"Humans today don’t treat a family member the same as a stranger, or a sufficiently-early-development human the same as a cephalopod; but our circle of concern is certainly vastly wider than it could have been, and it has widened further as we’ve grown in power and knowledge.
"[... T]he development process of misaligned superintelligent AI is very unlike the typical process by which biological organisms evolve.
"Some relatively important differences between intelligences built by evolution-ish processes and ones built by stochastic-gradient-descent-ish processes:
"• Evolved aliens are more likely to have a genome/connectome split, and a bottleneck on the genome.
"• Aliens are more likely to have gone through societal bottlenecks.
"• Aliens are much more likely the result of optimizing directly for intergenerational prevalence. The shatterings of a target like 'intergenerational prevalence' are more likely to contain overlap with the good stuff, compared to the shatterings of training for whatever-training-makes-the-AGI-smart-ASAP. (Which is the sort of developer goal that’s likely to win the AGI development race and kill humanity first.)
"Evolution tends to build patterns that hang around and proliferate, whereas AGIs are likely to come from an optimization target that's more directly like 'be good at these games that we chose with the hope that being good at them requires intelligence', and the shatterings of the latter are less likely to overlap with our values."
A version of The Hidden Complexity of Wishes also appears in Complex Value Systems Are Required to Realize Valuable Futures.
Note that this is separate from the issue that it's hard to instill particular goals into powerful AGI systems at all. This point is discussed more in AGI Ruin.
Summarizing the relevant items:
3: "We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence". This makes it more difficult to achieve any desired property in STEM-level AGI.
5 and 6: "We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so." If the system is weak, then flaws in its goals like "be low-impact" or "don't hurt humans" matter less. But we need at least one system strong enough to help in some pivotal act (unless we find some way to globally limit AGI proliferation without the help of STEM-level AGI), which makes it far more dangerous if its goals are flawed.
10: "Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you."
12: "Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes. Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch."
13 and 14: "Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability."
15: "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously."
16: "Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments."
17: "[O]n the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over."
18: "[I]f you show an agent a reward signal that's currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal".
19: "More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment".
20: "Human operators are fallible, breakable, and manipulable."
21 and 22: "When you have a wrong belief, reality hits back at your wrong predictions. [...] Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases." Thus "Capabilities generalize further than alignment once capabilities start to generalize far."
Section B.3 (25–33): Sufficiently good and useful transparency / interpretability seems extremely difficult.
"Why? Because things in the capabilities well have instrumental incentives that cut against your alignment patches. Just like how your previous arithmetic errors (such as the pebble sorters on the wrong side of the Great War of 1957) get steamrolled by the development of arithmetic, so too will your attempts to make the AGI low-impact and shutdownable ultimately (by default, and in the absence of technical solutions to core alignment problems) get steamrolled by a system that pits those reflexes / intuitions / much-more-alien-behavioral-patterns against the convergent instrumental incentive to survive the day."
Quoting from footnote 3 of A central AI alignment problem: capabilities generalization, and the sharp left turn: "Note that this is consistent with findings like 'large language models perform just as well on moral dilemmas as they perform on non-moral ones'; to find this reassuring is to misunderstand the problem. Chimps have an easier time than squirrels following and learning from human cues. Yet this fact doesn't particularly mean that enhanced chimps are more likely than enhanced squirrels to remove their hunger drives, once they understand inclusive genetic fitness and are able to eat purely for reasons of fitness maximization. Pre-left-turn AIs will get better at various 'alignment' metrics, in ways that I expect to build a false sense of security, without addressing the lurking difficulties."
The kinds of capabilities we expect to be needed for a pivotal act are similar to those required for the strawberry problem ("Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level."). Yudkowsky's unfinished Zermelo-Fraenkel provability oracle draft makes the specific claim that powerful theorem-proving wouldn't help save the world.
Cf. AGI Ruin, point 4.
I think the easiest pivotal acts are somewhat harder than the easiest strategies a misaligned AGI could use to seize power; but (looking only at capability and not alignability) I expect AGI to achieve both capabilities at around the same time, coinciding with (or following shortly after) the invention of STEM-level AGI.
STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences"
This definition seems very ambiguous to me, and I've already seen it confuse some people. Since the concept of a "STEM-level AGI" is the central concept underpinning the entire argument, I think it makes sense to spend more time making this definition less ambiguous.
Some specific questions:
Does "par-human reasoning" mean at the level of an individual human or at the level of all of humanity combined?
If it's the former, what human should we compare it against? 50th percentile? 99.999th percentile?
I partly answered that here, and I'll edit some of this into the post:
By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but the threshold I have in mind is more like:
STEM-level AGI is AI that's at least as scientifically productive as a human scientist who makes a variety of novel, original contributions to a hard-science field that requires understanding the physical world well. E.g., it can go toe-to-toe with highly productive human scientists on applying its abstract theories to real-world phenomena, using scientific ideas to design new tech, designing physical experiments, operating equipment, and generating new ideas that turn out to be true and that importantly advance the frontiers of our knowledge.
The way I'm thinking about the threshold, AI doesn't have to be Nobel-prize-level, but it has to be "fully doing science". I'd also be happy with a definition like 'AI that can reason about the physical world in general', but I think that emphasizing hard-science tasks makes it clearer why I'm not thinking of GPT-4 as 'reasoning about the physical world in general' in the relevant sense.
I'm not sure what the right percentile to target here is -- maybe we should be looking at the top 5% of Americans with STEM PhDs, where Americans with STEM PhDs are themselves maybe in the top 1% of STEM ability among Americans?
What is the "basic mental machinery" required to do par-human reasoning? What if a system has the basic mental machinery but not the more advanced mental machinery?
Do you want this to include the robotic capabilities to run experiments and use physical tools? If not, why not (that seems important to me, but maybe you disagree)?
I want it to include the ability to run experiments and use physical tools.
I don't know what the "basic mental machinery" required is -- I think GPT-4 is missing some of the basic cognitive machinery top human scientists use to advance the frontiers of knowledge (as opposed to GPT-4 doing all the same mental operations as a top scientist but slower, or something), but this is based on a gestalt impression from looking at how different their outputs are in many domains, not based on a detailed or precise model of how general intelligence works.
One way of thinking about the relevant threshold is: if you gave a million chimpanzees billions of years to try to build a superintelligence, I think they'd fail, unless maybe you let them reproduce and applied selection pressure to them to change their minds. (But the latter isn't something the chimps themselves realize is a good idea.)
In contrast, top human scientists pass the threshold 'give us enough time, and we'll be able to build a superintelligence'.
If an AI system, given enough time and empirical data and infrastructure, would eventually build a superintelligence, then I'm mostly happy to treat that as "STEM-level AGI". This isn't a necessary condition, and it's presumably not strictly sufficient (since in principle it should be possible to build a very narrow and dumb meta-learning system that also bootstraps in this way eventually), but it maybe does a better job of gesturing at where I'm drawing a line between "GPT-4" and "systems in a truly dangerous capability range".
(Though my reason for thinking systems in that capability range are dangerous isn't centered on "they can deliberately bootstrap to superintelligence eventually". It's far broader points like "if they can do that, they can probably do an enormous variety of other STEM tasks" and "falling exactly in the human capability range, and staying there, seems unlikely".)
Does a human count as a STEM-level NGI (natural general intelligence)?
I tend to think of us that way, since top human scientists aren't a separate species from average humans, so it would be hard for them to be born with complicated "basic mental machinery" that isn't widespread among humans. (Though local mutations can subtract complex machinery from a subset of humans in one generation, even if they can't add complex machinery to a subset of humans in one generation.)
Regardless, given how I defined the term, at least some humans are STEM-level.
If so, doesn't that imply that we should already be able to perform pivotal acts? You said: "If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a "pivotal act")."
The weakest STEM-level AGIs couldn't do a pivotal act; the reason I think you can do a pivotal act within a few years of inventing STEM-level AGI is that I think you can quickly get to far more powerful systems than "the weakest possible STEM-level AGIs".
The kinds of pivotal act I'm thinking about often involve Drexler-style feats, so one way of answering "why can't humans already do pivotal acts?" might be to answer "why can't humans just build nanotechnology without AGI?". I'd say we can, and I think we should divert a lot of resources into trying to do so; but my guess is that we'll destroy ourselves with misaligned AGI before we have time to reach nanotechnology "the hard way", so I currently have at least somewhat more hope in leveraging powerful future AI to achieve nanotech.
(The OP doesn't really talk about this, because the focus is 'is p(doom) high?' rather than 'what are the most plausible paths to us saving ourselves?'.)
In an unpublished 2017 draft, a MIRI researcher and I put together some ass numbers regarding how hard (wet, par-biology) nanotech looked to us:
We believe that the bottlenecks on current progress toward par-biology nanotechnology are (a) figuring out how to put all of the puzzle pieces together correctly, (b) executing certain difficult computations required for determining how to build materials, and (c) engineering certain basic tools that will allow us to engineer better tools, where there are likely to be mutual dependencies between progress on these fronts. If the world’s top scientific and engineering talent were actively focusing on this application and were inspired to solve the key technical problems, we would expect it to be possible to push past these bottlenecks with no more than 10x the compute that Google spent on research projects in 2016.
Assuming no advances in AI algorithms over the state of the art in 2017, we would assign a 50% probability to fifty copies of John von Neumann, divided into five teams and supplied with a large number of lab technicians and other support staff, being able to achieve nanotechnology within 25 calendar years at a level that would be sufficient for a decisive advantage if the technology were available to a group in 2017.
(footnote: We stipulate “in 2017” because we would not necessarily expect par-biology nanotechnology to confer a decisive advantage in a world where nanotechnology had been gradually advanced to that level by human engineers over multiple decades; in that scenario, factors such as leaks, regulations, and competition from other developers would make it harder for one group to strongly pull ahead. We would expect it to be much easier for one group to strongly pull ahead if nanotechnology advances too quickly for leaks, regulations, and competition to be significant factors on the relevant timescale, as we believe is possible using AGI.)
Translating this into a more realistic scenario: we would assign a 40% probability to an organization with a $10 billion budget and the involvement of someone who can attract top researchers and leadership (e.g., Elon Musk) being able to reach this level of technological capability within 25 years, absent AI advances. Our probability would lower to 15% if there were only 10 calendar years available to the hypothetical Musk project instead of 25, and would rise to 85% if there were 50 calendar years and $20 billion available instead of 25 calendar years and $10 billion, holding these conditions stable and assuming no other large global disruptions.
As in §1.3, the predictions here are rough and intuitive, and were not generated by a formal model. It would be difficult for our probability to rise much higher than 85% given additional time or other resources. Our inside-view evaluation of the arguments assigns high probability to par-biology nanotechnology being achievable in fifty years under these idealized conditions, such that the remaining uncertainty in our informal aggregate models largely stems from model uncertainty and deference to experts who disagree with our view and consider par-biology nanotechnology much more difficult. We would be very surprised to learn that par-biology nanotechnology were much more difficult (say, requiring more than 500 VNG research years), and this would have a fairly large impact on our overall expectations about early AGI systems’ potential uses and impact.
(500 VNG research years = 500 von-Neumann-group research years, defined as 'how much progress ten copies of John von Neumann would make if they worked together on the problem, hard, for 500 serial years'.)
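To make the units concrete, here is a rough back-of-the-envelope sketch in Python. It converts the draft's "fifty copies of John von Neumann, divided into five teams, for 25 calendar years" scenario into VNG research years and compares it to the 500-VNG-research-year figure. The function name and the assumption that effort adds linearly across groups are mine, purely for illustration; the draft itself says its numbers were not generated by a formal model.

```python
# One von Neumann group (VNG) is ten copies working together (per the definition above).
COPIES_PER_GROUP = 10


def vng_research_years(copies: int, serial_years: float) -> float:
    """Convert (copies, serial years) into VNG research years.

    Illustrative assumption (not from the draft): effort adds linearly
    across groups, so five groups make five times the progress of one.
    """
    groups = copies / COPIES_PER_GROUP
    return groups * serial_years


# The draft's 50%-probability scenario: fifty copies for 25 calendar years.
baseline = vng_research_years(copies=50, serial_years=25)

# The "we would be very surprised" threshold quoted from the draft.
surprise_threshold = 500.0

print(baseline)                       # 125.0
print(surprise_threshold / baseline)  # 4.0
```

Under that linear-scaling assumption, the "very surprised" threshold sits at about four times the effort of the 50%-probability scenario.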
This is also why I think humanity should probably put lots of resources into whole-brain emulation: I don't think you need qualitatively superhuman cognition in order to get to nanotech; I think we're just short on time given how slowly whole-brain emulation has advanced thus far.
With STEM-level AGI I think we'll have more than enough cognition to do basically whatever we can align; but given how tenuous humanity's grasp on alignment is today, it would be prudent to at least take a stab at a "straight to whole-brain emulation" Manhattan Project. I don't think humanity as it exists today has the tech capabilities to hit the pause button on ML progress indefinitely, but I think we could readily do that with "run a thousand copies of your top researchers at 1000x speed" tech.
(Note that having dramatically improved hardware to run a lot of ems very fast is crucial here. This is another reason the straight-to-WBE path doesn't look hopeful at a glance, and seems more like a desperation move to me; but maybe there's a way to do it.)