AGI safety from first principles: Introduction

This is the first part of a six-part report called AGI safety from first principles, in which I've attempted to put together the most complete and compelling case I can for why the development of AGI might pose an existential threat. The report stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people's arguments, but as this report has grown, it's become more representative of my own views and less representative of anyone else's. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI - one which doesn't take any previous claims for granted, but attempts to work them out from first principles.

Having said that, the breadth of the topic I'm attempting to cover means that I've included many arguments which are only hastily sketched out, and undoubtedly a number of mistakes. I hope to continue polishing this report, and I welcome feedback and help in doing so. I'm also grateful to many people who have given feedback and encouragement so far. I plan to cross-post some of the most useful comments I've received to the Alignment Forum once I've had a chance to ask permission. I've posted the report itself in six sections; the first and last are shorter framing sections, while the middle four correspond to the four premises of the argument laid out below.

AGI safety from first principles

The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth's second most powerful "species", and lose the ability to create a valuable and worthwhile future.

I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:

  1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).
  2. Those AIs will be autonomous agents which pursue large-scale goals.
  3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.
  4. The development of such AIs would lead to them gaining control of humanity’s future.

While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.


  1. Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible. ↩︎


Early work tends to be less relevant in the context of modern machine learning

I'm curious why you think the orthogonality thesis, instrumental convergence, the treacherous turn or Goodhart's law arguments are less relevant in the context of modern machine learning. (We can use here Facebook's feed-creation-algorithm as an example of modern machine learning, for the sake of concreteness.)

I'm excited about this sequence!

Just a question: what audience do you have in mind? Is it a sequence for newcomers to AI Safety, or more a reframing of AI Safety arguments for researchers?

Planned summary of this sequence for the Alignment Newsletter:

This sequence presents the author’s personal view on the current best arguments for AI risk, explained from first principles (that is, without taking any previous claims for granted). The argument is a specific instantiation of the _second species argument_ that sufficiently intelligent AI systems could become the most intelligent species, in which case humans could lose the ability to create a valuable and worthwhile future.

We should clarify what we mean by superintelligence, and how it might arise. The author considers intelligence as quantifying simply whether a system “could” perform a wide range of tasks, separately from whether it is motivated to actually perform those tasks. In this case, we could imagine two rough types of intelligence. The first type, epitomized by most current AI systems, trains an AI system to perform many different tasks, so that it is then able to perform all of those tasks; however, it cannot perform tasks it has not been trained on. The second type, epitomized by human intelligence and <@GPT-3@>(@Language Models are Few-Shot Learners@), trains AI systems in a task-agnostic way, such that they develop general cognitive skills that allow them to solve new tasks quickly, perhaps with a small amount of training data. This second type seems particularly necessary for tasks where data is scarce, such as the task of being a CEO of a company. Note that these two types should be thought of as defining a spectrum, not a binary distinction, since the type of a particular system depends on how you define your space of “tasks”.

How might we get AI systems that are more intelligent than humans? Assuming we can get a model to human-level intelligence, there are then three key advantages of an AI system that would allow it to go further. First, they can be easily replicated, suggesting that we could get a _collective_ superintelligence via a collection of replicated AI systems working together and learning from each other. Second, there are no limits imposed by biology, and so we can e.g. make the models arbitrarily large, unlike with human brains. Finally, the process of creation of AI systems will be far better understood than that of human evolution, and AI systems will be easier to directly modify, allowing for AI systems to recursively improve their own training process (complementing human researchers) much more effectively than humans can improve themselves or their children.

The second species argument relies on the argument that superintelligent AI systems will gain power over humans, which is usually justified by arguing that the AI system will be goal-directed. Making this argument more formal is challenging: the EU maximizer framework <@doesn’t work for this purpose@>(@Coherent behaviour in the real world is an incoherent concept@) and applying the intentional stance only helps when you have some prior information about what goals the AI system might have, which begs the question.

The author decides to instead consider a more conceptual, less formal notion of agency, in which a system is more goal-directed the more its cognition has the following properties: (1) self-awareness, (2) planning, (3) judging actions or plans by their consequences, (4) being sensitive to consequences over large distances and long time horizons, (5) internal coherence, and (6) flexibility and adaptability. (Note that this can apply to a single unified model or a collective AI system.) It’s pretty hard to say whether current training regimes will lead to the development of these capabilities, but one argument for it is that many of these capabilities may end up being necessary prerequisites to training AI agents to do intellectual work.

Another potential framework is to identify a goal as some concept learned by the AI system, that then generalizes in such a way that the AI system pursues it over longer time horizons. In this case, we need to predict what concepts an AI system will learn and how likely it is that they generalize in this way. Unfortunately, we don’t yet know how to do this.

What does alignment look like? The author uses <@intent alignment@>(@Clarifying "AI Alignment"@), that is, the AI system should be “trying to do what the human wants it to do”, in order to rule out the cases where the AI system causes bad outcomes through incompetence where it didn’t know what it was supposed to do. Rather than focusing on the outer and inner alignment decomposition, the author prefers to take a holistic view in which the choice of reward function is just one (albeit quite important) tool in the overall project of choosing a training process that shapes the AI system towards safety (either by making it not agentic, or by shaping its motivations so that the agent is intent aligned).

Given that we’ll be trying to build aligned systems, why might we still get an existential catastrophe? First, a failure of alignment is still reasonably likely, since (1) good behavior is hard to identify, (2) human values are complex, (3) influence-seeking may be a useful subgoal during training, and thus incentivized, (4) it is hard to generate training data to disambiguate between different possible goals, and (5) while interpretability could help, it seems quite challenging. Then, given a failure of alignment, the AI systems could seize control via the mechanisms suggested in <@What failure looks like@> and Superintelligence. How likely this is depends on factors like (1) takeoff speed, (2) how easily we can understand what AI systems are doing, (3) how constrained AI systems are at deployment, and (4) how well humanity can coordinate.

Planned opinion:

I like this sequence: I think it’s a good “updated case” for AI risk that focuses on the situation in which intelligent AI systems arise through training of ML models. The points it makes are somewhat different from the ones I would make if I were writing such a case, but I think they are still sufficient to make the case that humanity has work to do if we are to ensure that AI systems we build are aligned.

Note: There is currently a lot of stuff I want to cover in the newsletter, so this will probably go out in the 10/21 newsletter.

Thanks! Good summary. A couple of quick points:

  • "that is, without relying on other people’s arguments" doesn't feel quite right to me, since obviously a bunch of these arguments have been made before. It's more like: without taking any previous claims for granted.
  • "there are then three key advantages of an AI system that would allow it to go further" Your list of 3 differs from my list of 3. Also, my list is not of key advantages, but of features which don't currently contribute to AI progress but will after we've got human-level AGI. I think that AIs will also have advantages over humans in data, compute and algorithms, which are the features that currently contribute to AI progress; and if I had to pick, I'd say data+compute+algorithms are more of an advantage than replication+cultural learning+recursive improvement. But I focus on the latter because they haven't been discussed as much.

"that is, without relying on other people’s arguments" doesn't feel quite right to me, since obviously a bunch of these arguments have been made before. It's more like: without taking any previous claims for granted.

Changed, though the way I use words those phrases mean the same thing.

Your list of 3 differs from my list of 3.

Yeah this was not meant to be a direct translation of your list. (Your list of 3 is encompassed by my first and third point.) You mentioned six things:

more compute, better algorithms, and better training data

and 

replication, cultural learning, and recursive improvement

which I wanted to condense. (The model size point was meant to capture the compute case.) I did have a lot of trouble understanding what the point of that section was, though, so it's plausible that I've condensed it poorly for whatever point you were making there.

Perhaps the best solution is to just delete that particular paragraph? As far as I can tell, it's not relevant to the rest of the arguments, and this summary is already fairly long and somewhat disjointed.

I did have a lot of trouble understanding what the point of that section was, though, so it's plausible that I've condensed it poorly for whatever point you were making there.

My thinking here is something like: humans became smart via cultural evolution, but standard AI safety arguments ignore this fact. When we think about AI progress from this perspective though, we get a different picture of the driving forces during the takeoff period. In particular, the three things I've listed are all ways that interactions between AGIs will be crucial to their capabilities, in addition to the three factors which are currently crucial for AI development.

Will edit to make this clearer.

I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming to be both a good introductory sequence and "the most complete and compelling case I can for why the development of AGI might pose an existential threat".

The question is who this sequence is for, what its goal is, and how it compares to other writing targeting similar demographics.

Some writing that comes to mind to compare/contrast it with includes:

  • Scott Alexander's Superintelligence FAQ. This is the post I've found most helpful for convincing people (including myself), that yes, AI is just actually a big deal and an extinction risk. It's 8000 words. It's written fairly entertainingly. What I find particularly compelling here are a bunch of factual statements about recent AI advances that I hadn't known about at the time.
  • Tim Urban's Road To Superintelligence series. This is even more optimized for entertainingness. I recall it being a bit more handwavy and making some claims that were either objectionable, or at least felt more objectionable. It's 22,000 words.
  • Alex Flint's AI Risk for Epistemic Minimalists. This goes in a pretty different direction – not entertaining, and not really comprehensive either. It came to mind because it's doing a sort-of-similar thing of "remove as many prerequisites or assumptions as possible". (I'm not actually sure it's that helpful: the specific assumptions it's avoiding making don't feel like issues I expect to come up for most people, and then it doesn't make a very strong claim about what to do.)

(I recall Scott Alexander once trying to run a pseudo-study where he had people read a randomized intro post on AI alignment, I think including his own Superintelligence FAQ and Tim Urban's posts among others, and see how it changed people's minds. I vaguely recall it didn't find that big a difference between them. I'd be curious how this compared.)

At a glance, AGI Safety From First Principles seems to be more complete than Alex Flint's piece, and more serious/a-bit-academic than Scott or Tim's writing. I assume it's aiming for a somewhat skeptical researcher, and is meant to not only convince them the problem exists, but give them some technical hooks of how to start thinking about it. I'm curious how well it actually succeeds at that.

Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it's one of the best plain language introductions to AI safety that I've seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.

Critch recently made the argument (and wrote it in his ARCHES paper, summarized by Rohin here) that "AI safety" is a straightforwardly misleading name because "safety" is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: "the term “AI safety” should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope".) I wanted to raise the idea here and ask Richard what he thinks about renaming it to something like "AI existential safety from first principles" or "AI as an extinction risk from first principles" or "AI alignment from first principles".

Yeah, this seems like a reasonable point. But I'm not that much of a fan of the alternatives you suggest. What do you think about "AGI safety"?

Oli suggests that there are no fields with three-word-names, and so "AI Existential Risk" is not a choice. I think "AI Alignment" is the currently most accurate name for the field that encompasses work like Paul's and Vanessa's and Scott/Abram's and so on. I think "AI Alignment From First Principles" is probably a good name for the sequence.

It seems a definite improvement on the axis of specificity, and I do prefer it over the status quo for that reason.

But it doesn't address the problem of scope-sensitivity. I don't think this sequence is about preventing medium-sized failures from AGI. It's about preventing extinction-level risks to our future.

"A First-Principles Explanation of the Extinction-Level Threat of AGI: Introduction"

"The AGI Extinction Threat from First Principles: Introduction"

"AGI Extinction From First Principles: Introduction"

Yeah, I agree that's a problem. But I don't think it's a big problem, because who's talking about medium-size risks from AGI?

In particular, the flag I want to plant is something like: "when you're talking about AGI, it's going to be So Big that existential safety is the default type of safety to be concerned with."

Also I think having the big EXTINCTION in the title costs weirdness points, because even within the field people don't use that word very much. So I'm leaning towards AGI safety.

A year later, as we consider this for the 2020 Review, I think figuring out a better name is worth another look.

Another option is "AI Catastrophe from First Principles"

because who's talking about medium-size risks from AGI?

Well, I have talked about them... :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capability claim, as an AI could cause a catastrophe regardless. For instance, consider a scenario where an AI launches an attack calculated to destroy human civilization. If the AI was successful in destroying humanity or large parts of it, but the AI itself was also destroyed in the process, this would not count as a DSA as originally defined. Yet, it seems hard to deny that this outcome should nonetheless count as a catastrophe.

Because of this, this chapter focuses on situations where an AI achieves (at least) a major strategic advantage (MSA), which we will define as “a level of technological and other advantages sufficient to pose a catastrophic risk to human society.” A catastrophic risk is one that might inflict serious damage to human well-being on a global scale and cause 10 million or more fatalities (Bostrom & Ćirković 2008).