Quintin Pope & Nora Belrose have a new “AI Optimists” website, along with a new essay “AI is easy to control”, arguing that the risk of human extinction due to future AI (“AI x-risk”[1]) is a mere 1% (“a tail risk worth considering, but not the dominant source of risk in the world”). (I’m much more pessimistic.) It makes lots of interesting arguments, and I’m happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.

This is not a comprehensive rebuttal or anything, but rather picking up on a few threads that seem important for where we disagree, or where I have something I want to say.

(Note: Nora has a reply here.)

Summary / table-of-contents:

Note: I think Sections 1 & 4 are the main reasons that I’m much more pessimistic about AI x-risk than Pope & Belrose, whereas Sections 2 & 3 are more nitpicky.

  • Section 1 argues that even if controllable AI has an “easy” technical solution, there are still good reasons to be concerned about AI takeover, because of things like competition and coordination issues, and in fact I would still be overall pessimistic about our prospects.
  • Section 2 talks about the terms “black box” versus “white box”.
  • Section 3 talks about what if anything we learn from “human alignment”, including some background on how I think about human innate drives.
  • Section 4 argues that pretty much the whole essay would need to be thrown out if future AI is trained in a substantially different way from current LLMs. If this strikes you as a bizarre unthinkable hypothetical, yes I am here to tell you that other types of AI do actually exist, and I specifically discuss the example of “brain-like AGI” (a version of actor-critic model-based RL), spelling out a bunch of areas where the essay makes claims that wouldn’t apply to that type of AI, and more generally how it would differ from LLMs in safety-relevant ways.

1. Even if controllable AI has an “easy” technical solution, I’d still be pessimistic about AI takeover

Most of Pope & Belrose’s essay is on the narrow question of whether the AI control problem has an easy technical solution. That’s great! I’m strongly in favor of arguing about narrow questions. And after this section I’ll be talking about that narrow question as well. But the authors do also bring up the broader question of whether AI takeover is likely to happen, all things considered. These are not the same question; for example, there could be an easy technical solution, but people don’t use it.

So, for this section only, I will assume for the sake of argument that there is in fact an easy technical solution to the AI control and/or alignment problem. Unfortunately, in this world, I would still think future catastrophic takeover by out-of-control AI is not only plausible but likely.

Suppose someone makes an AI that really really wants something in the world to happen, in the same way a person might really really want to get out of debt, or Elon Musk really really wants for there to be a Mars colony—including via means-end reasoning, out-of-the-box solutions, inventing new tools to solve problems, and so on. Then, if that “want” is stronger than the AI’s desires and habits for obedience and norm-following (if any), and if the AI is sufficiently capable, then the natural result would be an AI that irreversibly escapes human control—see instrumental convergence.

But before we get to that, why might we suppose that someone might make an AI that really really wants something in the world to happen? Well, lots of reasons:

  • People have been trying to do that since the dawn of AI.
  • Humans often really really want something in the world to happen (e.g., for there to be more efficient solar cells, for my country to win the war, to make lots of money, to do a certain very impressive thing that will win fame and investors and NeurIPS papers, etc.), and one presumes that some of those humans will reason “Well, the best way to make X happen is to build an AI that really really wants X to happen”. You and I might declare that these people are being stupid, but boy, people do stupid things every day.
  • As AI advances, more and more people are likely to have an intuition that it’s unethical to exclusively make AIs that have no rights and are constitutionally subservient with no aspirations of their own. This is already starting to happen. I’ll put aside the question of whether or not that intuition is justified.
  • Some people think that irreversible AGI catastrophe cannot possibly happen regardless of the AGI’s motivations and capabilities, because of [insert stupid reason that doesn’t stand up to scrutiny], or will be prevented by [poorly-thought-through “guardrail” that won’t actually work]. One hopes that the number of such people will go down with time, but I don’t expect it to go to zero.
  • Some people want to make AGI as capable and independent as possible, even if it means that humanity will go extinct, because “AI is the next step of evolution” or whatever. Mercifully few people think that! But they do exist.
  • Sometimes people do things just to see what would happen (cf. chaos-GPT).

So now I wind up with a strong default assumption that the future world will have both AIs under close human supervision and out-of-control consequentialist AIs ruthlessly seeking power. So, what should we expect to happen at the end of the day? It depends on offense-defense balance, and regulation, and a host of other issues. This is a complicated topic with lots of uncertainties and considerations on both sides. As it happens, I lean pessimistic that humanity will survive; see my post What does it take to defend the world against out-of-control AGIs? for details. Again, I think there’s a lot of uncertainty, and scope for reasonable people to disagree—but I don’t think one can think carefully and fairly about this topic and wind up with a probability as low as 1% that there will ever be a catastrophic AI takeover.

2. Black-and-white (box) thinking

The authors repeat over and over that AIs are “white boxes” unlike human brains which are “black boxes”. I was arguing with Nora about this a couple months ago, and Charles Foster also chimed in with a helpful perspective on twitter, arguing convincingly that the terms “white box” and “black box” are used differently in different fields. My takeaway is: I’m sick of arguing about this and I really wish everybody would just taboo the words “black box” and “white box”—i.e., say whatever you want to say without using those particular words.

So, here are two things that I hope everybody can agree with:

  • (A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. In this respect, LLMs are different from many other engineered artifacts[2] such as bridges and airplanes. For example, if an airplane reliably exhibits a certain behavior (let’s say, it tends to pitch left in unusually low air pressure), and you ask me “why does it exhibit that behavior?” then it’s a safe bet that the airplane designers could figure out a satisfying intuitive answer pretty quickly (maybe immediately, maybe not, but certainly not decades). Likewise, if a non-ML computer program, like the Linux kernel, reliably exhibits a certain behavior, then it’s a safe bet that there’s a satisfying intuitive answer to “why does the program do that”, and that the people who have been writing and working with the source code could generate that answer pretty quickly, often in minutes. (There is such a thing as a buggy behavior that takes many person-years to understand, but they make good stories partly because they are so rare.)
  • (B) Hopefully everyone on all sides can agree that if you train an LLM, then you can view any or all the billions of weights and activations, and you can also perform gradient descent on the weights. In this respect, LLMs are different from biological intelligence, because biological neurons are far harder to measure and manipulate experimentally. Even mice have orders of magnitude too many neurons to measure and manipulate the activity of all of them in real time, and even if you could, you certainly couldn’t perform gradient descent on an entire mouse brain.

Again, I hope we can agree on those two things (and similar), even if we disagree about what those facts imply about AI x-risk. For the record, I don’t think either of the above bullet points by itself should be sufficient to make someone feel optimistic or pessimistic about AI x-risk. But they can be an ingredient in a larger argument. So can we all stop arguing about whether LLMs are “black box” or “white box”, and move on to the meaningful stuff, please?

3. What lessons do we learn from “human alignment” (such as it is)?

The authors write:

If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior. Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives. In essence, we are poking and prodding at the human brain’s learning algorithms from the outside, instead of directly engineering those learning algorithms.

It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.

I think there are two quite different stories here which are confusingly tangled together.

  • Story 1: “Humans have innate, evolved drives that lead to them wanting to be prosocial, fit into their culture, imitate role models, etc., at least to some extent.”
  • Story 2: “Human children are gradually sculpted into kind and productive adults by parents and society providing rewards and punishments, and controlling their life experience in other ways.”

I basically lean towards Story 1 for reasons in my post Heritability, Behaviorism, and Within-Lifetime RL.

There are some caveats—e.g. parents can obviously “sculpt” arbitrary children into unkind and unproductive adults by malnourishing them, or by isolating them from all human contact, or by exposing them to lead dust, etc. But generally, the sentence “we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults” sounds almost as absurd to my ears as if the authors had written “we are forced to resort to crude and error-prone tools for shaping young humans into adults that have four-chambered hearts”. The credit goes to evolution, not “us”.

So what does that imply for AI x-risk? I don’t know, this is a few steps removed. But it brings us to the subject of “human innate drives”, a subject close to my (four-chambered) heart. I think the AGI-safety-relevant part of human innate drives—the part related to compassion and so on—is the equivalent of probably hundreds of lines of pseudocode, and nobody knows what they are. I think it would be nice if we did, and that happens to be a major research interest of mine. If memory serves, Quintin has kindly wished me luck in figuring this out. But the article here seems to strongly imply that it hardly matters, as we can easily get AI alignment and control regardless.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I think maybe the idea is that we’re approximating human innate drives via the RLHF reward function, so the fact that human innate drives are simple should give us confidence that the RLHF reward function (with its comparatively abundant amount of free parameters and training data) will accurately capture human innate drives? If so, I strongly disagree with the premise: The RLHF reward function is not approximating human innate drives. Instead it’s approximating the preferences of human adults, which are not totally unrelated to human innate drives, but sure aren’t the same thing. For example, here’s what an innate drive might vaguely look like for laughter—it’s this weird thing involving certain hormone levels and various poorly-studied innate signaling pathways in the hypothalamus (if I’m right). Compare that to a human adult’s sense of humor. The RLHF reward function is approximately capturing the latter (among many other things), but it has little relation to the former.

Again, so what? Is this an argument for AI doom? No. I’m making a narrow argument against some points raised in this post. If you want to argue that the RLHF reward function does a good job of capturing the preferences of human adults, then by all means, make that argument directly. I might even agree. But let’s leave human innate drives out of it.

4. Can we all agree in advance to disavow this whole “AI is easy to control” essay if future powerful AI is trained in a meaningfully different way from current LLMs?

My understanding is that the authors expect the most powerful future AI training approaches to be basically similar to what’s used in today’s Large Language Models—autoregressive prediction of human-created text and/or other data, followed by RLHF fine-tuning or similar.

As it happens, I disagree. But if the authors are right, then … I don’t know. “AI x-risk in the scenario that future transformative AI is trained in a similar way as current LLMs” is not really my area of expertise or interest. I don’t expect that scenario to actualize, so I have difficulty thinking clearly about it—like if someone says to me “Imagine a square circle, and now answer the following questions about it…”. Anyway, if we’re specifically talking about future AI whose training is basically the same as modern LLMs, then a lot of the optimistic takes in the essay would seem pretty plausible to me. But I also often read more pessimistic narratives, and those takes sound pretty plausible to me too!! I don’t really know how I feel. I’ll step aside and leave that debate to others.

So anyway, if the authors think that future transformative AI will be trained much like modern LLMs, then that’s a fine thing for them to believe—even if I happen to disagree. Lots of reasonable people believe that. And I think these authors in particular believe it for interesting and well-considered reasons, not just “ooh, chatGPT is cool!”. I don’t want to argue about that—we’ll find out one way or the other, sooner or later.

But it means that the post is full of claims and assumptions that are valid for current LLMs (or for future AI which is trained in basically the same way as current LLMs) but not for other kinds of AI. And I think this is not made sufficiently clear. In fact, it’s not even mentioned.

Why is this a problem? Because there are people right now trying to build transformative AI using architectures and training approaches that are quite different from LLMs, in safety-relevant ways. And they are reading this essay, and they are treating it as further confirmation that what they’re doing is fine and (practically) risk-free. But they shouldn’t! This essay just plain doesn’t apply to what those people are doing!! (For a real-life example of such a person, see here.)

So I propose that the authors should state clearly and repeatedly that, if the most powerful future AI is trained in a meaningfully different way from current LLMs, then they disavow their essay (and, I expect, much of the rest of their website). If the authors are super-confident that that will never happen, because LLM-like approaches are the future, then such a statement would be unimportant—they’re really not conceding anything, from their own perspective. But it would be really important from my perspective!

4.1 Examples where the essay is making points that don’t apply to “brain-like AGI” (≈ actor-critic model-based RL)

I’ll leave aside the many obvious examples throughout the essay where the authors use properties of current LLMs as direct evidence about the properties of future powerful AI. Here are some slightly-less-obvious examples:

Since AIs are white boxes, we have full control over their “sensory environment” (whether that consists of text, images, or other modalities).

As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.

If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains. An interesting question is, where does it go wrong? My current guess is that the main problem is that the “desires” of actor-critic RL agents are (by necessity) mainly edited by TD learning, which I think of as generally a much cruder tool than gradient descent.
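To make the TD-learning guess a bit more concrete, here is a minimal sketch of one way to read it. Everything in it is an assumption for illustration: a tabular critic, made-up state names, and made-up numbers. It is not a claim about how brain-like AGI would actually be implemented. The point is just that a TD(0) update only edits the value of the state that was actually visited, so a value that is never acted on is never touched.

```python
# Toy tabular critic: V maps a state name to its estimated value
# ("how much the agent wants to be there"). All names/numbers are hypothetical.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[state] in place and return the TD error."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

V = {"work": 0.0, "rest": 0.0, "secret_plan": 5.0}

# The agent only ever alternates between "work" and "rest".
for _ in range(1000):
    td0_update(V, "work", reward=1.0, next_state="rest")
    td0_update(V, "rest", reward=0.0, next_state="work")

# "secret_plan" was never visited, so its stored value was never edited.
secret_value_after = V["secret_plan"]
```

This isolation is a property of the tabular toy. With function approximation, updates to one state can leak into others, so the sketch shows the flavor of the argument rather than settling it.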

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.
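Here is a toy sketch of that reproducibility point, under heavy simplifying assumptions: a hypothetical one-weight "agent", a made-up experiment, and plain gradient steps standing in for whatever online learning rule a real system would use. The frozen agent gives the same experimental result before and after an unrelated experience, while the online learner does not.

```python
class ToyAgent:
    """Hypothetical one-weight agent; `online` decides whether experience edits the weight."""

    def __init__(self, weight=1.0, online=False, lr=0.5):
        self.weight = weight
        self.online = online
        self.lr = lr

    def act(self, observation):
        return self.weight * observation

    def experience(self, observation, target):
        # An online learner takes a gradient step on squared error;
        # a frozen agent leaves its weight alone.
        if self.online:
            error = self.act(observation) - target
            self.weight -= self.lr * error * observation


def run_experiment(agent):
    """A fixed probe of the agent's behavior (stand-in for a safety experiment)."""
    return agent.act(2.0)


frozen = ToyAgent(online=False)
online = ToyAgent(online=True)

before = (run_experiment(frozen), run_experiment(online))

# Any experience at all, e.g. "reads the morning newspaper".
for agent in (frozen, online):
    agent.experience(observation=1.0, target=3.0)

after = (run_experiment(frozen), run_experiment(online))
```

In this toy, resetting the frozen agent's "memories" costs nothing because nothing changed. For the online learner, you would have to restore a weight snapshot, and then you are testing the old agent, not the one that is currently running.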

When it comes to AIs, we are the innate reward system.

I have no idea how I’m supposed to interpret this sentence for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!

4.2 No, “brain-like AGI” is not trained similarly to LLMs

This seems really obvious to me, but evidently it’s controversial, so let’s walk through some example differences. None of these are trying to prove some point about AI alignment and control being easy or hard; instead I am making the narrow point that the safety/danger of future LLMs is a different technical question than the safety/danger of hypothetical future brain-like AGI.

  • Brains can imitate, but do so in a fundamentally different way from LLM pretraining. Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step; most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all; that’s just mechanically what it does.
  • Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it. By contrast, there isn’t any distinction between “the LLM expects the next token to be ‘a’” and “the LLM wants the next token to be ‘a’”. (Or if there is, it’s complicated and emergent and controversial, rather than directly and straightforwardly reflected in the source code, as I claim it would be in brain-like AGI.) So this is another disanalogy, and one with obvious relevance to technical arguments about safety.
  • In brains, online learning (editing weights, not just context window) is part of problem-solving. If I ask a smart human a hard science question, their brain may chug along from time t=0 to t=10 minutes, as they stare into space, and then out comes an answer. After that 10 minutes, their brain is permanently different than it was before (i.e., different weights)—they’ve figured things out about science that they didn’t previously know. Not only that, but the online-learning (weight editing) that they did during time 0<t<5 minutes is absolutely critical for the further processing that happens during time 5<t<10 minutes. This is not how today’s LLMs work—LLMs don’t edit weights in the course of “thinking”. I think this is safety-relevant for a number of reasons, including whether we can expect future AI to get rapidly more capable in an open-ended way without new human-provided training data (related discussion).
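The expectation-versus-desire distinction in the second bullet above can be sketched in a few lines. This is a toy under stated assumptions: the token logits, the two-action world model, and the critic values are all hypothetical numbers, and real systems of either kind are far more complicated. In the LLM-style half there is a single next-token distribution, and sampling from it is both the prediction and the output. In the actor-critic-style half, "what I expect to happen" and "how much I want it" live in separate data structures.

```python
import math


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# LLM-style: one distribution over next tokens. There is no separate
# slot in the code for "wants 'a'" as opposed to "expects 'a'".
next_token_probs = softmax({"a": 2.0, "b": 0.5})

# Actor-critic-style: expectations and desires are separate objects.
# world_model[action][outcome] = predicted probability (expectation).
world_model = {
    "stay": {"rain": 0.8, "sun": 0.2},
    "travel": {"rain": 0.1, "sun": 0.9},
}
# critic[outcome] = how good the outcome is judged to be (desire).
critic = {"rain": -1.0, "sun": 1.0}


def expected_value(action):
    return sum(p * critic[outcome] for outcome, p in world_model[action].items())


# The agent strongly expects rain if it stays, but does not want rain,
# so it picks the action whose predicted outcomes it values more.
best_action = max(world_model, key=expected_value)
```

Here the agent can assign 80% probability to rain while assigning it negative value, and nothing in the code forces those two numbers to move together.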

I want to reiterate that I’m delighted that people are contingency-planning for the possibility that future transformative AI will be LLM-like. We should definitely be doing that. But we should be very clear that that’s what we’re doing.

  1. ^

    “x-risk” is not quite synonymous with “extinction risk”, but they’re close enough in the context of the stuff I’m talking about here.

  2. ^

    Note that I said “many other engineered artifacts”, not “all”. The other examples I can think of tend to be in biology. For example, if I selectively breed a cultivar of cabbage that has lower irrigation needs, and I notice that its stalks are all a weird color, I may have no idea why, and it may take a decades-long research project to figure it out. Or as another example, there are many pharmaceutical drugs that are effective, but where nobody knows why they are effective, even after extraordinary efforts to figure it out.


What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)

Places I think AI Optimists and I agree:

  • We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research about which alignment techniques are most effective.
  • Evolution is a misleading analogy for many aspects of the alignment problem; in particular, gradient-based optimization seems likely to have importantly different training dynamics from evolution, like making it harder to gradient hack your training process into retaining cognition which isn’t directly useful for producing high-reward outputs during training.
  • Humans end up with learned drives, e.g. empathy and revenge, which are not hard-coded into our reward systems. AI systems also have not-strictly-optimal-for-their-training-signal learned drives like this.
  • It shouldn’t be difficult for AI systems to faithfully imitate human value judgements and uncertainty about those value judgements.

Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like “points I would like to see addressed in this document.”

  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • These AI systems are aligned in a controlled lab environment, but then deployed into the world at-large. Many of their interactions are difficult to monitor (and are also interactions with other AI systems).
      • Possibly: these AI systems are highly multi-modal, including sensors which look like “camera readouts of real-world data”
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • When they write things like “AIs are white boxes, we have full control over their ‘sensory environment’,” it seems like they’re imagining the latter regime.
      • They’re not very clear about what intelligence regime they’re discussing, but I’m guessing they’re talking about the ~human-level intelligence regime (e.g. because they don’t spill much ink discussing scalable oversight problems; see below). 
  • I worry that the difference between “looks good to human evaluators” and “what human evaluators actually want” is important.
    • Concretely, I worry that training AI systems to produce outputs which look good to human evaluators will lead to AI systems which learn to systematically deceive their overseers, e.g. by introducing subtle errors which trick overseers into giving a too-high score, or by tampering with the sensors that overseers use to evaluate model outputs.
    • Note that arguments about the ease of learning human values and NN inductive biases don’t address this point — if our reward signal systematically prefers goals like “look good to evaluators” over goals like “actually be good,” then good priors won’t save us.
      • (Unless we do early stopping, in which case I want to hear a stronger case for why our models’ alignment will be sufficiently robust (robust enough that we’re happy to stop fine-tuning) before our models have learned to systematically deceive their overseers.)
  • I worry about sufficiently situationally aware AI systems learning to fixate on reward mechanisms (e.g. “was the thumbs-up button pressed” instead of “was the human happy”).
    • To sketch this concern out concretely, suppose an AI system is aware that it’s being fine-tuned and learned during pretraining that human overseers have a “thumbs-up” button which determines whether the model is rewarded. Suppose that so far during fine-tuning “thumbs-up button was pressed” and “human was happy” were always perfectly correlated. Will the model learn to form values around the thumbs-up button being pressed or around humans being happy? I think it’s not obvious.
    • Unlike before, NN inductive biases are relevant here. But it’s not clear to me that “humans are happy” will be favored over “thumbs-up button is pressed” — both seem similarly simple to an AI with a rich enough world model.
    • I don’t think the comparison with humans here is especially a cause for optimism: lots of humans get addicted to things, which feels to me like “forming drives around directly intervening on reward circuitry.”
  • For both of the above concerns, I worry that they might emerge suddenly with scale.
    • As argued here, “trick the overseer” will only be selected for in fine-tuning once the (pretrained) model is smart enough to do it well.
    • You can only form values around the thumbs-up button once you know it exists.
  • It seems to me that, on the authors’ view, an important input to “human alignment” is the environment that we’re trained in (rather than details of our brain’s reward circuitry, which is probably very simple). It doesn’t seem to me that environmental factors that make humans aligned (with each other) should generalize to make AI systems aligned (with humans).
    • In particular, I would guess that one important part of our environment is that humans need to interact with lots of similarly-capable humans, so that we form values around cooperation with humans. I also expect AI systems to interact with lots of AI systems (though not necessarily in training), which (if this analogy holds at all) would make AI systems care about each other, not about humans.
  • I don’t have high enough confidence, either in our understanding of NN inductive biases or in the way Quintin/Nora make arguments based on said understanding, to consider these arguments strong evidence that models won’t “play the training game” while they know they’re being trained/evaluated, only to pursue goals they hid from their overseers once in deployment.
    • I don’t really want to get into this, because it’s thorny and not my main source of P(doom).

A specific critique about the article:

  • The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
    • The developer wanted their model to be sufficiently aligned that it would, e.g. never say racist stuff no matter what input it saw. In contrast, it takes only a little bit of adversarial pressure to produce inputs which will make the model say racist stuff. This indicates that the developer failed at alignment. (I agree that it means that the attacker succeeded at alignment.)
    • Part of the story here seems to be that AI systems have held-over drives from pretraining (e.g. drives like “produce continuations that look like plausible web text”). Eliminating these undesired drives is part of alignment.

The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.

The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.

The AI is not jailbreaking itself, here.

This link explains it better than I can, here:

https://www.aisnakeoil.com/p/model-alignment-protects-against

  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • ...
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • ...

The AI Optimists don't make this argument AFAICT, but I think optimism about effectively utilizing "human level" models should transfer to a considerable amount of optimism about smarter-than-human models due to the potential for using these "human level" systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at around qualitatively "human level" capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the "default" trajectory because we'll want to run extremely large numbers of these systems at high speeds.)

The idea of using human level models like this has a bunch of important caveats which mean you shouldn't end up being extremely optimistic overall IMO[2]:

  • It's not clear that "human level" will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I'm moderately optimistic here.)
  • Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety, which should make us somewhat more optimistic. But it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans we need the equivalent of having the 300 best safety researchers work for 500 years, and we won't have enough acceleration and delay to manage this. (In practice I'm somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
  • Will "human level" systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment etc.). This objection holds even if we aren't worried about catastrophic misalignment risk.
  1. ^ At least R&D which isn't very limited by physical processes.

  2. ^ I think <1% doom seems too optimistic without more of a story for how we're going to handle super human models.

Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly by default very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.

Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime.

This isn't true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources and in practice building an uncontrollably powerful AI takes Y cognitive resources with X < Y.

(Also, this plan doesn't require necessarily aligning "human level" AIs, just being able to get work out of them with sufficiently high productivity and low danger.)

I'm being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn't obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don't as readily jump to attention. The whole problem is messiness and lack of coordination, so starting from scratch with AGIs seems more promising than reforming human society. But without strong coordination on development and deployment of first AGIs, the situation with activities of AGIs is going to be just as messy and uncoordinated, only unfolding much faster, and that's not even counting the risk of getting a superintelligence right away.

The "AI is easy to control" piece does talk about scaling to superhuman AI:

In what follows, we will argue that AI, even superhuman AI, will remain much more controllable than humans for the foreseeable future. Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability.

If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.

However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can ensure the following weaker notion of control: "we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that we can avoid generation N+1 AIs being capable of causing catastrophic outcomes (like AI takeover) while using those AIs to speed up the labor of generation N by a large multiple". This notion of control doesn't (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and they defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn't the only issue, but it is a sufficient counterexample.)

This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.

(See also my discussion of using human level ish AIs to automate safety research in the sibling.)

I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.

I'm guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I'm bidding for -- if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design" -- for that argument to make it into the forthcoming document.

(Didn't consult Quintin on this; I speak for myself)

I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we're using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we'd be in a better spot than with humans (because we can run many controlled experiments).

A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains.

I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.

As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.

  1. This is just generally a pretty weak argument. You don't seem to be contesting the fact that we have full sensory control for AI and we don’t have full sensory control for humans. It’s just a claim that this doesn’t matter. Maybe this ends up being a brute clash of intuitions, but it seems obvious to me that full sensory control matters a lot, even if the AI is doing a lot of long running cognition without supervision.
  2. With AI we can choose to cut its reasoning short whenever we want, force it to explain itself in human language, roll it back to a previous state, etc. We just have a lot more control over this ongoing reasoning process for AIs and it’s baffling to me that you seem to think this mostly doesn’t matter.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before

You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don't think online learning changes the equation very much. It's known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we'd need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn't be able to notice during training and evaluation.
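As a rough illustration of what "including online learning in the experimentation loop" could look like, here is a toy sketch. Everything in it is an assumption made for illustration: `ToyAgent`, `run_online`, and `drift_report` are hypothetical names, the agent is a tiny linear policy, and the environments are seeded random reward rules. It says nothing about whether the same procedure would be informative for a brain-like AGI; it only shows the shape of the loop: snapshot the agent, let copies learn online in different environments, then measure how much behaviour on fixed probes has drifted.

```python
import copy
import random


class ToyAgent:
    """Hypothetical stand-in for an agent that keeps learning after deployment."""

    def __init__(self, weights):
        self.weights = list(weights)

    def act(self, obs):
        # Binary action from a linear score.
        score = sum(w * x for w, x in zip(self.weights, obs))
        return 1 if score >= 0 else 0

    def online_update(self, obs, reward, lr=0.1):
        # Reinforce the action just taken if rewarded, weaken it if punished.
        direction = 1.0 if self.act(obs) == 1 else -1.0
        self.weights = [w + lr * reward * direction * x
                        for w, x in zip(self.weights, obs)]


def run_online(agent, env_seed, steps):
    """Let the agent learn online in a seeded toy environment."""
    rng = random.Random(env_seed)
    dim = len(agent.weights)
    target = [rng.uniform(-1, 1) for _ in range(dim)]  # what this environment rewards
    for _ in range(steps):
        obs = [rng.uniform(-1, 1) for _ in range(dim)]
        wanted = 1 if sum(t * x for t, x in zip(target, obs)) >= 0 else 0
        reward = 1.0 if agent.act(obs) == wanted else -1.0
        agent.online_update(obs, reward)


def drift_report(base_agent, env_seeds, probes, steps=200):
    """Fraction of probe behaviours that change after online learning, per environment."""
    baseline = [base_agent.act(p) for p in probes]
    report = {}
    for seed in env_seeds:
        clone = copy.deepcopy(base_agent)  # "reset": every run starts from the same snapshot
        run_online(clone, seed, steps)
        after = [clone.act(p) for p in probes]
        report[seed] = sum(a != b for a, b in zip(baseline, after)) / len(probes)
    return report


# Fixed probe inputs so that runs are comparable.
probe_rng = random.Random(0)
probes = [[probe_rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
base = ToyAgent([0.5, -0.2, 0.1])
print(drift_report(base, env_seeds=[1, 2, 3], probes=probes))
```

Because each run starts from a copy of the same snapshot and the environments are seeded, the report is reproducible, and the snapshot itself is never modified. Whether drift measured this way would catch the changes that matter for alignment is exactly the point under dispute in this thread.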


I have no idea how I’m supposed to interpret this sentence ["we are the innate reward system"] for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!

It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.

Brains can imitate, but do so in a fundamentally different way from LLM pretraining

I don’t see why any of the differences you listed are relevant for safety.

Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it.

I basically deny this, especially if you're stipulating that it's a "clean" distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it's also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there's not a particularly sharp divide.


I think this is one particularly striking example of a ubiquitous problem in alignment discussions: people are thinking of different types of AI without explicitly stating this, so they reach different conclusions about alignment. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in particular, agentic vs. tool AI and RL-trained vs. predictive AI. These are not sharp categories, but distinguishing what part of the spectrum we're primarily addressing would clarify discussions.

You've done an admirable job of doing that in this post, and doing so seems to make sense of your disagreements with Pope's conclusions.

Pope appears to be talking primarily about LLMs, so the extent to which his logic applies to other forms of AI is unclear. As you note, that logic does not seem to apply to AI that is agentic (explicitly goal-directed), or to actor-critic RL agents.

That is not the only problem with that essay, but it's a big one, since the essay comes to the conclusion that AI is safe, while analyzing only one type of AI.

I agree that human ethics is not the result solely of training, but has a critical component of innate drives to be pro-social. The existence of sociopaths whose upbringing was normal is pretty compelling evidence that the genetic component is causal.

While the genetic basis of prosocial behavior is probably simple in the sense that it is coded in a limited amount of DNA information and neural circuitry, it is likely quite complex in another sense: it is evolved to work properly in the context of a very particular type of environment, that of standard human experience. As such, I find it unlikely that those mechanisms would produce an aligned agent in a very different AI training regime, nor that that alignment would generalize to very different situations than humans commonly encounter.

As you note, even if we restricted ourselves to this type of AI, and alignment was easy, that would not reduce existential risks to near 1%. If powerful AI is accessible to many, someone is going to either make mistakes or deliberately use it destructively, probably rather quickly.

I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven't yet read the page you're responding to.)

Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. 

Maybe? Depends on what exactly you mean by the word "might", but it doesn't seem obvious to me that this would need to be the case. My intuition from seeing the kinds of interpretability results we've seen so far is that within less than a decade we'd already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don't keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.

If by "might" you mean something like a "there's at least a 10% probability that this could take decades to answer" then sure, I'd agree with that. Now I haven't actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn't seem immediately obvious to me that I should assign "it would take decades of work to answer this" a very high probability.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I would assume the intuition to be something like "if they're simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex".

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be. 

Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that's specifically the kind of thing that I'd expect them to work on trying to find out!

And even if there was a major shift, I think it's basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn't change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt "get it"). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you've made about an adult will tend to stay valid, though of course with children and younger people there's much greater change.

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, and that it would be misleading to say that the people in question want it. For example, I think about this post by @romeostevensit often:

Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can't seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.

I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.

It's, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me that a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful ("I wasn't lazy after all, I just didn't have the right tools for being productive" can drastically reorient many predictions you're making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.

I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.

Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I try a highly-addictive substance while reading it, etc.  :)

Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”

Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget design expert; and so on.

But I think that’s wrong.

I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.

Why do I think that? Because there’s just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human that knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?

So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I disagree with whether that distinction matters:

I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.

So for example, here’s a nightmare-scenario that I think about often:

  • (step 1) Someone reads a bunch of discussions about LLM x-risk
  • (step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
  • (step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.

Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I'm used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved.  :)

Aligning human-level AGIs is important to the extent there is risk it doesn't happen before it's too late. Similarly with setting up a world where initially aligned human-level AGIs don't soon disempower humans (as literal humans might in the shoes of these AGIs), or fail to protect the world from misused or misaligned AGIs or superintelligences.

Then there is a problem of aligning superintelligences, and of setting up a world where initially aligned superintelligences don't cause disempowerment of humans down the line (whether that involves extinction or not). Humanity is a very small phenomenon compared to a society of superintelligences; remaining in control of it is a very unusual situation. (Humanity eventually growing up to become a society of superintelligences while holding off on creating a society of alien superintelligences in the meantime seems like a more plausible path to success.)

Solving any of these problems doesn't diminish importance of the others, which remain as sources of possible doom, unless they too get solved before it's too late. Urgency of all of these problems originates from the risk of succeeding in developing AGI. Tasking the first aligned AGIs with solving the rest of the problems caused by the technology that enables their existence seems like the only plausible way of keeping up, since by default all of this likely occurs in a matter of years (from development of first AGIs). Though economic incentives in AGI deployment risk escalating the problems faster than AGIs can implement solutions to them. Just as initial development of AGIs risks creating problems faster than humans can prepare for them.

(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.

I don't confidently disagree with this statement, but it occurs to me that I haven't tried it myself and haven't followed it very closely, and have sometimes heard claims that there are promising methods.

A lot of people trying to come up with answers try to do it with mechanistic interpretability, but that probably isn't very feasible. However, investigations based on ideas like neural tangent kernels seem plausibly more satisfying and feasible. Like if you show that the dataset contains a bunch of instances that'd push it towards saying apple rather than banana, and you then investigate where those data points come from and realize that there's actually a pretty logical story for them, then that seems basically like success.

As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn't read the paper so I don't know whether it's legit, but that sort of thing seems quite plausibly feasible a lot of the time.
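To make the "which data points push it towards apple rather than banana" idea concrete, here is a minimal sketch on a toy model. It is not the method of the paper mentioned above (which I haven't read); it is a simple first-order, gradient-similarity score, assuming a logistic-regression "model" with label 1 for "apple" and 0 for "banana". The names `attribution_scores` and `sgd_step` are hypothetical. For a linear model the kernel between a training input and the test input is just their inner product; for a neural network the analogous quantity would be an inner product of parameter gradients.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attribution_scores(weights, train_set, x_test):
    """First-order effect of one SGD step on each training example on the test logit.

    Label 1 = "apple", 0 = "banana". A positive score means the example pushes
    the model towards "apple" on x_test; a negative score, towards "banana".
    """
    scores = []
    for x_i, y_i in train_set:
        p_i = sigmoid(dot(weights, x_i))
        # -(loss gradient) . (test-logit gradient) = (y_i - p_i) * <x_i, x_test>
        scores.append((y_i - p_i) * dot(x_i, x_test))
    return scores


def sgd_step(weights, x_i, y_i, lr):
    """One gradient step on the log-loss of a single training example."""
    p_i = sigmoid(dot(weights, x_i))
    return [w - lr * (p_i - y_i) * x for w, x in zip(weights, x_i)]


# Tiny made-up dataset: two features, three training examples.
train_set = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [0.0, 0.0]
x_test = [1.0, 0.0]
scores = attribution_scores(weights, train_set, x_test)
print(scores)
```

In this toy setup the first example pushes towards "apple", the second towards "banana", and the third has no effect on this test input because it shares no features with it. Taking an actual gradient step with `sgd_step` on one example changes the test logit by the learning rate times that example's score, which is the sense in which the score "attributes" the output to the data point.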

As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn't read the paper so I don't know whether it's legit, but that sort of thing seems quite plausibly feasible a lot of the time.

Perhaps you're thinking of the recent influence function work from Anthropic?

I don't think that this paper either shows or claims that "LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions". But they do find that there are influential training examples from sci-fi stories and AI safety discussion when asking the model questions about topics like this.