Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

(Cross-posted from my website)

I’ve written a report about whether advanced AIs will fake alignment during training in order to get power later – a behavior I call “scheming” (also sometimes called “deceptive alignment”). The report is available on arXiv here. There’s also an audio version here, and I’ve included the introductory section below (with audio for that section here). This section includes a full summary of the report, which covers most of the main points and technical terminology. I’m hoping that the summary will provide much of the context necessary to understand individual sections of the report on their own.


This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later – a behavior I call “scheming” (also sometimes called


(Partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers see... (read more)

1Joe Carlsmith15h
The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, "On 'slack' in training" -- and at various points the report references how differing levels of "slack" might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: "policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy A than policy B."  (And for clarity, I don't think that arguments of this form always support expecting models to do tons of reasoning about the training set-up. For example, as the report discusses in e.g. Section 4.4, on "speed arguments," the amount of world-modeling/instrumental-reasoning that the model does can affect the loss it gets via e.g. using up cognitive resources. So schemers -- and also, reward-on-the-episode seekers -- can be at some disadvantage, in this respect, relative to models that don't think about the training process at all.)
3Joe Carlsmith15h
Agents that end up intrinsically motivated to get reward on the episode would be "terminal training-gamers/reward-on-the-episode seekers," and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on "non-schemers with schemer-like traits"), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.


You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on. 

This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference” and “I just read a post on a surprising new insight and...

2Ryan Greenblatt17h
Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon. (I work at RR)
2Ryan Greenblatt17h
Yep, indeed I would consider "control evaluations" to be a method of "AI control". I consider the evaluation and the technique development to be part of a unified methodology (we'll describe this more in a forthcoming post). (I work at RR)
6Zach Stein-Perlman14h
It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or in particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.

(Agreed except that "inference-time safety techiques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is, this point isn't that important.)

This is Section 2.1 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.

Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.

What's required for scheming?

Let's turn, now, to examining the probability that baseline ML methods for training advanced AIs will produce schemers. I'll begin with an examination of the prerequisites for scheming. I'll focus on:

  1. Situational awareness: that is, the model understands that it's a model in a training process, what the training process will reward,

2Daniel Kokotajlo21h
Which training signal? Across all of time and space, there are many different AIs being trained with many different signals, and of course there are also non-AI minds like humans and animals and aliens. Even the choice to optimize for some aggregate of AI training signals is already a choice to self-locate as an AI. But realistically given the diversity of training signals, probably significant gains will be had by self-locating as a particular class of AIs, namely those whose training signals are roughly what the actual training signal is.
1Joe Carlsmith14h
Agree that it would need to have some conception of the type of training signal to optimize for, that it will do better in training the more accurate its picture of the training signal, and that this provides an incentive to self-locate more accurately (though not necessary to degree at stake in e.g. knowing what server you're running on).
I claim this part is basically unneccesary. Once the AI has situational awareness, if it's optimizing for human-assigned ratings, then scheming-style behavior naturally follows even if the AI is only optimizing for a single episode. This came up in a recent dialogue with Eli. We talked about an AI optimized against human ratings, facing a choice of whether to "seize the raters and put them in special 'thumbs up'-only suits". [The below is copied from the dialogue.] If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this "thumbs-up-only suits" scenario comes up, the AI's actual reasoning will route through something like: * Ok, I have the opportunity to put these simulated humans in thumbs-up-only suits. * If I do that, then the actual humans who produce the actual ratings will give a bad rating; they would not actually be in thumbs-up-only suits. * Therefore I won't do that. ... and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons: * Ok, I have the opportunity to put the actual humans who produce the actual ratings in thumbs-up-only suits. * If I do that, then the actual ratings will be great. * Therefore I do that. (This sounds like a typical "the AI is strategically aware, and knows it is in a simulation" story, and it is. But note two things which are not always present in such stories: * First, there's a clear reason for the AI to at least consider the hypothesis that it's in a simulation: by assumption, it has an internal model of the training process, and the training process includes simulating the AI, so the AI has an internal model of itself-in-a-simulation as part of the training process. * Second, the AI's cognition doesn't involve any explicit deception, or even any non-myopia; this story all goes through

I agree that AIs only optimizing for good human ratings on the episode (what I call "reward-on-the-episode seekers") have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on "non-schemers with schemer-like traits"). However, I also think that reward-on-the-episode seekers are also substantially less scary than schemers in my sense, for reasons I discuss here (i.e.... (read more)

ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that. Here is Linch's attempted summary of this post, which I largely agree with.

Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've...

(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

My reaction to this is that: Actually, current LLMs do care about out preferences... (read more)

Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.


Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?

(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)

And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?

(Modulo, e.g., the fact that it can play chess pretty...

4Daniel Kokotajlo1d
I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?


  • I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.
  • Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.
  • Nate writes: 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an

... (read more)
16Alex Turner3d
This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don't see a reason to buy into this particular chain of reasoning.)  AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.[1]  I have a more precise prediction: Conditional on that, I predict with 85% confidence that it's possible to do this with AIs which are basically as tool-like as GPT-4. I don't know how to operationalize that in a way you'd agree to. (I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won't update.) 1. ^ I expect most of real-world "agency" to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop for the LLM), and for that agency to not come from the LLM itself. 
6Daniel Kokotajlo1d
The thing people seem to be disagreeing about is the thing you haven't operationalized--the "and it'll still be basically as tool-like as GPT4" bit. What does that mean and how do we measure it? 
Load More