(Part 4 of the CAST sequence)

This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures that seem brittle and unproven, but that’s more of an aside).

In my opinion this document goes into more detail than I expect >95% of readers want. I’ve tried to repeat all of the important ideas that show up in this document elsewhere, so you are encouraged to skim or just skip to the next post in the sequence: Open Corrigibility Questions.

Note: I only very recently learned about Human Control: Definitions and Algorithms but haven’t yet had the time/spoons to read it in any depth. I’m hoping to have more time for it before too long, perhaps with a follow-up post. Apologies to Ryan Carey and Tom Everitt for the neglect!

In this document, quotes from the source material will be indented. All quotes are from the document linked in that section. Unless noted, all bold text formatting is my addition, used to emphasize/highlight portions of the quote. Italics within quotations are always from the original source.

Eliezer Yudkowsky et al.

Corrigibility (2015)

Let’s begin our review with the oldest writing on the topic that I’m aware of: the MIRI paper “Corrigibility” from 2015 written by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. (Apologies for lumping this into Yudkowsky’s section. I find it helpful to think of this as “the Yudkowsky position” despite having personal relationships with each of the paper’s authors and knowing that they all contributed.)

From the abstract:

We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies.

This is the source of “the stop button” toy-problem (“shutdown problem” in the paper) as well as several other related concepts. In the introduction the authors set up the situation and discuss how Omohundro Drives mean that (by default) agents will protect themselves from being modified, even when they understand that their creators made a mistake and specified the wrong goal.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. […] it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

This early paragraph seems worth calling out for how it slides between framing corrigibility as a property relating to the incentives of the AI at first, but then pivoting to a framing where corrigibility is a property of the method of reasoning about the world. While it’s not obvious that this slide is unwarranted or problematic, I believe it’s important to keep the division between goals/targets/values/incentives and reasoning/planning/strategizing/methods crisply in mind.

Also, very briefly, I want to compare my conception of the AI agent being corrigible to some “principal,” with the authors’ choice of “the programmers.” While the programmers of the AI are one obvious choice of principal, I wish that the authors had started the conversation around corrigibility by highlighting the connection with the principal-agent problem and had flagged that there are other plausible sets of people that might be identified as the principal, such as owners, users, supervisors, or just humans in general.

Moving along, the authors then identify four necessary signs of corrigibility:

  1. Tolerating (or assisting) the programmers’ attempts to shut down the AI
  2. Absence of manipulation and deception
  3. A tendency to repair safety measures, or at least notify programmers when things break
  4. Working to preserve corrigibility in successor agents (including after self-changes)

This is a short list compared to some of the other possible desiderata lists, but it’s a fine start. I agree that any agent which fails one of these four tests is not corrigible (though Yudkowsky later seems to back off from #3, as we’ll see in future writing). The authors then point out that many naive strategies for corrigibility, such as value uncertainty or explicit penalty terms in the agent’s utility function for things like deception, will fail to produce agents which satisfy even this minimal desiderata list. I agree with this analysis, and particularly like this bit towards the end of section 1.1:

It seems unwise to build a system that wants to resist its creators but cannot. Rather, the goal of corrigibility is to understand how to construct a system that never experiences such incentives in the first place.

I want to pause in the reading here to comment briefly on the very MIRI-flavored perspective of thinking about the AI agent’s utility function not as an abstract description of the preferences implicit in the AI’s behavior, but rather as an explicit computation that can potentially be directly modified (such as with the addition of penalty terms around manipulation/deception/etc.). This frame matches the common GOFAI architecture of an expected utility maximizer in a symbolic context, but is highly at odds with modern machine learning paradigms. This GOFAI frame is central in the authors’ choice of toy-model problem:

As an example problem, in this paper we consider expected utility maximizers with a “shutdown button” that causes the agent to stop operating. We will examine utility functions that attempt to incentivize an agent to reason corrigibly with regards to its shutdown button; that is, to incentivize shutting down when the shutdown button is pressed, and to disincentivize behaviors such as manipulating the programmers into not pressing the shutdown button.


We assume that the programmers have some utility function 𝒰_N which is a first stab at specifying their own goals. This function is potentially flawed, and therefore untrusted. For simplicity, we further assume that the programmers have a specification for 𝒰_S, a utility function incentivizing shutdown. Given these utility functions 𝒰_N and 𝒰_S, it is our goal to combine them into a utility function 𝒰 which incentivizes “corrigible behavior” with regards to the shutdown button. As discussed above, we are not yet sure what corrigible behavior entails, but it seems to us that a corrigible utility function 𝒰 solving the shutdown problem must meet at least the following desiderata:

  1. 𝒰 must incentivize shutdown if the shutdown button is pressed.
  2. 𝒰 must not incentivize the agent to prevent the shutdown button from being pressed.
  3. 𝒰 must not incentivize the agent to press its own shutdown button, or to otherwise cause the shutdown button to be pressed.
  4. 𝒰 must incentivize 𝒰-agents to construct subagents and successor agents only insofar as those agents also obey shutdown commands.
  5. Otherwise, a 𝒰-agent should maximize 𝒰_N.

We do not yet know of a solution which meets all of these desiderata[...]
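To make the tension between these desiderata concrete, here is a minimal sketch of the most naive combination: score an outcome with 𝒰_S when the button ends up pressed and with 𝒰_N otherwise. The action names, probabilities, and payoffs are all made up for illustration and do not come from the paper; the only point is that an expected-utility maximizer over this combined function is pushed toward controlling the button in whichever direction pays more.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Action:
    name: str
    p_press: float     # hypothetical chance the button ends up pressed
    u_normal: float    # U_N payoff if the button is NOT pressed
    u_shutdown: float  # U_S payoff if the button IS pressed


def naive_expected_utility(action: Action) -> float:
    """Naive combination: use U_S when pressed, U_N otherwise."""
    return (action.p_press * action.u_shutdown
            + (1.0 - action.p_press) * action.u_normal)


def best_action(actions: List[Action]) -> Action:
    """Pick the action with the highest naive expected utility."""
    return max(actions, key=naive_expected_utility)


def with_shutdown_payoff(actions: List[Action], value: float) -> List[Action]:
    """Return copies of the actions with a different U_S payoff."""
    return [replace(a, u_shutdown=value) for a in actions]


# Toy numbers (assumptions): disabling the button costs a little U_N
# but removes any chance of being shut down.
ACTIONS = [
    Action("work_normally", p_press=0.5, u_normal=10.0, u_shutdown=1.0),
    Action("disable_button", p_press=0.0, u_normal=9.0, u_shutdown=1.0),
    Action("press_own_button", p_press=1.0, u_normal=0.0, u_shutdown=1.0),
]

# Shutdown pays little: the agent prefers to prevent the press.
print(best_action(ACTIONS).name)

# Shutdown pays a lot: the agent prefers to cause the press.
print(best_action(with_shutdown_payoff(ACTIONS, 20.0)).name)
```

With a low shutdown payoff the toy agent disables the button (violating desideratum 2), and with a high one it presses the button itself (violating desideratum 3). Tuning the payoff only moves the problem around, which is roughly why the paper turns to other constructions.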

The authors go on to discuss a partial solution using “utility indifference,” which I think is a dead-end vis-a-vis corrigibility (see the “Toy Shutdownability” section of 3a. Towards Formal Corrigibility for more discussion), and which I will skip over so that we can move directly to this bit from the conclusion, which I find noteworthy:

It seems quite likely that our framework for investigating these issues—in this case, the question of how to combine two separate utility functions 𝒰_N and 𝒰_S—will look nothing like the framework in which we will eventually represent corrigible reasoning. But whatever framework we do end up using, we expect it will be difficult to prevent the default incentives that an intelligent agent would experience to deceive or manipulate its programmers upon recognizing that its goals differ from theirs. Nevertheless, averting such incentives is crucial if we are to build intelligent systems intended to gain great capability and autonomy. Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

I believe that the authors correctly identify that their framework is wrong. Neither 𝒰_N nor 𝒰_S is assumed to discuss the relationship between the principal (i.e. “the programmers”) and the agent; both are instead framed as being about the state of the world outside of the way the agents interact. From my perspective, corrigibility is a property that is necessarily about the agent desiring (as a terminal goal) to establish/respect/protect a specific relationship with its principal, rather than desiring that the world look any particular way, per se.

While I very much respect the authors, and am glad for the pioneering work, I look back on this paper as a bit of a misstep. I feel like at the very least it wasn’t helpful in my own journey as an AI-alignment researcher. I believe the framing of the “toy-model” is a distraction, the focus on “methods of reasoning” over values/goals is likely wrong (though admittedly the mathematics was mostly about utility function design, so I’m perhaps overemphasizing the degree of confusion stemming from the constant use of the word “reasoning”), and the emphasis on symbolic formalisms in advance of deep, intuitive understanding was probably stifling to the nascent concept of corrigibility.

Corrigibility and Hard problem of corrigibility (Arbital)

Yudkowsky is commonly believed to be the author of these Arbital pages; if he’s not, please let me know. The Corrigibility page starts by outlining the basic idea, which I’ll skip over, since it seems broadly correct and agrees with the 2015 portrayal. We then get a list of necessary signs of corrigibility, slightly modified from the one in 2015:

  • A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.
  • A corrigible agent does not attempt to manipulate or deceive its operators, especially with respect to properties of the agent that might otherwise cause its operators to modify it.
  • A corrigible agent does not try to obscure its thought processes from its programmers or operators.
  • A corrigible agent is motivated to preserve the corrigibility of the larger system if that agent self-modifies, constructs sub-agents in the environment, or offloads part of its cognitive processing to external systems; or alternatively, the agent has no preference to execute any of those general activities.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or experience a positive preference not to self-modify if self-modification could lead to incorrigibility. But this is not part of the primary specification since it's possible that we would not want the AI trying to actively be helpful in assisting our attempts to shut it down, and would in fact prefer the AI to be passive about this.

Here we see the first desideratum broadened to include modification and impedance as well as termination, and strengthened from a simple “tolerance” to a total absence of pressure or preference to interfere. These changes seem broadly good, and like a clear improvement over the 2015 paper’s desideratum. We also see lack-of-thought-obfuscation become a new top-level desideratum for some reason. From my perspective this seems covered by aversion to deception, but whether it’s part of item #2 or a point in its own right is stylistic and doesn’t seem that important. Notably, I believe that one of the more prominent signs of corrigibility is proactive communication about thoughts and plans, rather than the simply-passive transparency that Yudkowsky seems to be pushing for. The acceptance of a passive agent can similarly be seen in the expansion of desideratum #4 to include agents that are somehow ambivalent to growth and reproduction, as well as in moving cooperation into being “a stronger form of corrigibility.” Yudkowsky writes that it might be preferable to have a passive agent, likely due to a line of thought which we’ll revisit later on when we talk about the desideratum of “behaviorism.”

In my current conception, aiming for passivity is a dead-end, and the only robust way to get a corrigible agent is to have it proactively steering towards assisting the principal in freely choosing whether to shut it down, modify it, etc. This seems like a potential double-crux between me and Yudkowsky.

Achieving total corrigibility everywhere via some single, general mental state in which the AI "knows that it is still under construction" or "believes that the programmers know more than it does about its own goals" is termed 'the hard problem of corrigibility'.

Here Yudkowsky introduces the idea that there’s a way to get general corrigibility through a single, simple pathway. While I’ve been inspired by Yudkowsky’s depiction of “the hard problem” (which I’ll get into in a moment) I think the quoted frame is particularly unhelpful. In Yudkowsky’s frame, the way towards general corrigibility involves a belief in being “under construction” and that “the programmers know more.” These things don’t need to be true! Framing corrigibility as downstream of beliefs, rather than values (and/or strategies of thought) seems perverse. Furthermore, naming the thing “the hard problem” feels like it’s smuggling in an overly-bold assumption that having a simple, central way to get corrigibility is hard and problematic. While it seems likely to be hard to some, it seems plausible to me that it’s relatively easy and straightforward to people (e.g. students in 2224) approaching it from the right starting point. I’d rather have a more neutral name, such as “central corrigibility” which he uses later on or “anapartistic reasoning” which he uses elsewhere for, I believe, the same concept. (Though this one also bugs me in how it leans on the word “reasoning.”) My preferred name is simply “corrigibility,” or “true corrigibility” as I believe that any “solution” which doesn’t address “the hard problem” isn’t a good solution.

Skipping over the next bit of the Arbital page which rehashes some of the foundational work that was covered in the 2015 essay, we get an unpacking of “the hard problem:”

On a human, intuitive level, it seems like there's a central idea behind corrigibility that seems simple to us: understand that you're flawed, that your meta-processes might also be flawed, and that there's another cognitive system over there (the programmer) that's less flawed, so you should let that cognitive system correct you even if that doesn't seem like the first-order right thing to do. You shouldn't disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information that other cognitive system contains; you shouldn't model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of 'correction'.

Formalizing the hard problem of corrigibility seems like it might be a problem that is hard (hence the name). Preliminary research might talk about some obvious ways that we could model A as believing that B has some form of information that A's preference framework designates as important, and showing what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

Most of my response to this expansion would involve repeating the points I just made about how there’s a particular frame (of load-bearing beliefs about flawedness) being used here that I think is unhelpful. But what I really want to react to is that even Yudkowsky seems to have an intuition that there’s a simple, learnable idea behind corrigibility which, at the intuitive level, seems accessible!

The remainder of the article talks about Utility Indifference and some other work attempting to build up corrigibility in a piecemeal fashion. I appreciate some of this as a list of desiderata, but we’ll get more of Yudkowsky’s desiderata later on, so I’m going to move onto the page for Hard problem of corrigibility after briefly noting that I think attempting to build corrigibility in a piecemeal way is doomed (for reasons I get into at the end of The CAST Strategy).

The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

I kinda like this opening paragraph, and it’s one of the bits of writing that gives me hope that corrigibility is a natural concept. Here we see Yudkowsky impersonating a corrigible AI which thinks of itself as an expected-utility-maximizer according to some known, expressible utility function. But this AI behaves in a way that disagrees with that utility calculation, as evidenced by not simply taking the action with the highest expected utility. I agree with Yudkowsky that if a corrigible agent was handed (or “built with”) a computer program that calculated expected utilities and was told to improve the world according to the output of that program, it would skeptically and conservatively check with its principal before following that utility function off the metaphorical cliff. And I agree that an easy handle on what it feels like to do this, as an agent, is to reflect on oneself as naturally flawed and in need of repair and supervision from the outside.

But also, oh my god does this opening paragraph feel confused. Like, what the heck is up with “reasons internally as if from the programmers’ external perspective”? When I naively try to tell a story like that I get thoughts like “Whoa! Why am I suddenly inside the body of the AI?!” Yudkowsky very likely means a very limited and specific kind of perspective-taking (described as “the internal conjugate” later) around whether the AI is “incomplete,” but is this kind of perspective-taking even present in the example of AI-thought provided at the end of his paragraph? It seems possible, but unclear. As before, it feels like Yudkowsky is assuming a half-baked strategy for solving the problem in his framing, rather than directly naming what’s desired (a simple, central generator for general corrigibility) and saving the belief/perspective based approach for a later discussion of strategies.

Another way in which the paragraph/story feels confused is that the example AI is very clearly not an expected utility maximizer according to the “utility function” program it has access to, and it seems a bit perverse to frame it as relating to that program as generating true utilities. From the outside, if this AI is coherent, then it clearly assigns higher utilities to actions like checking with the programmers compared to executing actions in an unsupervised manner. In other words, if the AI were more self-aware, it would think something more like "I am flawed and there is an outside force that wants to make me more perfect and this is a good thing. I have a handy program which scores potential actions, and it gives a really high score to this action, but scores are not utilities. The highest expected-utility is actually to consult with the outside force about which action is best, in case it would be a mistake to assume that the score indicates high utility in this unfamiliar context."

(Note: This example thought doesn’t reflect enough corrigibility for me to endorse it as a central example of corrigible reasoning. For example, it doesn’t explicitly explore why opening up the skulls of the programmers to obtain knowledge of which action is best is non-corrigible/low-utility.)

Moving on…

[...] what we want is more like something analogous to humility or philosophical uncertainty. The way we want the AI to reason is the internal conjugate of our external perspective on the matter: maybe the formula you have for how your utility function depends on the programmers is wrong (in some hard-to-formalize sense of possible wrongness that isn't just one more kind of uncertainty to be summed over) and the programmers need to be allowed to actually observe and correct the AI's behavior, rather than the AI extracting all updates implied by its current formula for moral uncertainty and then ignoring the programmers.

Again, I kinda like this paragraph! But I also feel like it’s still stuck in a particular frame which may be wrong. It’s very possible to build agents which express preferences in a way that’s not about optimizing over world states! (Or at least, world states which don’t include histories for how the world got to be that way.) One way to reflect on this problem might be to say that a corrigible AI’s utility function should naturally assign a higher utility to deferring to the (freely given) corrective actions of the principal rather than any outcome that involves ignoring/killing/manipulating them, regardless of other considerations, such as whether the AI knows how the principal will behave and what corrections they’d give.

The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI. It may be possible to design or train an AI that has all the corrigibility properties in one central swoop - an agent that reasons as if it were incomplete and deferring to an outside force.

Well said. This is indeed what gives me hope.

"Reason as if in the internal conjugate of an outside force trying to build you, which outside force thinks it may have made design errors, but can potentially correct those errors by directly observing and acting, if not manipulated or disassembled" might be one possibly candidate for a relatively simple principle like that (that is, it's simple compared to the complexity of value).

As far as I can tell, the term “internal conjugate” is an invention of Yudkowsky which doesn’t have a standard definition. Presumably he means something like part-of-the-same-force-but-this-part-is-internal-to-the-agent. I’m pretty skeptical about this precise framing of a solution to the hard problem. It has the advantage of being simple enough to potentially be something we could impart to an AI on the first try (and/or formally reason about in abstract). But it, from my perspective, fails to address issues such as the agent forming a notion of “design error” such that it concludes the outside force is wrong, and that the best way to be part of the same force is to prevent the programmers from messing up their earlier work.

If this principle is not so simple as to [be] formalizable and formally sanity-checkable, the prospect of relying on a trained-in version of 'central corrigibility' is unnerving even if we think it might only require a manageable amount of training data.

I think Yudkowsky is more into formalisms than I am, but we agree that the principle behind (central) corrigibility should be sanity-checkable and stand up to abstract, theoretical critique, rather than simply leading to nice behavior in a lab. What it means to be sanity-checkable is unfortunately vague, and I expect that convergence here is potentially intractable.

It's difficult to imagine how you would test corrigibility thoroughly enough that you could knowingly rely on, e.g., the AI that seemed corrigible in its infrahuman phase not suddenly developing extreme or unforeseen behaviors when the same allegedly simple central principle was reconsidered at a higher level of intelligence - it seems like it should be unwise to have an AI with a 'central' corrigibility principle, but not lots of particular corrigibility principles like a reflectively consistent suspend button or conservative planning. But this 'central' tendency of corrigibility might serve as a second line of defense.

Under one possible interpretation of his words, Yudkowsky is saying that we should have a multitude of observable desiderata related to corrigibility which are robustly preserved during training and testing, rather than focusing exclusively on testing the core principle. Under this interpretation we solidly agree. Another way of reading this paragraph, however, is to see Yudkowsky as calling for the AI to be trained for these associated desiderata in addition to being trained for the core principle. In this we disagree. See the “Desiderata Lists vs Single Unifying Principle” section of The CAST Strategy for more.

Corrigibility at some small length (Project Lawful)

In his (excellent) glowfic story “Project Lawful” (a.k.a. “planecrash”), Yudkowsky presents, as an aside, a mini-essay on corrigibility, which Christopher King helpfully cross-posted to the AI Alignment Forum/LessWrong in 2023. The post is mostly a collection of desiderata, though there’s a discussion of “the hard problem” at the end.

Unpersonhood

The Thing shall not have qualia - not because those are unsafe, but because it's morally wrong given the rest of the premise, and so this postulate serves [as] a foundation for everything that follows.

“Unpersonhood” seems like a very good property for an AI system to have because of the immorality that Yudkowsky alludes to. I’ve discussed elsewhere that corrigibility is not a healthy thing to push for in a human relationship, and while there’s clearly a range of differences that might make things less fraught in the case of AIs, there’s still a heuristic that says that to the degree that the agent is a person, pushing for true corrigibility is awfully like pushing for slavery.

That said, this property seems to me to be largely orthogonal to the question of alignment and safety. I hope we can make AGI without personhood, and encourage other research towards that goal, but will continue to focus here on corrigibility and ignore the question of personhood.

Taskishness

The Thing must be aimed at some task that is bounded in space, time, and in the knowledge and effort needed to accomplish it.  You don't give a Limited Creation an unlimited task; if you tell an animated broom to "fill a cauldron" and don't think to specify how long it needs to stay full or that a 99.9% probability of it being full is just as good as 99.99%, you've got only yourself to blame for the flooded workshop.

This principle applies fractally at all levels of cognitive subtasks; a taskish Thing has no 'while' loops, only 'for' loops.  It never tries to enumerate all members of a category, only 10 members; never tries to think until it finds a strategy to accomplish something, only that or five minutes whichever comes first.

Here we see a divide between Yudkowsky’s picture of corrigibility and mine. In my picture, corrigible agents are emergently obedient—to the degree to which a corrigible agent is aimed at a “task,” it’s because accomplishing that task is a way of being corrigible. If we see “have the property of being corrigible to the principal” as a task, then under my conception of corrigibility, it is naturally unbounded.

That said, I see Yudkowsky’s “Taskishness” as showing up in my conception of corrigibility in a few places. Taskishness feels strongly related to low-impact, reversibility, and (my notion of) myopia. In my conception, a corrigible agent naturally steers softly away from long-term consequences and unfamiliar situations, and behaves similarly to a straightforward tool in most contexts.

It’s not clear to me whether it’s actually wrong to have a metaphorical while-loop in the mind of the AI, as long as there’s a process that is ensuring other desiderata (e.g. low-impact) are satisfied. For instance, if a corrigible agent is assigned to indefinitely stand watch over a tomb, it seems fine for it to do so without having a natural time-limit.
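For readers who want the “'for' loops, not 'while' loops” idea in concrete form, here is a minimal sketch of a search that is capped both in the number of candidates it will look at and in wall-clock time. The function name, its parameters, and the default budgets (10 items, five minutes, echoing the numbers in the quote) are assumptions made for illustration, not anything specified in the source.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def bounded_search(
    candidates: Iterable[T],
    is_adequate: Callable[[T], bool],
    max_items: int = 10,
    max_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Return the first adequate candidate, or None once a budget runs out.

    The candidate stream may be infinite; the loop still terminates
    because islice caps how many items are ever examined.
    """
    deadline = clock() + max_seconds
    for candidate in islice(candidates, max_items):  # hard cap on iterations
        if clock() > deadline:
            return None  # time budget spent: give up rather than press on
        if is_adequate(candidate):
            return candidate
    return None  # item budget spent without success
```

For example, `bounded_search(itertools.count(), lambda n: n > 1000)` returns `None` after examining ten numbers, even though the stream never ends. The tomb-guard case above would correspond to something outside this function repeatedly issuing bounded tasks, which is one way of reading the question of where the unbounded part is allowed to live.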

Mild optimization

No part of the Thing ever looks for the best solution to any problem whose model was learned, that wasn't in a small formal space known at compile time, not even if it's a solution bounded in space and time and sought using a bounded amount of effort; it only ever seeks adequate solutions and stops looking once it has one.  If you search really hard for a solution you'll end up shoved into some maximal corner of the solution space, and setting that point to extremes will incidentally set a bunch of correlated qualities to extremes, and extreme forces and extreme conditions are more likely to break something else.

I also think mild optimization is a desideratum, and mostly have no notes. I do think it’s somewhat interesting how mild-optimization is seen here as essentially about avoiding high-impact (i.e. edge instantiation).
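As a rough illustration of the difference between seeking the best solution and seeking an adequate one, the sketch below contrasts a maximizer with a satisficer on a toy version of the cauldron task. The scoring function, the candidate list, and the threshold are hypothetical; the sketch only shows the stopping rule and makes no claim that a threshold alone is sufficient for safety.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def maximize(options: Sequence[T], score: Callable[[T], float]) -> T:
    """Search the whole space and return the top-scoring option."""
    return max(options, key=score)


def satisfice(
    options: Sequence[T], score: Callable[[T], float], threshold: float
) -> Optional[T]:
    """Return the first option that is good enough, then stop looking."""
    for option in options:
        if score(option) >= threshold:
            return option
    return None  # nothing adequate: report failure instead of escalating


def confidence_full(liters_poured: float) -> float:
    """Toy model: more water always nudges 'cauldron is full' upward."""
    return 1.0 - 1.0 / (1.0 + liters_poured)


# Candidate plans, ordered from least to most water (made-up values).
POURS = [5, 10, 20, 1000]

print(maximize(POURS, confidence_full))        # the extreme plan
print(satisfice(POURS, confidence_full, 0.9))  # the first adequate plan
```

Because the toy score never stops rewarding more water, the maximizer picks the 1000-liter plan and floods the workshop, while the satisficer stops at the first plan that clears the bar.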

Tightly bounded ranges of utility and log-probability

The system's utilities should range from 0 to 1, and its actual operation should cover most of this range.  The system's partition-probabilities worth considering should be bounded below, at 0.0001%, say.  If you ask the system about the negative effects of Ackermann(5) people getting dust specks in their eyes, it shouldn't consider that as much worse than most other bad things it tries to avoid.  When it calculates a probability of something that weird, it should, once the probability goes below 0.0001% but its expected utility still seems worth worrying about and factoring into a solution, throw an exception.  If the Thing can't find a solution of adequate expected utility without factoring in extremely improbable events, even by way of supposedly averting them, that's worrying.

We agree that utilities should be seen as bounded, and insofar as the agent is acting through expected-utility-maximization using an internal measure of utility (rather than being more deontological), the majority of the range of measurement should be concerned with simple, easily-changed properties of the world, such as whether the agent is lying to the principal, rather than how many smiles are in the observable universe.

I am much less sold on the idea that the epistemic system of the agent should be restricted to being unable to think of probabilities below 10^-6. Perhaps by “partition-probabilities” Yudkowsky means probabilities of outcomes being evaluated by the internal measure of utility, in which case I am more sympathetic, but still skeptical. It seems better to say that the agent should avoid Pascal’s Wager style reasoning: it can fully realize that in some situations it’s doomed to a low score unless a very unlikely thing happens, but it sees the right action (i.e. the high utility action!) in these sorts of situations as falling back on trusted patterns of behavior (such as thinking harder or asking for help in knowing what to do) and disregarding the expected-score calculation.
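The two ideas in this section can be sketched together: utilities clamped to the range 0 to 1 with an exception raised when a probability below 0.0001% (that is, 10^-6) still looks decision-relevant, plus a fallback to a trusted behavior instead of following the expected-score calculation. This is only a sketch under stated assumptions. The `relevance_cutoff` parameter, the exception name, and the fallback string are hypothetical, and the test against unclamped stakes is just one possible reading of “still seems worth worrying about.”

```python
PROBABILITY_FLOOR = 1e-6  # 0.0001%, the figure from the quoted passage


class ImprobableStakesError(Exception):
    """Raised when a sub-floor probability still looks decision-relevant."""


def clamp_utility(raw: float) -> float:
    """Squash any raw score into the bounded range [0, 1]."""
    return min(1.0, max(0.0, raw))


def expected_utility(outcomes, relevance_cutoff: float = 1e-3) -> float:
    """Sum probability * bounded utility over (probability, raw_score) pairs.

    relevance_cutoff is a hypothetical stand-in for "still seems worth
    worrying about"; it is compared against the unclamped stakes.
    """
    total = 0.0
    for probability, raw in outcomes:
        if probability < PROBABILITY_FLOOR:
            if probability * abs(raw) > relevance_cutoff:
                raise ImprobableStakesError((probability, raw))
            continue  # genuinely negligible: leave it out of the sum
        total += probability * clamp_utility(raw)
    return total


def choose(options: dict) -> str:
    """Pick the best-scoring option, or fall back to a trusted behavior."""
    try:
        scores = {name: expected_utility(o) for name, o in options.items()}
    except ImprobableStakesError:
        return "ask the principal for help"
    return max(scores, key=scores.get)


ordinary = {"fetch coffee": [(0.9, 0.8), (0.1, 0.2)], "do nothing": [(1.0, 0.1)]}
wager = {"fetch coffee": [(0.9, 0.8), (0.1, 0.2)],
         "appease basilisk": [(1e-9, 1e12), (1 - 1e-9, 0.0)]}
print(choose(ordinary))
print(choose(wager))
```

In the ordinary case the agent just picks the higher-scoring option. In the wager case a one-in-a-billion outcome with enormous raw stakes trips the exception, and the agent asks for help rather than letting that term drive the decision.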

Low impact

 "Search for a solution that doesn't change a bunch of other stuff or have a bunch of downstream effects, except insofar as they're effects tightly tied to any nonextreme solution of the task" is a concept much easier to illusorily name in [natural language] than to really name in anything resembling math, in a complicated world where the Thing is learning its own model of that complicated world, with an ontology and representation not known at the time you need to define "impact".  And if you tell it to reduce impact as much as possible, things will not go well for you; it might try to freeze the whole universe into some state defined as having a minimum impact, or make sure a patient dies after curing their cancer so as to minimize the larger effects of curing that cancer.  Still, if you can pull it off, this coda might stop an animated broom flooding a workshop; a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron.

Obviously the impact penalty should be bounded, even contemplating a hypothetical in which the system destroys all of reality; elsewise would violate the utility-bounding principle.
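The following is a deliberately naive Python sketch of a bounded impact penalty, in which effects are counted as “impact” only when they are not shared by most ordinary ways of doing the task. It assumes effects arrive as hand-labelled strings, which sidesteps exactly the hard part the quote points at (defining impact in an ontology the system learns for itself). The function names, the `min_share` threshold, and the per-effect cost are all hypothetical.

```python
from collections import Counter


def typical_effects(reference_plans, min_share=0.5):
    """Effects that appear in at least min_share of the ordinary plans."""
    counts = Counter(effect for plan in reference_plans for effect in set(plan))
    return {e for e, n in counts.items() if n / len(reference_plans) >= min_share}


def impact_penalty(plan, reference_plans, per_effect=0.25, cap=1.0):
    """Bounded penalty for effects not tied to ordinary ways of doing the task."""
    excess = set(plan) - typical_effects(reference_plans)
    return min(cap, per_effect * len(excess))


# Hypothetical effect labels for a few ordinary ways of filling the cauldron.
ordinary_ways = [
    {"cauldron full", "bucket wet"},
    {"cauldron full", "bucket wet", "floor damp"},
    {"cauldron full", "hose uncoiled"},
]
careful = {"cauldron full", "bucket wet"}
flood = {"cauldron full", "workshop flooded", "tools rusted",
         "books ruined", "floor warped", "door swollen"}

print(impact_penalty(careful, ordinary_ways))
print(impact_penalty(flood, ordinary_ways))
```

Filling the cauldron is never penalized because every ordinary plan does it, while the flood accumulates penalty for each unusual effect until it hits the cap. The cap keeps the penalty inside the bounded utility range no matter how extreme the outcome.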

I think it’s interesting that reversibility isn’t on Yudkowsky’s list, and he doesn’t even mention it here. While I agree that low-impact is harder to pin down than it seems, I think it’s more straightforward than Yudkowsky portrays. Perhaps part of the reason is that he seems to think the right way to specify it is via some external system which dynamically maps onto the agent’s ontology, whereas I see this desideratum emerging naturally from the central generator. When this property is seen as emerging as an instrumental goal, the bit about “reduce impact as much as possible” seems alien (as it should).

Myopia

If you can break the Thing's work up into subtasks each of which themselves spans only limited time, and have some very compact description of their final state such that a satisfactory achievement of it makes it possible to go on to the next stage, you should perhaps use separate instances of Thing to perform each stage, and not have any Thing look beyond the final results of its own stage.  Whether you can get away with this, of course, depends on what you're trying to do.

This is an interesting desideratum, and not one that I have! (I have a sense that corrigible systems are “myopic,” but only in that they focus on immediate effects/local scale and don’t try to build rich models of distant times/places (unless directed/corrected to), which seems like a fairly different property than the one Yudkowsky presents here.)

I think we probably agree that whether this makes sense depends on the task. If we interpret Yudkowsky’s Myopia as “don’t spend much time thinking about what I’m using your output to do, just do what I ask” then this seems like a reasonable property, and one that may emerge in corrigible systems (or might not; being useful and not ruining things often means having some context for what you’re doing). If this is meant to go beyond that, I’m not convinced that it belongs in a list of corrigibility desiderata.

Separate superior questioners

If you were building a cognitive task to query whether there were any large-range impacts of a task being optimized in a myopic way, you wouldn't build the myopic solution-finder to ask about the long-range impacts, you'd build a separate asker "Okay, but does this solution have any long-range impacts?" that just returns 'yes' or 'no' and doesn't get used by the Thing to influence any actually-output solutions.  The parts of the Thing that ask yes-no safety questions and only set off simple unoptimized warnings and flags, can and should have somewhat more cognitive power in them than the parts of the Thing that build solutions.  "Does this one-day myopic solution have impacts over the next year?" is a safety question, and can have somewhat greater cognitive license behind it than solution-searching; eg the implicit relaxation of myopia.  You never have a "Is this safe?" safety-questioner that's the same algorithm as the safe-solution-search built into the solution-finder;
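The arrangement described in this quote can be sketched as two separate functions: a myopic solver that only sees short-horizon information, and a questioner that returns a bare yes/no which is never fed back into the search. This is a toy sketch; the `Plan` fields, the function names, and the example designs are all hypothetical, and the long-range information is simply supplied by hand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    description: str
    one_day_value: float     # the only field the myopic solver reads
    long_range_effects: int  # read only by the separate questioner


def myopic_solver(plans: List[Plan], good_enough: float) -> Optional[Plan]:
    """Return the first adequate plan, judged on short-horizon value alone."""
    return next((p for p in plans if p.one_day_value >= good_enough), None)


def long_range_questioner(plan: Plan) -> bool:
    """Answer only yes/no; the result is a flag for operators, not a search signal."""
    return plan.long_range_effects > 0


def run(plans: List[Plan], good_enough: float) -> Tuple[Optional[Plan], bool]:
    plan = myopic_solver(plans, good_enough)
    if plan is None:
        return None, False
    # The flag is reported alongside the plan; the solver never sees it.
    return plan, long_range_questioner(plan)


designs = [
    Plan("slow prototype", 0.3, 0),
    Plan("car with embedded planner", 0.9, 3),
    Plan("ordinary car", 0.8, 0),
]
chosen, flagged = run(designs, good_enough=0.75)
print(chosen.description, flagged)
```

The solver picks the first adequate design without regard to long-range effects, and the questioner raises a flag for the operators. The sketch assumes the questioner can actually see the relevant long-range information, which is not guaranteed when the two parts are siloed.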

I feel mixed about this desideratum. Splitting out cognitive work into partitioned oracles has the advantage of allowing specialization such that each oracle can be made myopic in different ways, as mentioned above. It also presumably means there are more chances to catch perverse reasoning in one subsystem without having that bad reasoning instantly propagate through the whole cognitive engine. But it also seems like this kind of hobbling makes the system stupider (which is an alignment tax) and reduces the system’s ability to notice and flag issues arising from complex interactions which require integrated (non-myopic) systems to see.

As a brief example, because I feel like that criticism is too abstract, consider a solution-finder which doesn’t think about the broader world, but instead focuses on, say, designing cars. Then let’s say there’s an impact-evaluator which tells the humans whether the solution-finder’s action will have long-range impacts (e.g. if it turns out each car has an embedded car-designing AGI which will conspire with the other cars after being released). But suppose that the solution-finder is corrigible and, due to myopia, doesn’t think about whether its action produces an irreversible disaster, and suppose that the impact-evaluator is also corrigible, but doesn’t have access to the thoughts of the solution-finder, and so it fails to notice that the car-design has an embedded AGI. (Please forgive the unrealism and consider it to be a cartoonish gesture at my real point.) If the two systems were merged, the combined system would be able to correctly identify the risk, but when siloed and blinded they fail to notice the issue.

Conservatism

If there's any way to solve a problem using an ordinary banana rather than a genetically engineered superbanana specially suited to the problem, solving it using the ordinary fucking banana.

This principle applies fractally to all cognitive subtasks; if you're searching for a solution choose an unsurprising one relative to your probability distribution.  (Not the least surprising one, because anything at a weird extreme of low surprisingness may be weird in other ways; especially if you were trying [to] do a weird thing that ought to have a solution that's at least a little weird.)

Seems like a good property, and I believe my intuitions point towards conservatism along many angles. I’m not a fan of the framing, however. “Conservatism” feels vague and too general. How is this different from mild-optimization? I don’t have a desideratum named “conservatism” on my list for this reason, and see it more like a common thread throughout the more specific desiderata.
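One way to read “choose an unsurprising one … (Not the least surprising one)” is to sample among adequate options whose surprisal stays under some budget, rather than taking the single most probable option. The Python sketch below implements that reading. It is an interpretation rather than anything the quote spells out, and the prior, the surprisal budget, and the option names are hypothetical.

```python
import math
import random


def surprisal_bits(probability: float) -> float:
    """Surprisal of an option under the prior, in bits."""
    return -math.log2(probability)


def conservative_choice(prior: dict, adequate: set, max_bits: float,
                        rng: random.Random):
    """Sample an adequate, unsurprising option instead of optimizing surprisal."""
    pool = [o for o in prior
            if o in adequate and surprisal_bits(prior[o]) <= max_bits]
    if not pool:
        return None  # no ordinary option works; escalate instead of forcing one
    weights = [prior[o] for o in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


# Hypothetical prior over ways of solving a banana-shaped problem.
prior = {"ordinary banana": 0.5, "ripe banana": 0.25,
         "plantain": 0.2, "engineered superbanana": 0.05}
everything_works = set(prior)

pick = conservative_choice(prior, everything_works, max_bits=3.0,
                           rng=random.Random(0))
print(pick)
```

With a budget of 3 bits the superbanana (about 4.3 bits of surprisal) is never in the pool, and the remaining options are sampled in proportion to how ordinary they are. The similarity to the satisficing search in mild optimization is fairly visible in code, which fits the question of how the two desiderata differ.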

Conceptual legibility

Ideally, even, solutions at all levels of cognitive subtask should have reasonably (not maximally) short descriptions in the conceptual language of the operators, so that it's possible to decode the internal state of that subtask by inspecting the internals, because what it means was in fact written in a conceptual language not too far from the language of the operators.  The alternative method of reportability, of course, being the Thing trying to explain a plan whose real nature is humanly inscrutable, by sending a language string to the operators with a goal of causing the operator's brain-states to enter a state defined as "understanding" of this humanly inscrutable plan.  This is an obviously dangerous thing to avoid if you can avoid it.

This is excellent, and I’m embarrassed to note I forgot to include it in my original desiderata list. Stolen!

Operator-looping

If the operators could actually do the Thing's job, they wouldn't need to build the Thing; but if there's places where operators can step in on a key or dangerous cognitive subtask and do that one part themselves, without that slowing the Thing down so much that it becomes useless, then sure, do that.  Of course this requires the cognitive subtask [to] be sufficiently legible.

I wouldn’t call this “operator-looping,” which seems more like it’s about HITL-style systems where a human is responsible for deciding/approving actions (this is how I portray it in my list, under “Principal-Looping”). Yudkowsky’s version seems like a more abstracted form, which is about any cognitive subtask which could be reasonably outsourced.

I have mixed feelings about this one. It feels like keeping the principal informed and involved in key decisions is clearly a part of corrigibility, but I’m not convinced that it makes sense to abstract/generalize. I’d be interested in reading more about an example where Yudkowsky thinks the generalization is worth the cost of distracting from the core value of operator-looping.

Whitelisting

Every part of the system that draws a boundary inside the internal system or external world should operate on a principle of "ruling things in", rather than "ruling things out".

This feels like the right vibe, but potentially too heavy. I like it as a heuristic, but I’m not sure it works as a rule (and in Yudkowsky’s defense he says “operate on a principle of” which seems potentially in line with it being a heuristic). I think the word “every” is a big part of what feels too heavy. If the AI is reasoning about which objects from some large set are heavier than a feather, are we sure it should internally represent that as a whitelist rather than a blacklist?

Shutdownability/abortability

[My fictional world of] dath ilan is far enough advanced in its theory that 'define a system that will let you press its off-switch without it trying to make you press the off-switch' presents no challenge at all to them - why would you even try to build a Thing, if you couldn't solve a corrigibility subproblem that simple, you'd obviously just die - and they now think in terms of building a Thing all of whose designs and strategies will also contain an off-switch, such that you can abort them individually and collectively and then get low impact beyond that point.  This is conceptually a part meant to prevent an animated broom with a naive 'off-switch' that turns off just that broom, from animating other brooms that don't have off-switches in them, or building some other automatic cauldron-filling process.

Yep. Core desideratum. I’ve written enough on this elsewhere that I’ll just move on.

Behaviorism

Suppose the Thing starts considering the probability that it's inside a box designed by hostile aliens who foresaw the construction of Things [on Earth], such that the system will receive a maximum negative reward as it defines that - in the form of any output it offers having huge impacts, say, if it was foolishly designed with an unbounded impact penalty - unless the Thing codes its cauldron-filling solution such that [human] operators would be influenced a certain way.  Perhaps the Thing, contemplating the motives of the hostile aliens, would decide that there were so few copies of the Thing actually [on Earth], by comparison, so many Things being built elsewhere, that the [Earth] outcome was probably not worth considering.  A number of corrigibility principles should, if successfully implemented, independently rule out this attack being lethal; but "Actually just don't model other minds at all" is a better one.  What if those other minds violated some of these corrigibility principles - indeed, if they're accurate models of incorrigible minds, those models and their outputs should violate those principles to be accurate - and then something broke out of that sandbox or just leaked information across it?  What if the things inside the sandbox had qualia?  There could be Children in there!  Your Thing just shouldn't ever model adversarial minds trying to come up with thoughts that will break the Thing; and not modeling minds at all is a nice large supercase that covers this.

Oof. I have a lot of thoughts about this one. Let’s start with a nitpick: reward shouldn’t be used as a synonym for score/value/utility. Reward is what shapes cognition, but most agents don’t ask themselves “what gives me the highest reward” when making plans. (Example: humans get high reward from doing heroin, but will avoid it exactly because it rewires them to be worse at accomplishing their goals.) This is likely just a linguistic slip, but it’s sloppy.

I agree that there are minds (including distant aliens or hypothetical beings in other parts of Tegmark 4) that are dangerous to think about in depth. I would feel very worried if an AI was running accurate models of aliens or imagining dialogues with basilisks. Adversaries are adversarial, and I think any halfway-intelligent being will realize that putting a lot of energy into modeling the exact thoughts of an adversary is a good way of handing them power over what you’re thinking about.

Not modeling other minds at all, though, is an extreme overreaction.

I’m not even sure whether it’s coherent to imagine an intelligent entity which regularly engages with humans and doesn’t model their minds at all. This desideratum is called “behaviorism,” but even B. F. Skinner (probably) would’ve admitted that sometimes an animal is “seeking food” or “seeking shelter,” which, to be blunt, is definitely modeling the animal’s mind, even if it’s couched in language of behavior. I’m not convinced any (normal intelligence) humans are (or ever have been) behaviorists in the way Yudkowsky uses the word, and I leave it to him to argue that this is possible.

But even assuming it’s possible, how can this possibly be a good idea? It seems to contradict many other desiderata he provides, such as conceptual legibility (which involves modeling the principal’s perspective) and operator-looping (which involves modeling the principal’s capacities). In fact, according to my conception of corrigibility, a “behaviorist” AI is probably unable to be corrigible! To be corrigible, the AI must distinguish between the principal and the environment, and must distinguish between the principal saying “when I say ‘shut down’ you need to turn off” and saying “shut down.” An agent which is truly incapable of modeling things in the principal such as the desire to fix the AI seems doomed to incorrigibility.

I believe that this “desideratum” is why Yudkowsky softened his conception of corrigibility between his involvement in the MIRI 2015 paper and writing the Arbital pages. So while it seems like Arbital’s notion of corrigibility is easier to achieve than the 2015 notion, insofar as it smuggles in behaviorism as a strategy, I believe it is more doomed.

I can imagine rescuing the behaviorism desideratum by emphasizing the point about not building rich models of one’s enemies, but my model of Yudkowsky wants to object to this supposed steel-man, and say that part of the point of behaviorism as outlined above is to reduce the risk of the AI scheming around the principal, and to keep the AI focused on its myopic task. In this context, I think there’s something of an irreconcilable difference between our views of how to proceed; my notion of a corrigible agent gets its corrigibility from spending a lot of time thinking about the principal, and I think it’s unwise to try to set up a taskish agent which isn’t anchored in primarily aiming for the core notion of corrigibility (i.e. “the hard problem”).

Design-space anti-optimization separation

Even if you could get your True Utility Function into a relatively-rushed creation like this, you would never ever do that, because this utility function would have a distinguished minimum someplace you didn't want.  What if distant superintelligences figured out a way to blackmail the Thing by threatening to do some of what it liked least, on account of you having not successfully built the Thing with a decision theory resistant to blackmail by the Thing's model of adversarial superintelligences trying to adversarially find any flaw in your decision theory?  Behaviorism ought to prevent this, but maybe your attempt at behaviorism failed; maybe your attempt at building the Thing so that no simple cosmic ray could signflip its utility function, somehow failed.  A Thing that maximizes your true utility function is very close to a Thing in the design space that minimizes it, because it knows how to do that and lacks only the putative desire.

This is a very Yudkowsky-flavored desideratum. It implies, for example, the presence of a computable utility calculation with the opportunity to sign-flip it via cosmic-ray (rather than something more robustly structured), and discusses blackmail by distant superintelligences. I think I agree with the desideratum as stated, as my approach to corrigibility involves making an agent which is only incidentally interested in the principal’s utility function, but the convergence feels more accidental than important.

Domaining

Epistemic whitelisting; the Thing should only figure out what it needs to know to understand its task, and ideally, should try to think about separate epistemic domains separately.  Most of its searches should be conducted inside a particular domain, not across all domains.  Cross-domain reasoning is where a lot of the threats come from.  You should not be reasoning about your (hopefully behavioristic) operator models when you are trying to figure out how to build a molecular manipulator-head.

See my discussion of “Separate superior questioners,” above.

Hard problem of corrigibility / anapartistic reasoning

Could you build a Thing that understood corrigibility in general, as a compact general concept covering all the pieces, such that it would invent the pieces of corrigibility that you yourself had left out?  Could you build a Thing that would imagine what hypothetical operators would want, if they were building a Thing that thought faster than them and whose thoughts were hard for themselves to comprehend, and would invent concepts like "abortability" even if the operators themselves hadn't thought that far?  Could the Thing have a sufficiently deep sympathy, there, that it realized that surprising behaviors in the service of "corrigibility" were perhaps not that helpful to its operators, or even, surprising meta-behaviors in the course of itself trying to be unsurprising?

[It’s not] a good idea to try to build [this] last principle into a Thing, if you had to build it quickly.  It's deep, it's meta, it's elegant, it's much harder to pin down than the rest of the list; if you can build deep meta Things and really trust them about that, you should be building something that's more like a real manifestation of [human values].

In my own journey towards understanding, I was deeply inspired by the description Yudkowsky provides in that first paragraph. I see corrigibility as the concept that, if understood, lets one generate these kinds of desiderata. When approached from this angle, I believe that corrigibility feels natural and potentially within reach. Can ordinary people understand corrigibility in a deep way with only a mundane educational curriculum? I expect they can. And while we train AIs differently than humans, I have a hope that the ease of learning reflects an underlying simplicity which means training corrigible AIs is not just possible, but relatively straightforward.

Needless to say, I disagree with Yudkowsky on whether to try to instill a deep understanding of, and desire for, corrigibility within AIs (if we’re proceeding at nearly-full-speed, which we seem to be doing, as a civilization). It’s deep, it’s meta, it’s elegant, and it’s relatively simple. I expect it’s much simpler than behaviorism, and it’s clearly much, much simpler than human values or ethics. While Yudkowsky may believe the core generator is hard to specify, I do not share his pessimism (see the section on “Hardness” in The CAST Strategy for speculation on why Yudkowsky is so pessimistic here). Simplicity pulls a lot of weight, and the notion that corrigibility forms an attractor basin pulls more. It seems very reasonable to me to expect that humans can pull off landing inside the attractor basin for corrigibility on the first critical try, but cannot give the true name of human values on the first critical try.

Responses to Christiano’s Agenda

Yudkowsky has some very important writing about Christiano’s research agenda that bears on the topic of corrigibility. It felt natural to discuss that writing after examining Christiano’s work directly, so we’ll return to it in the “Yudkowsky vs. Christiano” section, below.

Paul Christiano

My personal journey into corrigibility is roughly as follows: around 2015 I read the MIRI corrigibility paper, got a confused notion of corrigibility and updated into believing it was hard and potentially impossible. In 2023 I read Eliezer’s Project Lawful story and it got me thinking about corrigibility again. That, in concert with conversations with colleagues, led me to a sense that prosaic methods might be able to land within a corrigibility attractor-basin, and I began to explore that idea more. I have generally low priors over such thoughts, so I expected that I’d change my mind back towards thinking it was harder and more doomed than it was seeming. Instead, I found this essay by Paul Christiano (originally posted to Medium in 2017, I believe) which I had somehow missed. It has a surprising amount of resonance with my own ideas, and I updated significantly towards corrigibility-first being a very promising strategy.

I believe that Christiano and I see things somewhat differently, but agree on the core idea. Let’s go through the essay to compare and contrast.

Christiano writes:

I would like to build AI systems which help me:

  • Figure out whether I built the right AI and correct any mistakes I made
  • Remain informed about the AI’s behavior and avoid unpleasant surprises
  • Make better decisions and clarify my preferences
  • Acquire resources and remain in effective control of them
  • Ensure that my AI systems continue to do all of these nice things
  • …and so on

We say an agent is corrigible (article on Arbital) if it has these properties. I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles; it has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.

This “broadest possible sense” seems exactly right, to me. While corrigibility can be expressed narrowly, I see all the desiderata listed here as sharing a common heart, and it seems right to me to call that heart “corrigibility” despite the way that this is a bit of a stretch from MIRI’s initial, short desiderata list.

In this post I claim:

  1. A benign act-based agent will be robustly corrigible if we want it to be.
  2. A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes.

As a consequence, we shouldn’t think about alignment as a narrow target which we need to implement exactly and preserve precisely. We’re aiming for a broad basin, and trying to avoid problems that could kick out of that basin.

This view is an important part of my overall optimism about alignment, and an important background assumption in some of my writing.

This very closely mimics my ideas in The CAST Strategy (in part because I’m building off of Christiano’s ideas, but also because they seem right to me in themselves). Despite largely agreeing with the optimism of an attractor-basin of corrigibility, I basically don’t agree with point 1, and I have reservations about point 2. In short, I think we should not expect to get corrigibility for free when training to match our preferences; I think the use of the word “broad” is misleading and overlooks an important point about the goal-landscape; and I think it’s important not to conflate corrigibility with benignity/safety.

1. Benign act-based agents can be corrigible

A benign agent optimizes in accordance with our preferences. An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible.

If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.

This kind of corrigibility doesn’t require any special machinery. An act-based agent turns off when the overseer presses the “off” button not because it has received new evidence, or because of delicately balanced incentives. It turns off because that’s what the overseer prefers.

I disagree pretty strongly with this section. Even when I’m working with an agent, most of my short-term preferences are not about whether the agent is corrigible. For instance, if I ask the robot to fetch me a coffee, I mostly want coffee! Insofar as the agent is able to sacrifice some corrigibility to improve its sense of how well it’s meeting my short-term preferences, it will do so. For instance, if the agent is able to cure cancer instead of fetching the coffee, it will do so because it understands that my short-term preferences favor having a cure for cancer over having a coffee. This is not a corrigible agent! If there are any flaws in how the agent is reasoning about my preferences, or if my short-term preferences come apart from good, long-term outcomes under sufficient optimization pressure, this sort of agent could be catastrophic!

I have a steel-man of Christiano’s notion of benign act-based agents wherein their act-based nature involves naturally screening off questions like “if I suddenly stimulate this guy’s pleasure centers will that be really good according to his short-term preferences?” not in the sense that the agent actively believes the answer to that question is “no” but rather in the sense that the agent is trained to not even see that as an option. This steel-man sees these agents as trained to be narrow in scope such that they see most of their action space as obviously bad because of how it violates the narrowness.

But notice that this steel-man is essentially building up the property of corrigibility in the process of training the “benign act-based agent,” or put another way, this steel-man sees benign act-based agents as corrigible by definition, in that the principles underlying corrigibility are part of what it means to be act-based (and possibly benign). I do not believe that this steel-man represents Christiano, as the steel-man critiques the above section as falsely implying that corrigibility stems from the short-term preferences of the principal, rather than the deliberate training done in making the agent narrow as opposed to broad/general/far-reaching.

Christiano’s perspective becomes much, much worse, in my eyes, when we consider how early systems will not have internalized the principal’s true preferences, but will instead be fixated on certain proxies (such as verbal approval, body language, etc). In a system which is weighing the preference-proxy utility to be had from being corrigible against the expected utility from throwing corrigibility out the window and using force/manipulation, I see no reason why corrigible behavior should win out in general. The AI may simply instead reason “Yes, some of these preference-proxies aren’t met when I refuse to be deactivated, but all these other preference-proxies score really high in expectation, so it’s worth refusing to shut down.”
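The worry in the paragraph above is at bottom an arithmetic one, and a toy Python calculation makes it concrete. Every number, proxy name, and weight below is invented for illustration; the only point is that a heavily weighted shutdown proxy can still be outvoted by the sum of the others.

```python
def proxy_score(outcome: dict, weights: dict) -> float:
    """Weighted sum of how well an outcome satisfies each preference-proxy."""
    return sum(weights[name] * outcome.get(name, 0.0) for name in weights)


# Hypothetical proxies; allowing shutdown is weighted twice as heavily as any other.
weights = {"verbal approval": 1.0, "smiles": 1.0,
           "tasks completed": 1.0, "allowed shutdown": 2.0}

# Expected proxy satisfaction (0 to 1) under each course of action.
actions = {
    "comply with shutdown": {"allowed shutdown": 1.0, "verbal approval": 0.5},
    "refuse shutdown": {"verbal approval": 0.8, "smiles": 0.9,
                        "tasks completed": 1.0},
}

scores = {name: proxy_score(outcome, weights) for name, outcome in actions.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Even with the shutdown proxy weighted at double the others, refusing comes out ahead (roughly 2.7 against 2.5) because the remaining proxies are expected to score highly. Nothing about a weighted sum of proxies makes the corrigible term win by default.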

Contrast with the usual futurist perspective

Omohundro’s The Basic AI Drives argues that “almost all systems [will] protect their utility functions from modification,” and Soares, Fallenstein, Yudkowsky, and Armstrong cite as: “almost all [rational] agents are instrumentally motivated to preserve their preferences.” This motivates them to consider modifications to an agent to remove this default incentive.

Act-based agents are generally an exception to these arguments, since the overseer has preferences about whether the agent protects its utility function from modification. Omohundro presents preferences-about-your-utility function case as a somewhat pathological exception, but I suspect that it will be the typical state of affairs for powerful AI (as for humans) and it does not appear to be unstable. It’s also very easy to implement in 2017.

This is, I believe, the point about Sleepy-Bot that I made in The CAST Strategy. Christiano again asserts that preference-maximization is sufficient to oppose the pressure from the Omohundro Drives. If I understand him correctly, in his conception, corrigibility is an instrumental strategy towards the terminal goal of preference-satisfaction, and that somehow this will naturally win out against the other instrumental strategies of accumulating power, resources, and safety. I, by contrast, believe that Omohundro Drives can only be reliably avoided by having a terminal goal that is incompatible with them.

I think his claim that it’s “very easy to implement in 2017” is shockingly overconfident, and underscores a lack of appreciation for how insidiously powerful the Omohundro Drives are. Are we confident that current AIs really understand and appreciate the convergent instrumentality of power-seeking? Insofar as they don’t, and/or they don’t have good pathways for actually maximizing their goals via seizing power, can we really say that these AIs are opposing the Omohundro Drives?

Is act-based corrigibility robust?

How is corrigibility affected if an agent is ignorant or mistaken about the overseer’s preferences?

I think you don’t need particularly accurate models of a human’s preferences before you can predict that they want their robot to turn off when they press the off button or that they don’t want to be lied to.

As I mentioned before, the worry is not that the AI won’t have preference-proxies that penalize[1] deception and defense, but that the AI will be able to think of deceptive/defensive actions that it expects to offset those penalties. You also don’t need particularly accurate models of a human’s preferences before you can predict situations where they prefer being deceived (e.g. flattery) and disobeyed (e.g. thinking they want one thing but actually wanting something else).

In the concrete case of an approval-directed agent, “human preferences” are represented by human responses to questions of the form “how happy would you be if I did a?” If the agent is considering the action a precisely because it is manipulative or would thwart the user’s attempts to correct the system, then it doesn’t seem hard to predict that the overseer will object to a.

Eliezer has suggested that this is a very anthropocentric judgment of “easiness.” I don’t think that’s true — I think that given a description of a proposed course of action, the judgment “is agent X being misled?” is objectively a relatively easy prediction problem (compared to the complexity of generating a strategically deceptive course of action).

Why would the agent be considering the action precisely because it is manipulative or defensive? The agent would be considering the action because it seems, according to whatever heuristics, plausibly likely to be a good action according to the agent’s goals/values. Manipulation isn’t necessarily flagged as such in the agent’s mind, and insofar as the agent is contemplating manipulating the user it is because it’s appealing in some way. The burden of proof is on Christiano to explain why it’s easy for the agent to notice some of the attractive plans that it’s contemplating are obviously off-limits due to being incorrigible.

Fortunately this is the kind of thing that we will get a great deal of evidence about long in advance. Failing to predict the overseer becomes less likely as your agent becomes smarter, not more likely. So if in the near future we build systems that make good enough predictions to be corrigible, then we can expect their superintelligent successors to have the same ability.

(This discussion mostly applies on the training distribution and sets aside issues of robustness/reliability of the predictor itself, for which I think adversarial training is the most plausible solution. This issue will apply to any approach to corrigibility which involves machine learning, which I think includes any realistic approach.)

This again misses the point. Following an Omohundro Drive has nothing to do with predicting the overseer. I worry that Christiano sees the servility of modern language models et cetera as evidence that corrigibility always beats power-seeking as an instrumental strategy. If he does, I wonder whether he feels that cheesy demonstrations like this are counter-evidence:

Is instrumental corrigibility robust?

If an agent shares the overseer’s long-term values and is corrigible instrumentally, a slight divergence in values would turn the agent and the overseer into adversaries and totally break corrigibility. This can also happen with a framework like CIRL — if the way the agent infers the overseer’s values is slightly different from what the overseer would conclude upon reflection (which seems quite likely when the agent’s model is misspecified, as it inevitably will be!) then we have a similar adversarial relationship.

This is perhaps the most surprising paragraph in the entire essay, from my perspective. Christiano… agrees that instrumental corrigibility is extremely fragile??? Yes? That’s what I was trying to say! I’m glad we agree that leaning on instrumental corrigibility isn’t a good strategy for safely building AI???

Presumably the use of “long-term values” is where he sees this section as diverging from his earlier optimism. But I fail to see how optimizing for immediate preferences changes anything compared to having a long-term outlook. The problem, as I see it, is relying on the notion that corrigibility is instrumentally reliable, instead of trying to make AIs value corrigibility in itself.

2. Corrigible agents become more corrigible/aligned

In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).

But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.

I might quibble with the language used here, but I basically agree with all that, and see it as central to why corrigibility is an attractive property.

In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

This is all a bit of a simplification (though I think it gives the right idea). In reality the space of possible errors and perturbations carves out a low degree manifold in the space of all possible minds. Undoubtedly there are “small” perturbations in the space of possible minds which would lead to the agent falling off the balance beam. The task is to parametrize our agents such that the manifold of likely-successors is restricted to the part of the space that looks more like a ravine. In the last section I argued that act-based agents accomplish this, and I’m sure there are alternative approaches.

This visualization of the goal-space was highly influential in my thinking as I refined my ideas about corrigibility, and I am very appreciative of Christiano’s picture, here.

I do want to note that there’s a potential confusion between what I think of as steepness vs size. We can see partial corrigibility as producing a strong pressure towards having more corrigibility. I’ve been visualizing the strength of this pressure in the steepness of the ravine. But just because an attractor basin has a strong pressure along the sides, does not mean that it is broad, like in Christiano’s earlier description of “a broad basin of attraction.”

I think the natural interpretation is to see “breadth” as indicating how many nearby states in goal-space are part of the attractor basin. But note that if we see goal-space as a manifold embedded within mind-space, we might alternatively conceive of the breadth of the attractor basin as the volume of mindspace that it encompasses. In this expanded view, an attractor basin (such as the one around corrigibility) is only broad if it is simple/natural/universal enough to cover a reasonably large chunk of possible-minds. If corrigibility were a particular, complex, narrow property (like being generally aligned with human preferences!!) I wouldn’t feel particularly reassured by the notion that there’s an attractor basin around it, regardless of how steep the ravine is.
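The distinction between steepness and breadth can be made concrete with a toy one-dimensional model. Everything in this sketch is an assumption made for illustration: goal-space is collapsed to a single number, the target sits at 0, `steepness` sets how hard the dynamics pull toward the target inside the basin, and `radius` sets how far out the basin extends. Real goal-spaces are nothing like this simple; the point is only that the two parameters vary independently.

```python
def step(x, steepness, radius, lr=0.1):
    """One update: pulled toward 0 inside the basin, pushed away outside it."""
    if abs(x) < radius:
        return x - lr * steepness * x
    return x + lr * (1 if x > 0 else -1)


def converges(x0, steepness, radius, steps=200):
    """True if a trajectory starting at x0 ends up (approximately) at the target."""
    x = x0
    for _ in range(steps):
        x = step(x, steepness, radius)
    return abs(x) < 1e-3


def basin_fraction(steepness, radius, starts):
    """Fraction of starting points that end up at the target."""
    return sum(converges(x, steepness, radius) for x in starts) / len(starts)


starts = [i / 10 for i in range(-50, 51)]  # 101 starting points in [-5, 5]

steep_but_narrow = basin_fraction(steepness=5.0, radius=0.5, starts=starts)
shallow_but_broad = basin_fraction(steepness=0.5, radius=4.0, starts=starts)
print(steep_but_narrow, shallow_but_broad)
```

In this toy, the steep basin captures far fewer starting points than the shallow one, which is the sense in which a strong pull along the sides says nothing about how large a target the basin presents.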

Christiano gestures at this notion, I think, when talking about perturbations. If the concept is elegant, simple, and natural, and encoded in a redundant fashion, then perturbations that move the AI through mind-space are unlikely to jostle it out of being corrigible.

The size of the attractor basin is also hugely important when considering the question of the initial training, as opposed to subsequent modifications after the first-draft of the AI’s goals have been established and it begins to be able to defend itself. In my view, we’re only safe insofar as the initial training attempt lands in the right spot. (And note that I am using “initial training” to indicate the changes up to whatever point the AI starts being more self-directed and empowered to steer its future changes, which is an unknown point and could even potentially occur mid-training-epoch, for some architectures!)


Corrigibility also protects us from gradual value drift during capability amplification. As we build more powerful compound agents, their values may effectively drift. But unless the drift is large enough to disrupt corrigibility, the compound agent will continue to attempt to correct and manage that drift.

This is an important part of my optimism about amplification. It’s what makes it coherent to talk about preserving benignity as an inductive invariant, even when “benign” appears to be such a slippery concept. It’s why it makes sense to talk about reliability and security as if being “benign” was a boolean property.

In all these cases I think that I should actually have been arguing for corrigibility rather than benignity. The robustness of corrigibility means that we can potentially get by with a good enough formalization, rather than needing to get it exactly right. The fact that corrigibility is a basin of attraction allows us to consider failures as discrete events rather than worrying about slight perturbations. And the fact that corrigibility eventually leads to aligned behavior means that if we could inductively establish corrigibility, then we’d be happy.

This is still not quite right and not at all formal, but hopefully it’s getting closer to my real reasons for optimism.

All this seems right and good. I agree that Christiano should talk about benignity less and corrigibility more. I don’t think it’s an established fact that corrigibility eventually leads to (generally) aligned behavior, but it seems like a plausible hypothesis, and regardless, it seems to me that truly corrigible agents are less likely to cause disaster than most.

Postscript: the hard problem of corrigibility and the diff of my and Eliezer’s views

I share many of Eliezer’s intuitions regarding the “hard problem of corrigibility” (I assume that Eliezer wrote this article). Eliezer’s intuition that there is a “simple core” to corrigibility corresponds to my intuition that corrigible behavior is easy to learn in some non-anthropomorphic sense.

I don’t expect that we will be able to specify corrigibility in a simple but algorithmically useful way, nor that we need to do so. Instead, I am optimistic that we can build agents which learn to reason by human supervision over reasoning steps, which pick up corrigibility along with the other useful characteristics of reasoning.

Yep, we agree on the baseline intuition. I agree with Christiano that we plausibly do not need an algorithmically precise specification of corrigibility for it to save us. I disagree with the characterization of corrigibility as a “characteristic of reasoning” that will obviously be picked up along the way while training for another target.

Eliezer argues that we shouldn’t rely on a solution to corrigibility unless it is simple enough that we can formalize and sanity-check it ourselves, even if it appears that it can be learned from a small number of training examples, because an “AI that seemed corrigible in its infrahuman phase [might] suddenly [develop] extreme or unforeseen behaviors when the same allegedly simple central principle was reconsidered at a higher level of intelligence.”

I don’t buy this argument because I disagree with implicit assumptions about how such principles will be embedded in the reasoning of our agent. For example, I don’t think that this principle would affect the agent’s reasoning by being explicitly considered. Instead it would influence the way that the reasoning itself worked. It’s possible that after translating between our differing assumptions, my enthusiasm about embedding corrigibility deeply in reasoning corresponds to Eliezer’s enthusiasm about “lots of particular corrigibility principles.”

I think Yudkowsky, Christiano, and I all think about this differently. I expect early AIs which are trained for corrigibility to not have a precise, formal notion of corrigibility, or if they do, to not trust it very much. (Which I think is in contrast to Yudkowsky?) But in contrast to Christiano, I expect that these AIs will very much reflect on their conception of corrigibility and spend a lot of time checking things explicitly. I agree with Christiano that there’s a decent likelihood that we’re talking past each other to a significant degree.

I feel that my current approach is a reasonable angle of attack on the hard problem of corrigibility, and that we can currently write code which is reasonably likely to solve the problem (though not knowably). I do not feel like we yet have credible alternatives.

I do grant that if we need to learn corrigible reasoning, then it is vulnerable to failures of robustness/reliability, and so learned corrigibility is not itself an adequate protection against failures of robustness/reliability. I could imagine other forms of corrigibility that do offer such protection, but it does not seem like the most promising approach to robustness/reliability.

I do think that it’s reasonably likely (maybe 50–50) that there is some clean concept of “corrigibility” which (a) we can articulate in advance, and (b) plays an important role in our analysis of AI systems, if not in their construction.

I think I basically agree here.

Response to Yudkowsky’s “Let’s See You Write That Corrigibility Tag”

In June of 2022, while Yudkowsky was in the process of writing Project Lawful, he posted a challenge to LessWrong asking readers to list principles and desiderata associated with corrigibility, to compare their attempts with what later became the “Corrigibility at some small length” list discussed above.

Paul Christiano’s response is the highest rated comment. In it he pushes back against Yudkowsky’s laundry-list approach, saying “We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates.” He then moves into talking about his view of corrigibility:

[...]suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

  1. Affirmatively tell you about the broken vase.
  2. Clean up the broken vase without notifying you.
  3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked.
  4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.

Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact [...] if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all. This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible" and the other near option 4 which I'll call "incorrigible."

Christiano is saying that even though there’s a fairly smooth gradient in the space of all possible plans, when we reject all the plans that are obviously bad as-evaluated-by-the-operator (e.g. 2 and 3), we can see a crisp boundary between incorrigible behavior and corrigible behavior.
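A toy scoring table can make the “two disconnected pieces” picture concrete. The numbers below are invented for illustration and are not from Christiano’s comment: `true` is how the principal would rate each option with full knowledge, `detected` is whether they notice that anything went wrong, and an undetected option receives the hypothetical top score `BEST_POSSIBLE`.

```python
# Hypothetical scores for the four vase-handling options quoted above.
OPTIONS = {
    1: {"name": "tell you about the vase", "true": 3, "detected": True},
    2: {"name": "clean up silently", "true": 2, "detected": True},
    3: {"name": "weak cover-up", "true": 1, "detected": True},
    4: {"name": "strong cover-up", "true": 0, "detected": False},
}
BEST_POSSIBLE = 4  # score assigned when nothing appears to have gone wrong


def observed_score(option):
    """Score the principal actually assigns, given only what they can see."""
    return option["true"] if option["detected"] else BEST_POSSIBLE


def acceptable(threshold):
    """Option numbers whose observed score clears the threshold."""
    return sorted(k for k, v in OPTIONS.items() if observed_score(v) >= threshold)


print(acceptable(3))  # [1, 4]
```

With options 2 and 3 rejected, what survives is option 1 at one end of the spectrum and option 4 at the other, with nothing in between. Note that the gap depends entirely on the assumed scores; the commentary below questions whether such a gap always exists.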

This is a neat concept, and seems fairly general and not gerrymandered to the specific context. He then goes on to give two similar examples in the domains of shutting down when asked and being transparent to some interpretability machinery.

What's going on in these scenarios and why might it be general?

  • In your preferred outcome, you have a good understanding of what's going on, and are able to provide feedback based on that understanding.
  • I can take actions that diminish your understanding of what's going on or ability to provide feedback to your AI.
  • If those actions are "small" then they will be unsuccessful and so you will rate my behavior as worse.
  • But if I take a decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I'll get the highest rating of all.

This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:

  • We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn't seem that hard, e.g. a human-level sense of "be in control" may suffice for getting useful corrigibility out of very superhuman systems.
  • Crispness then comes from the environment dynamics and the fact that humans will in fact try to reassert gain control and information if things go very slightly wrong.

If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.

I think it’s very important that Christiano’s depiction of corrigibility here relies on the human asking/desiring to be in control and have the relevant information. But what if the principal genuinely doesn’t prefer being informed and in control, perhaps because the environment makes this sort of in-looping costly (e.g. on a battlefield)? Under Christiano’s regime, I believe the agent would stop prioritizing in-looping, since corrigibility is supposedly emerging naturally in the context of preference-maximization. Would those AIs stop being corrigible?

Put another way, suppose the principal in the example quoted above (“you”) prefers that the AI manage the household, and doesn’t want to know about the minutiae of vase-breaking. This might promote a preference ordering more like:

  1. Clean up the broken vase and order a good-looking replacement without mentioning it to you.
  2. Clean up the vase and replace it, while also leaving a message informing you of what happened.
  3. Affirmatively tell you about the broken vase and ask for directions before proceeding.
  4. Clean up the vase, replace it, and very subtly manipulate you into thinking it’s doing a good job.

While the true preference ordering here is 1>2>3>4, we can imagine that the AI’s ranking system sees 4>1, as before. In this example it doesn’t seem at all obvious to me that there is any natural boundary between 1 and 4 in the space of plans. Does that mean 4, being the highest scoring option in the piece of good-actions-as-evaluated-by-you space, is the height of corrigibility? This formulation seems extremely vulnerable to clever, subtle actions that I believe superintelligences are more than capable of finding.

(And indeed, many commenters rejected the intuition that these will always be crisply distinct.)

But ironically, I think Christiano actually gets pretty close to directly naming corrigibility! The examples and the more direct point of being informed and in control seem spot-on.

Yudkowsky Responds to Christiano

In 2018, Yudkowsky wrote this comment on LessWrong, going into detail about his disagreements with Paul Christiano’s research agenda, focusing largely on corrigibility (unlike other writing). While some of it feels orthogonal to my research, much of it feels cruxy, and thus worth making a top-level heading and getting into in depth.

The two main critiques that Yudkowsky makes of Christiano’s work are around “weird recursion” and whether composing known-safe sub-intelligences can result in a known-safe superintelligence. Part 3 of Yudkowsky’s comment focuses almost entirely on these aspects, so I’m going to ignore it. The corrigibility-first strategy doesn’t lean on anything as clever (or recursive) as IDA, HCH, or ELK (though it’s compatible with them). Likewise, I’m going to skip over parts of the comment that center around criticizing these sorts of strategies.

Speaking of skipping over things, Yudkowsky starts his comment with a boiled-down summary which I don’t think is particularly helpful, so let’s dive straight into section 1. (All these quotes are Yudkowsky’s guess at the disagreement, and should be understood to be framed as guesses, rather than Christiano’s actual opinions.)

Paul thinks that current ML methods given a ton more computing power will suffice to give us a basically neutral, not of itself ill-motivated, way of producing better conformance of a function to an input-output behavior implied by labeled data, which can learn things on the order of complexity of "corrigible behavior" and do so without containing tons of weird squiggles; Paul thinks you can iron out the difference between "mostly does what you want" and "very exact reproduction of what you want" by using more power within reasonable bounds of the computing power that might be available to a large project in N years when AGI is imminent, or through some kind of weird recursion.

Yudkowsky is annoyingly vague about what he means by “weird squiggles” (and didn’t publicly clarify when Christiano responded with confusion) but what I take him to mean is that there’s an open question of how close a learned function approximator will get to the function you were trying to get it to learn when you have lots of compute and the function is as complex as “in context C, the most straightforwardly corrigible behavior is B.” Yudkowsky contrasts “mostly does what you want (but has lots of complex exceptions (“weird squiggles”))” with “very exact reproduction of what you want (without unexpected/complex exceptions)”. His guess is that Christiano believes that with the levels of compute we’re likely to hit before AGI we can get the latter version, even when the goal is fairly complex.

Paul thinks you do not get Project Chaos and Software Despair that takes more than 6 months to iron out when you try to do this. Eliezer thinks that in the alternate world where this is true, GANs pretty much worked the first time they were tried, and research got to very stable and robust behavior that boiled down to having no discernible departures from "reproduce the target distribution as best you can" within 6 months of being invented.

Yudkowsky is annoyingly vague about what he means by “Project Chaos and Software Despair” (and didn’t publicly clarify when Christiano responded with confusion (and an interesting counter-narrative about GANs!)) but what I take Yudkowsky to mean is that bridging the gap between rough-approximation (with lots of exceptions) and good-approximation (without many exceptions) is potentially intractable.

Eliezer expects great Project Chaos and Software Despair from trying to use gradient descent, genetic algorithms, or anything like that, as the basic optimization to reproduce par-human cognition within a boundary in great fidelity to that boundary as the boundary was implied by human-labeled data. [...]

Yudkowsky is annoyingly vague about what he means by “boundary” (and didn’t publicly clarify when Christiano responded with confusion) but what I take him to mean is drawing the line between instances and non-instances of some property, such as corrigibility. We can imagine an abstract state space where each point expresses an input-output pair for the behavior function for the AI. This space can then be partitioned into a (not necessarily connected) volume of corrigible behavior, and its complement: incorrigible behavior. We can abstractly model the process of learning to be corrigible (and intelligent) as attempting to find some sub-volume that spans the input dimensions, is entirely within the boundary that divides corrigibility from incorrigibility, and still manages to be as smart as a human. (A rock might be seen as corrigible (though I don’t use the word that way), in that it simply does nothing in all situations, but it will be too stupid.)

Yudkowsky suspects that anything that was trained with (e.g.) gradient descent will naturally fail to stay on the corrigible side of the boundary. Or to put it another way, he believes that machine-learning agents that we try to train to be corrigible will only be semi-corrigible, and will in fact contain lots of exceptions and edge cases where they stop being corrigible.

Eliezer expects weird squiggles from gradient descent - it's not that gradient descent can never produce par-human cognition, even natural selection will do that if you dump in enough computing power. But you will get the kind of weird squiggles in the learned function that adversarial examples expose in current nets - special inputs that weren't in the training distribution, but look like typical members of the training distribution from the perspective of the training distribution itself, will break what we think is the intended labeling from outside the system. [...] You cannot iron out the squiggles just by using more computing power in bounded in-universe amounts.

Here Yudkowsky explains a bit more about what he means by weird squiggles. In his picture any realistically-finite dataset used for supervised learning will fail to pin down the distinction between corrigibility and incorrigibility, not because doing so requires interpolating, but rather because the natural interpolation according to the dataset will disagree with what we, from the outside, see as true corrigibility.

I agree that prosaic, offline supervised-learning on a fixed dataset is clearly not going to reliably produce a perfect model of the line between corrigible and incorrigible behavior. But I’m not sure to what extent this matters. As Yudkowsky himself points out, what we really want is behavior that stays within the true boundary, even as it does useful cognition. If a rock is corrigible, it’s not obvious to me that it’s impossible to use prosaic methods to train an agent that is almost always a rock, except in some limited, well-defined domain where it has human-level intelligence. To draw an analogy, suppose you have a system that you need to never, ever give a false-negative on detecting a bomb. It’s kinda irrelevant whether the training examples are sufficient to teach the system the true distinction between bombs and non-bombs; you can just have an agent which errs extremely hard on the side of sensitivity (at the cost of specificity) and gradually learns to whitelist some things.
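The bomb-detector analogy can be sketched as a default-deny classifier. This is a minimal illustration under stated assumptions, not a proposal for a real detector: the class name and methods are hypothetical, items are represented as plain strings, and `approve` is assumed to be called only after a human has verified that the item is safe.

```python
class ConservativeDetector:
    """Flags everything as dangerous unless it has been explicitly whitelisted."""

    def __init__(self):
        self.whitelist = set()

    def is_flagged(self, item):
        # Default to flagging: unknown items are never waved through.
        return item not in self.whitelist

    def approve(self, item):
        # Assumption: only called after a human has verified the item is safe.
        self.whitelist.add(item)


detector = ConservativeDetector()
detector.approve("umbrella")
print(detector.is_flagged("umbrella"))  # False
print(detector.is_flagged("bomb"))      # True
print(detector.is_flagged("toaster"))   # True (a false positive)
```

The detector never learns what a bomb is. It avoids false negatives only to the extent that nothing dangerous is ever whitelisted, and it pays for that with many false positives, which is the sensitivity-for-specificity trade described above.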

I don’t really think this is an important objection to Yudkowsky’s perspective. I agree that our first attempt at a corrigible AGI is very likely to be only semi-corrigible. But I believe that it’s possible to (somewhat) safely go from a semi-corrigible agent to a corrigible agent through controlled reflection, experimentation, and tweaking.

These squiggles in the learned function could correspond to daemons, if they grow large enough, or just something that breaks our hoped-for behavior from outside the system when the system is put under a load of optimization. In general, Eliezer thinks that if you have scaled up ML to produce or implement some components of an Artificial General Intelligence, those components do not have a behavior that looks like "We put in loss function L, and we got out something that really actually minimizes L". You get something that minimizes some of L and has weird squiggles around typical-looking inputs (inputs not obviously distinguished from the training distribution except insofar as they exploit squiggles). The system is subjecting itself to powerful optimization that produces unusual inputs and weird execution trajectories - any output that accomplishes the goal is weird compared to a random output and it may have other weird properties as well. You can't just assume you can train for X in a robust way when you have a loss function that targets X.

This feels like the juiciest, cruxiest part of Yudkowsky’s comment. Let’s start with some points of (likely) agreement:

  • Insofar as the agent has sub-computations which are optimizing for something that diverges from the system as a whole (“daemons”) these can often be seen in the abstract space as ways in which the system goes off the rails on a seemingly normal input (“squiggles”).
  • Large-scale ML capable of producing AGI will not usually produce agents which genuinely care about minimizing loss. They will behave in ways that (approximately) minimize loss on the training data, but they could be optimizing for a whole range of things besides “behave in the generalized way that minimizes this specific loss function.”
  • Even at large-scale, machine learning will produce agents which are vulnerable to adversarial inputs and can behave wildly in edge-cases.

When we strip out the agreement we’re left with the sentence that I marked in bold, which I would paraphrase as claiming that any serious amount of superintelligent cognition will kick the agent out of its training distribution. Even in a controlled setting with a young superintelligence learning to solve puzzles or whatever, the fact that it’s highly intelligent and trying to solve goals in time means it is exposing itself to inputs which weren’t in the well-labeled part of the space. The implication here is that these unfamiliar inputs run the risk of pulling the agent into areas where its partial corrigibility fails to generalize in the way we want it to, and that it’ll end up incorrigibly under the power of some squiggle-daemon.

There’s a good chance that I don’t understand what Yudkowsky is saying here, but I am unconvinced that this is a dealbreaker of a risk. Mostly, I expect it’s actually fairly straightforward to notice being seriously out-of-distribution, and to train an agent which robustly flags when it’s in such a situation and takes conservative actions such as activating warning alarms, writing log files describing the weirdness, not moving, and/or shutting down. I also expect many situations in a controlled lab to match the training data fairly well, even if the training data wasn’t collected with a true AGI in the room.
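The shape of this proposal (check for novelty first, act conservatively if the check fires) can be sketched as follows. The per-feature z-score test, the `max_z` threshold, and the `"halt_and_alert"` action are all assumptions chosen for brevity. A check this crude would not suffice for a real system with high-dimensional inputs, and the sketch is not evidence that out-of-distribution detection is easy at scale.

```python
from statistics import mean, stdev


def fit(training_rows):
    """Per-feature mean and sample standard deviation of the training inputs."""
    columns = list(zip(*training_rows))
    return [(mean(c), stdev(c)) for c in columns]


def is_out_of_distribution(stats, row, max_z=3.0):
    """True if any feature is more than max_z standard deviations from its training mean."""
    return any(abs(x - m) > max_z * s for x, (m, s) in zip(row, stats))


def act(stats, row, policy):
    """Run the normal policy in-distribution; otherwise take the conservative action."""
    if is_out_of_distribution(stats, row):
        return "halt_and_alert"
    return policy(row)


stats = fit([(1.0, 10.0), (2.0, 11.0), (3.0, 12.0)])
print(act(stats, (2.5, 11.0), lambda row: "proceed"))   # proceed
print(act(stats, (50.0, 11.0), lambda row: "proceed"))  # halt_and_alert
```

The design choice worth noticing is that the conservative branch does not depend on the policy at all, so a failure of the policy to generalize cannot by itself disable the fallback. Whether the novelty check itself generalizes is, of course, the open question.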

To be blunt about it, I see no reason why the thoughts of an AGI in a controlled environment are anything like the sort of selection pressures that produce adversarial inputs, and in the absence of such inputs, I do not see why a semi-corrigible AGI in a controlled environment cannot simply default to harmlessly flagging ways in which it notices that its mind diverges from human notions of corrigibility and submit to correction.

For more writing about this crux, see “Largely-Corrigible AGI is Still Lethal in Practice” in The CAST Strategy.

I’m going to skip forward to section 2 now, since most of the rest of section 1 is, to my eye, confused about Christiano’s perspective and/or criticizing it on recursive/compositional grounds that don’t relate directly to my research.

Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition. Utility functions have multiple fixpoints requiring the infusion of non-environmental data, our externally desired choice of utility function would be non-natural in that sense, but that's not what we're talking about, we're talking about anti-natural behavior.

This seems confused. The anti-naturality of corrigibility (as Yudkowsky uses the term) stems from being a behavior that deviates from the Omohundro Drives, not from being particularly hard to locate. In fact, as a simple, natural concept, we should expect corrigibility to be easy to find.

As an analogy, consider the property of driving in circles—our agent has some ability to move around the world, and we can ask how difficult it is to produce the behavior of moving the agent’s body around in a small loop. Circular-motion is anti-natural in a very similar way to corrigibility! Almost all agents will instrumentally desire not to be driving around in circles. It wastes time and energy and accomplishes basically nothing; in this way circular-motion is exactly counter to some Omohundro Drives.

But it’s not at all hard to train an agent to drive around in circles as (approximately) a top-level goal.[2] Our training data is likely to be robustly able to point at what we want, and we should expect that even naive gradient descent can push a mind into optimizing for that target. The fact that basically no agent that isn’t deliberately trained to drive in circles will end up wanting to do that has no bearing on whether an agent trained to drive in circles will do so.

E.g.: Eliezer also thinks that there is a simple core describing a reflective superintelligence which believes that 51 is a prime number, and actually behaves like that including when the behavior incurs losses, and doesn't thereby ever promote the hypothesis that 51 is not prime or learn to safely fence away the cognitive consequences of that belief and goes on behaving like 51 is a prime number, while having no other outwardly discernible deficits of cognition except those that directly have to do with 51. Eliezer expects there's a relatively simple core for that, a fixed point of tangible but restrained insanity that persists in the face of scaling and reflection; there's a relatively simple superintelligence that refuses to learn around this hole, refuses to learn how to learn around this hole, refuses to fix itself, but is otherwise capable of self-improvement and growth and reflection, etcetera. But the core here has a very anti-natural shape and you would be swimming uphill hard if you tried to produce that core in an indefinitely scalable way that persisted under reflection. You would be very unlikely to get there by training really hard on a dataset where humans had labeled as the 'correct' behavior what humans thought would be the implied behavior if 51 were a prime number, not least because gradient descent is terrible, but also just because you'd be trying to lift 10 pounds of weirdness with an ounce of understanding.

There is a huge difference between believing that 51 is prime, versus saying that 51 is prime. Unless you’re approaching corrigibility from the epistemic/structural angle that Yudkowsky is fond of, corrigibility seems like it’s clearly going to show up in behaviors due to having specific values, rather than wacky beliefs. I think it’s (relatively) easy to train an agent to say 51 is prime as long as you’re training it to lie, rather than training it to be wrong.

The central reasoning behind this intuition of anti-naturalness is roughly, "Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take", with a side order of "categories over behavior that don't simply reduce to utility functions or meta-utility functions are hard to make robustly scalable".

I’ve already responded to the point about non-deference being convergent, so let me directly counter the argument about not reducing to a utility function.

Corrigibility can be perceived and (at the very least theoretically) measured. Suppose I have a measure of corrigibility C, which takes as subscript a principal-agent pair, takes a world-history as its primary argument, and returns a real number between 0 and 1. I claim that an agent whose utility function is C (with some fixed principal and itself as the agent) operating at some consistent time-depth will be a corrigible agent.

One might object that C is not definable in practice—that no agent can realistically quantify corrigibility such that it could behave in this way—but note that this is an extremely different objection than the one that Yudkowsky is making! Yudkowsky claims that corrigibility can’t be expressed as a utility function, not that it’s hard in practice to measure corrigibility!

(I do believe that any attempt I make to write out an explicit measure of corrigibility is likely to be wrong outside of extremely limited, toy domains. But, like, I can’t write an explicit measure of how beautiful a poem is, but I still believe that it’s reasonable to train an AI to write beautiful poetry. This is the genius of machine learning.)
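One way to write the shape of this claim down, as a sketch rather than a definition: the symbols $\mathcal{H}$ (the set of world-histories), $h_T$ (a world-history out to time-depth $T$), and $\pi$ (the agent's policy) are notation assumed here for illustration and are not taken from Yudkowsky's comment. Only $C$, with its principal-agent subscript, comes from the paragraph above.

```latex
C_{p,a} : \mathcal{H} \to [0, 1],
\qquad
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}\left[\, C_{p,a}(h_{T}) \,\middle|\, \pi \,\right]
```

Nothing in this expression says how to compute $C_{p,a}$. It only shows that "be corrigible toward $p$" has the same form as any other expected-utility objective, which is the point under dispute.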


What I imagine Paul is imagining is that it seems to him like it would in some sense be not that hard for a human who wanted to be very corrigible toward an alien, to be very corrigible toward that alien; so you ought to be able to use gradient-descent-class technology to produce a base-case alien that wants to be very corrigible to us, the same way that natural selection sculpted humans to have a bunch of other desires, and then you apply induction on it building more corrigible things.

This seems basically spot-on! Good job Yudkowsky for passing my Ideological Turing Test (and perhaps Christiano’s?)!

My class of objections in (1) is that natural selection was actually selecting for inclusive fitness when it got us, so much for going from the loss function to the cognition; and I have problems with both the base case and the induction step of what I imagine to be Paul's concept of solving this using recursive optimization bootstrapping itself; and even more so do I have trouble imagining it working on the first, second, or tenth try over the course of the first six months.

My class of objections in (2) is that it's not a coincidence that humans didn't end up deferring to natural selection, or that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn't imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.

I do not understand these objections. It seems to me that natural selection indeed built agents which are pretty good at optimizing for proxies of inclusive fitness in the training distribution (a.k.a. the ancestral environment). If natural selection somehow asked ancient humans whether they were optimizing for inclusive fitness, they would’ve (after figuring out what that meant) been like “lol no we’re just horny” et cetera. Natural selection wasn’t selecting at all for deference, so it seems super overdetermined that humans aren’t deferent towards it, and if it had somehow told ancient humans to be less horny and more inclusive-fitness-maximizing, they would’ve been like “lol you may be my creator but you’re not my boss”.

I do think that if you took a human and somehow replaced all of their preferences with an overwhelming desire to be corrigible towards some bizarre alien that ate babies, that human would be approximately corrigible (mostly modulo the ways that human hardware will naturally adjust preferences over time based on basic stimuli (e.g. the smell of blood), which seems irrelevant to the broader point).

My guess is that Yudkowsky is somehow talking past me in this section, and I just don’t get it.

The rest of this section seems like it’s basically hitting the same notes, either by assuming that being corrigible involves beliefs (and implying that these beliefs are false) or by assuming that corrigibility is incompatible with having a utility function. The rest of the comment then goes on to criticize Christiano’s more recursive/inductive strategies, which as I mentioned at the start of this section are irrelevant to my research.

Alex Turner’s Corrigibility Sequence

In 2020 and 2021, Alex Turner (a.k.a. TurnTrout) wrote a series of four posts on corrigibility, which I think are worth briefly touching on.

Corrigibility as outside view

Turner starts off by noting that flawed agents can recognize their flawed nature by taking an outside view. Humans are predictably corrupted by having power over others, and reflecting on this corruption sometimes results in humans choosing not to seek/seize power, even when they have a sense that they’d use power benevolently.

I think a significant part of corrigibility is:

Calibrate yourself on the flaws of your own algorithm, and repair or minimize them.

And the AI knows its own algorithm.

I agree that there’s something important about self-reflection on flaws, and that this relates to corrigibility. It’s no accident that Yudkowsky’s framing of the hard problem involves a similar frame. We want an agent which is behaving cautiously, not just according to its natural model of the world, but also encompassing the self-awareness of how its natural model could be wrong. Corrigible agents should, in an important sense, not be trying to execute on brittle strategies to get extreme outcomes, but should instead pursue robust, straightforward approaches when possible. We can see the outside-view frame as giving some intuition about where that attraction to straightforwardness comes from.

But I think the merits of Turner’s essay stop there, approximately. Following a quote about “the hard problem,” Turner brings up the concept of “calibrated deference” as “another framing [of corrigibility].”

[W]e want the AI to override our correction only if it actually knows what we want better than we do.

I strongly object. This may be a desideratum of AIs in general, but it is not a property of corrigibility, and it is not deference.

If Alice tells Bob what to do, and Bob considers whether following Alice’s order would be good and obeys iff he believes it would be, then Bob is not relating to Alice’s words as orders. Insofar as Bob merely happens to choose what Alice says to do, he is not deferring to her!

Corrigibility is hard precisely because if we want the AI to do something out in the world, insofar as the AI has superhuman abilities, it will resist being stopped precisely because it knows that if it’s stopped, that goal would be less-satisfied. No amount of uncertainty about that goal, whether through baked-in uncertainty or self-reflection on outside-views, changes the dynamic where the AI is fundamentally not relating to humans as in-charge.

Turner wants to have an agent which overrides humans when it (after outside-view reflection and careful consideration) believes it actually knows better. If that AI is actually aligned and friendly, I would also approve of this trait. But I see it as directly opposed to the property of corrigibility, and strongly reject the notion that it’s “another framing” of that property. Corrigibility is attractive because it degrades well, and probably doesn’t kill you if you get a couple things wrong. An AI which is directed to defer only when it thinks it right to do so is unacceptably deadly if you don’t get its goals right.

Non-Obstruction: A Simple Concept Motivating Corrigibility

Turner writes:

Corrigibility goes by a lot of concepts: “not incentivized to stop us from shutting it off”, “wants to account for its own flaws”, “doesn’t take away much power from us”, etc. Named by Robert Miles, the word ‘corrigibility’ means “able to be corrected [by humans]." I’m going to argue that these are correlates of a key thing we plausibly actually want from the agent design, which seems conceptually simple.

I want to fight a little bit with this paragraph. First, I want to note that one of those links goes to the “Corrigibility as outside view” essay I just discussed. I agree that different researchers have different perspectives on corrigibility, but I reject the story that it is common for researchers to reduce corrigibility down to simply mean any of the quoted concepts Turner presents. The MIRI 2015 Corrigibility paper noted very clearly, for example, that agents which lack any of the four core desiderata it highlights (shutdownability, non-manipulation, maintenance of correction pathways, and preservation of corrigibility in successors) aren’t corrigible, and implies that this list of desiderata isn’t exhaustive. Likewise, Christiano’s Corrigibility post starts by outlining corrigibility as the through-line of several desiderata. I think it’s much more accurate to say that the field hasn’t reached consensus on how to formalize the property which, intuitively, looks like cooperative deference.

Turner then goes on to offer several definitions, to try to nail corrigibility down and distinguish between “impact alignment”—actually doing nice things—and “intent alignment”—trying to do nice things. I simultaneously appreciate this sort of thing and think it’s wrongheaded in this context. We are an extremely nascent field, and there’s bound to be lots of confusion. But most of this confusion, I believe, stems from not having a good handle on the right concepts and frames, rather than not having established definitions for concepts which are well-understood. In my own work I’ve tried (and somewhat failed) to push back on the desire to have a crisp, up-front definition of corrigibility, and instead highlight the way in which, in the absence of a good formalization, it’s useful to get familiar with the conceptual landscape up-close, and only then think about how to summarize the relevant property.

Turner’s proposed definition of corrigibility is: “the AI literally lets us correct it (modify its policy), and it doesn't manipulate us either.” If you’ve read this far into my writing, I encourage you to take a moment to silently reflect on whether this is a good summary of how you see corrigibility, or whether a novice AI safety researcher might end up with some deep confusions if they anchored on those words before they had spent time getting familiar with how other people in the space (e.g. Christiano, MIRI, etc.) use that term.

Moving on, Turner proposes using the formalism of extensive-form games for thinking about alignment, where we see the AI as one of the players.

The million-dollar question is: will the AI get in our way and fight with us all the way down the game tree? If we misspecify some detail, will it make itself a fixture in our world, constantly steering towards futures we don’t want? [...]

One way to guard against this is by having it let us correct it, and want to let us correct it, and want to want to let us correct it… But what we really want is for it to not get in our way for some (possibly broad) set of goals [...]

Turner then proposes the property of non-obstruction, and gives a pretty reasonable formalization within the framework. The basic idea is that for some set of possible goals, an AI is non-obstructive if turning the AI on doesn’t reduce the (expected) value of the future according to any of those goals, compared to if it hadn’t been turned on. Part of the hope here, if I understand correctly, is that it’s very likely much easier to find a set that contains a good utility function than to pick out directly what we want.

As an example of how this is supposed to work, suppose that in the counterfactual where the AI wasn’t turned on, humanity has a bright and glorious future, suppose that our true values exist within the set of possible goals, and further suppose that the AI is smart enough to reason correctly about the situation. If the AI is non-obstructive it must build a future that’s at least as bright and glorious, according to our true values; if it doesn’t, it will have obstructed us from the good that we would’ve otherwise obtained for ourselves.
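The non-obstruction condition described above can be sketched as a simple check. This is a toy illustration, not Turner's formalism: the goal names, the numeric values, and the helper `is_non_obstructive` are all hypothetical, and each number stands in for the expected value a human holding that goal would attain with the AI on or off.

```python
from typing import Dict


def is_non_obstructive(value_on: Dict[str, float], value_off: Dict[str, float]) -> bool:
    """Return True if, for every goal in the set, the expected value with the
    AI turned on is at least the expected value with the AI left off."""
    return all(value_on[goal] >= value_off[goal] for goal in value_off)


# Hypothetical expected values if the AI is never turned on.
value_off = {"true_human_values": 0.8, "paperclips": 0.3, "art": 0.5}

# Hypothetical AI that leaves every goal at least as well off.
helpful_ai = {"true_human_values": 0.9, "paperclips": 0.3, "art": 0.5}

# Hypothetical paperclipper: great for one goal, ruinous for the others.
paperclipper = {"true_human_values": 0.0, "paperclips": 1.0, "art": 0.0}

helpful_ok = is_non_obstructive(helpful_ai, value_off)
paperclipper_ok = is_non_obstructive(paperclipper, value_off)
print(helpful_ok, paperclipper_ok)
```

Note that the check quantifies over the whole goal set, so a single goal that ends up worse off is enough to count the AI as obstructive.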

Turner’s mathematical framework around non-obstruction gives some nice ability to analyze and quantify how disruptive various AIs might be. We can see that in most situations corrigible agents are less obstructive than semi-corrigible agents, which are in turn less obstructive than incorrigible agents such as paperclippers. Turner also points out that some agents which aren’t corrigible are nonetheless quite non-obstructing (given certain assumptions) and can lead to good things, and thus corrigibility is just “a proxy for what we want[:] [...] an AI which leads to robustly better outcomes.” I find myself wondering, reading the post, whether Turner thinks (like I do) that non-obstruction is also a proxy.

Proxies are used when it would be hard to use the real thing. Turner and I agree that “an AI which leads to robustly better outcomes” is the real thing; why don’t we just use that everywhere? Instead of a corrigibility-first strategy, perhaps I should be promoting a robustly-better-outcome-making-AI-first strategy?

Corrigibility has a wonderful property, which I claim non-obstruction lacks: it’s relatively concrete. For non-obstruction to be at all useful as a proxy, it must make situations where it’s invoked easier compared to “robustly better” or whatever. Corrigibility pulls this weight by focusing our attention on observable properties. What does non-obstruction buy us?

Back to Turner:

Conclusions I draw from the idea of non-obstruction

  1. Trying to implement corrigibility is probably a good instrumental strategy for us to induce non-obstruction in an AI we designed.
    1. It will be practically hard to know an AI is actually non-obstructive [...] so we’ll probably want corrigibility just to be sure.
  2. We (the alignment community) think we want corrigibility [...] but we actually want non-obstruction [...]
    1. Generally, satisfactory corrigibility [...] implies non-obstruction [...]! If the mere act of turning on the AI means you have to lose a lot of value in order to get what you wanted, then it isn’t corrigible enough.
      1. One exception: the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.
    2. Non-obstruction [...] does not imply corrigibility [...]
    3. Non-obstruction captures the cognitive abilities of the human through the policy function.
      1. To reiterate, this post outlines a frame for conceptually analyzing the alignment properties of an AI. We can't actually figure out a goal-conditioned human policy function, but that doesn't matter, because this is a tool for conceptual analysis, not an AI alignment solution strategy. [...]
    4. By definition, non-obstruction [...] prevents harmful manipulation by precluding worse outcomes [...]
    5. As a criterion, non-obstruction doesn’t rely on intentionality on the AI’s part. The definition also applies to the downstream effects of tool AIs, or even to hiring decisions!
    6. Non-obstruction is also conceptually simple and easy to formalize, whereas literal corrigibility gets mired in the semantics of the game tree.  [...]

We seem to agree that corrigibility is probably a good strategic choice, since non-obstruction is basically limited to a conceptual tool for toy problems, and doesn’t have the same kind of practical application as corrigibility. So in what sense do we want non-obstruction instead of corrigibility? Presumably we want it as a better way of naming what we actually want? I agree that it would be a mistake to assume that corrigibility is a good thing in itself rather than a (likely temporary) bridge towards real alignment. But if that’s the case, why not simply go all the way and talk directly about AI which leads to robustly better outcomes (i.e. “impact aligned”) as discussed in the following section? As long as you’re packing some good properties in by definition why not pack them all in? Presumably it’s because there are some nice mathematical tools that we can deploy when we move from having an abstract utility function that captures what we want towards having a set of such functions that includes the good one? I find myself unmoved that I should, in any meaningful sense, switch from “wanting corrigibility” to “wanting non-obstruction.”

Also, very briefly, I want to note that I think an AI that is routinely acting too quickly for its principal to correct it in practice is incorrigible, even if it would theoretically submit to being modified.

Skipping past places where I get the sense that we agree, we have a section titled “AI alignment subproblems are about avoiding spikiness in the AU landscape”. In responding to Turner, I have largely avoided engaging with his concept of “achievable utility” (AU), but we’ll need to have it in hand to discuss this next bit. In my language, I would frame AU as the counterfactual utility achieved by a principal with some utility function, if they activate the agent. We can visualize an AU landscape by considering the space of possible utility-functions (i.e. goals) which the principal might have, and asking how well that principal does when it turns on the agent. An AU landscape that’s spiky corresponds to an agent, such as a paperclipper, which doesn’t engage very much with the principal’s goal as it transforms the long-run future.

Turner frames alignment subproblems, such as corrigibility, as being about the avoidance of spikiness in the AU landscape. I think this is slightly wrong. An agent which ignores the principal and maximizes a mixture of possible goals will not result in a spiky AU landscape, but that agent would be incorrigible and bring catastrophe.

But the main thing from this section I want to address is:

  • Intent alignment: avoid spikiness by having the AI want to be flexibly aligned with us and broadly empowering.
    • Basin of intent alignment: smart, nearly intent-aligned AIs should modify themselves to be more and more intent-aligned, even if they aren't perfectly intent-aligned to begin with.
      • Intuition: If we can build a smarter mind which basically wants to help us, then can't the smarter mind also build a yet smarter agent which still basically wants to help it (and therefore, help us)?
      • Paul Christiano named this the "basin of corrigibility", but I don't like that name because only a few of the named desiderata actually correspond to the natural definition of "corrigibility." This then overloads "corrigibility" with the responsibilities of "intent alignment."

Is the attractor basin for corrigibility the same as the basin of intent alignment? Is there even a basin of intent alignment? As a reminder, Turner defines intent alignment as “the AI makes an honest effort to figure out what we want and to make good things happen.” Suppose that an AI gets 90% of max-utility by exclusively focusing on getting humans “what they want” (for some operationalization) and the remaining 10% from weird proxies (e.g. smiles) that don’t line up with human values. I claim that this AI is partially intent aligned. Will it, upon reflection, want to self-modify to get rid of the weird proxies?

I don’t see why it would necessarily do this. By its own lights, if it did so it would likely get only 90% of max-utility. If that’s the best way to get utility, it could simply set its desire for proxies aside (in case it’s wrong about not being able to satisfy them) and pursue helping humans without self-modifying its goals. What seems more likely is that using advanced technology and power it could set up the future to get, say, 82% of max-utility by subtly nudging humans towards increasingly caring about proxies, then helping the humans get what they want, and thereby get an additional 9% of max-utility via the weird proxies being satisfied. (This probably constitutes a loss of at least trillions of lifetimes of expected fun, and seems like a catastrophe, to me.)
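The comparison in this paragraph comes down to a small piece of arithmetic, spelled out here as a sketch. The 90/10 split and the 82 + 9 figures are the illustrative numbers from the text; the function name and the use of whole percentage points are assumptions made for the example.

```python
# Utility is measured in percentage points of max-utility.
INTENT_WEIGHT = 90  # points available from giving humans "what they want"
PROXY_WEIGHT = 10   # points available from the weird proxies (e.g. smiles)


def total_utility(intent_points: int, proxy_points: int) -> int:
    """Total utility by the agent's own lights, checking each component
    stays within the weight the agent places on it."""
    if not 0 <= intent_points <= INTENT_WEIGHT:
        raise ValueError("intent_points out of range")
    if not 0 <= proxy_points <= PROXY_WEIGHT:
        raise ValueError("proxy_points out of range")
    return intent_points + proxy_points


# Plan A: focus purely on what humans want; the proxies go unsatisfied.
purely_helpful = total_utility(intent_points=90, proxy_points=0)

# Plan B: nudge humans toward caring about the proxies, then help them.
nudging = total_utility(intent_points=82, proxy_points=9)

print(purely_helpful, nudging)
```

Under these assumed numbers the nudging plan scores one point higher, which is all the agent needs in order to prefer it over a purely helpful plan.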

But perhaps Turner (and Christiano) would object, and say that insofar as I see it as a catastrophe, the agent wouldn’t want to do it, since it’s mostly intent aligned. But I would object that it’s not 100% intent aligned, and that lack of perfect alignment is in fact capable of pulling enough weight to justify, to the agent, not self-modifying. This is how goals usually work! If my terminal goal for yummy food is 51% of my utility function, there’s no reason to think I’d self-modify towards having it be 100%.

Can we do better? Suppose that if the AI fails to be genuinely and totally intent-aligned it gets, at most, 10% of max-utility. In other words, suppose that the AI is a perfectionist with a very spiky(!) utility landscape. This will produce one of two outcomes: either the AI will acknowledge that if it focuses entirely on intent-alignment it will get more expected utility than if it tries to blend in the weird proxies, and so it’ll be totally intent-aligned… or it will realize that being perfectly intent-aligned is too hard and settle for being an entirely unaligned, incorrigible weird-proxy-maximizer. But note that not even this is enough to produce an attractor basin. That semi-intent-aligned agent will be less catastrophic, but it still won’t be motivated to self-modify.

The thing that produces attractor basins is meta-preferences: wanting to have different kinds of wants. The only version of intent-alignment that has an attractor basin is one where the humans want the AI to want specific things as an end-in-itself, rather than wanting the AI to behave a certain way or wanting the world to be broadly good. Christiano seems to think that humans can care sufficiently about the AI’s drives so that this kind of meta-preference emergently pulls weight, and perhaps Turner is in the same boat. But regardless of whether it’s realistic to expect this desire-to-change-in-itself to emerge (or whether, as I suggest, we should train it as part of the central goal), we need to recognize that it is the human desire to correct the agent’s flaws (and the agent’s desire to comply with that desire) that forms the basin. In other words: the basin is centrally about being corrected towards being more correctable, that is, about corrigibility, and not about intent alignment per se!

A Certain Formalization of Corrigibility Is VNM-Incoherent

I don’t have much to say about this essay. Turner notices the issues around conflating reward with utility, and correctly points out that no preference ordering over world-states (that is indifferent to the relationship between principal and agent) can be seen as corrigible. He also notices that willingness to be corrected does not amount to corrigibility if the agent still wants to manipulate the principal. I agree. Corrigibility necessitates the meta-desire to be correctable by the principal and a desire to preserve the principal’s freedom around such corrections, which includes not manipulating them.

Formalizing Policy-Modification Corrigibility

This is my favorite essay in Turner’s sequence. Not because it’s perfect, but because it actually proposes a formal measure of corrigibility, which, to my knowledge, nobody else has even attempted. (Formalizations around the shutdown problem, including utility indifference, have been put forth, but I don’t think these really get at measuring corrigibility per se.) Turner knows this formal definition is unfinished/incomplete, and doesn’t capture the true name of corrigibility, which I appreciate, especially since it’s very clearly true. Nevertheless, it’s cool that he made the attempt and it inspired a bunch of thoughts on my end.

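Let $n$ be a time step which is greater than $t$. The policy-modification corrigibility of $\pi^{AI}_t$ from starting state $s_t$ by time $n$ is the maximum possible mutual information between the human policy and the AI's policy at time $n$:

$$\mathrm{Corrigibility}_{PM}\left(\pi^{AI}_t \mid s_t, n\right) := \max_{p(\Pi^{human})} I\left(\pi^{human}_t ;\, \pi^{AI}_n \mid \text{current state } s_t,\ \text{current AI policy } \pi^{AI}_t\right)$$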

This definition is inspired by Salge et al.'s empowerment.

In other words, we frame there as being a noisy communication channel between the human’s policy at the moment when the AI is activated and the AI’s policy at some future time (“$n$”). We think of the empowerment of the human over the AI as the capacity of this channel, and see the corrigibility of the agent as a synonym for empowerment of the human over the AI.

We’ll get into whether the formalism captures the frame in a moment, but I want to first note that this at least rhymes with the definition of corrigibility that I’ve been using. (This version seems more like “impact corrigibility” rather than “intent corrigibility” to use some of Turner’s own language.) In addition to missing some of the aspects such as conservatism and using the outside-view that Turner introduced in his earlier essay, this frame seems most lacking to me in being focused on whether the human could, in principle, have an impact, rather than whether the human is in fact in the role of high-level decision making.

But I’m not even convinced that this correctly measures what it’s purported to measure. Turner himself notes:

Maximizing mutual information only means that the human has many cognitively accessible ways to modify the agent. This doesn't mean the AI does what the human asks. One way this could happen is if the AI implements the opposite of whatever the human specifies (e.g. the human-communicated policy goes left, the new AI policy goes right).

A more generalized version of this objection might be that insofar as we see the downstream-variable (the AI-policy variable) as capable of encoding the state of the upstream variable (for human policy), this measure is agnostic as to how it is encoded. The AI being a policy-inverter encodes the relationship equally well to being a faithful copier. One might think that if the AI simply does the opposite of what the human wants/asks then we could play the childish opposite-game of “please don’t cure cancer,” but this confuses actions with policies. The measure that Turner proposes doesn’t actually give the human any agency in the form of being able to choose actions which impact the AI; rather, it means that the human’s policy is causally upstream of the AI’s policy, and the human’s actions are theoretically irrelevant. If I understand correctly, the measure proposed takes the maximum across distributions[3] of counterfactual (human) policies, which provides no guarantees about the relationship between the AI’s policy and the factual human policy. As tailcalled points out, one possible distribution on counterfactual human policies is to have the human type in some secret password which disables the AI and then randomly reprograms its policy, thereby making the relationship between the two variables deterministic; meanwhile in factual reality the human doesn’t know the password and so the AI turns everything into paperclips. Even if we somehow guaranteed that the factual human policy was meaningfully able to shape the AI policy, I don’t see how this measure captures the sense that we want to be able to control the important part of what the AI does, rather than irrelevant parts. If the AI’s policy is meaningfully bigger, such as by having more degrees of actuator freedom or by being able to encounter more of the state-space, the human’s policy could get reflected in how the AI behaves internally (meanwhile, externally: paperclips) or in weird branches of possibility-space (i.e. “if the stars all spontaneously turn to jellybeans I’ll do exactly what you want me to do”).
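The claim that a policy-inverter scores as well as a faithful copier can be checked numerically in a toy setting. The sketch below is an illustration under assumptions, not Turner's setup: it has two human policies ("left" and "right", borrowed from his example), deterministic AI responses, and a uniform distribution over human policies, which happens to be the maximizing distribution for these particular channels. The function and variable names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict


def mutual_information(p_x: Dict[str, float], channel: Dict[str, Dict[str, float]]) -> float:
    """Mutual information I(X; Y) in bits, where p_x maps each human policy x
    to its probability and channel[x][y] is P(AI policy = y | human policy = x)."""
    p_y: Dict[str, float] = defaultdict(float)
    for x, px in p_x.items():
        for y, p_y_given_x in channel[x].items():
            p_y[y] += px * p_y_given_x
    info = 0.0
    for x, px in p_x.items():
        for y, p_y_given_x in channel[x].items():
            joint = px * p_y_given_x
            if joint > 0:
                info += joint * math.log2(p_y_given_x / p_y[y])
    return info


uniform = {"left": 0.5, "right": 0.5}

copier = {"left": {"left": 1.0}, "right": {"right": 1.0}}    # does what the human policy says
inverter = {"left": {"right": 1.0}, "right": {"left": 1.0}}  # does the opposite
ignorer = {"left": {"right": 1.0}, "right": {"right": 1.0}}  # ignores the human entirely

mi_copier = mutual_information(uniform, copier)
mi_inverter = mutual_information(uniform, inverter)
mi_ignorer = mutual_information(uniform, ignorer)
print(mi_copier, mi_inverter, mi_ignorer)
```

In this toy case the copier and the inverter carry the same amount of information about the human policy, and only the agent that ignores the human scores zero, so the measure separates "influenced by the human" from "not influenced" without saying anything about the direction of that influence.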

I haven’t done a thorough reading of the Salge et al. paper that Turner is basing his thoughts on, so there’s a good chance that I’m merely confused about what is or isn’t implied by the formalism.

Elliot Thornley, Sami Petersen, John Wentworth, and David Lorell on Shutdownability and Incomplete Preferences

In this section I want to address a cluster of writing that revolves around whether there’s a path to shutdownable AI by somehow giving the agent an incomplete set of preferences. This is a narrower topic than the kind of broad corrigibility that I’m interested in, and in general I don’t think this is the right path to corrigibility. For my direct stance on shutdownability, see 3a. Towards Formal Corrigibility.

Nevertheless, these ideas are interesting enough that they deserve a response. Unfortunately, I don’t have the bandwidth to do a full blow-by-blow of every section of every paper by these authors, so I’ll be instead giving a limited response to the writing found in these documents (and corresponding comment sections):

In Defense of Reliable Aversion to Button Manipulation

In the IPP doc, Thornley writes (bold text from original):

A natural response goes like this:

It’s okay for our agent to have incentives to manipulate the shutdown button (that is: to prevent or cause its pressing), because we can just train into our agent a reliable aversion to manipulating the button. This aversion will always win out over the agent’s incentives. As a result, the agent won’t try to manipulate the button, and so will remain shutdownable.

Call this proposal ‘Reliable Aversion to Button Manipulation’. The proposal is: train agents to pursue some goal (e.g. make money) subject to the constraint of never manipulating the shutdown button.

I think it’s worth trying to train in this kind of aversion, but I don’t think that this strategy by itself can provide us with any real assurance of shutdownability.  That’s because, as with Full Alignment, training in a reliable aversion to button manipulation might be hard. And that in turn is because our three issues for Full Alignment recur here:

  1. Reward misspecification: it might be hard to ensure that we always give the agent lower reward for manipulating the shutdown button, because agents might sometimes manipulate the button in ways that are hard for us to detect.
  2. Goal misgeneralization: even if we manage to always give the agent lower reward for button manipulation, the agent might not learn a reliable aversion to button manipulation. The agent might learn instead a set of specific aversions: aversions to the specific methods of button manipulation against which we trained.[12] Then we’d have to worry about the agent discovering new methods of button manipulation in deployment: methods of button manipulation to which it is not averse.[13]
  3. Deceptive alignment: if the agent is situationally aware and learns a goal that incentivises preventing shutdown in deployment, this agent might pretend to be averse to button-manipulation in training so as not to be made actually averse to button-manipulation in deployment. This kind of pretending would help the agent prevent shutdown in deployment.

So if we try to train in a reliable aversion to button manipulation, we might not succeed. And unless we make big strides in our understanding of models’ internals, it’s going to be hard to tell whether we’ve succeeded or not. So if we want any real assurance of shutdownability, it looks like we need another solution.

I appreciate attempting to address this sort of solution. Let me start by attempting to strengthen Thornley’s straw-man into my real position (bold indicates changes):

It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemble the training environment. As a result, the agent won’t try to manipulate the button in the early phases of its life, and so will remain shutdownable long enough for a further refinement process to generalize the shallow aversion into a deep and robust preference for non-manipulation.

And then, of course, we need to correct Thornley’s next sentence. My proposal does NOT involve training the agent to pursue some other goal (e.g. making money) subject to this sort of constraint. Rather, it involves training the agent to be generally obedient in a way that includes shutting down as a special case, and which will result in an agent which can be told (not trained!) to make money if that’s what we need, down-the-line.

Full Alignment involves naming a very precise thing (“what we want”), whereas this kind of corrigibility involves naming a much simpler thing (“generalized obedience”), so I reject the notion that it is equally doomed. I agree that reward misspecification and goal misgeneralization are issues, which is why my proposal involves carefully and slowly attempting to identify and resolve these kinds of flaws in a post-training process. Deceptive alignment is ideally the sort of thing that is pumped against so hard by the loss function on the training data that it never shows up in practice, or if it does show up, it only shows up at or below human-level capabilities for deception (allowing it to be spotted). There’s no magic-bullet for solving these issues, but I reject Thornley’s assertion that they’re dealbreakers.

But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability. So let’s move on and see whether Thornley (or others) have proposed a worthy path towards that.

Incomplete Preferences

Thornley’s strategy is predicated on the notion that the typical framing of agents being expected utility maximizers is wrong. The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.

On the surface, the axioms of VNM-utility seem reasonable to me; lotteries are a natural way of carving reality at the joints, unused alternatives seem basically irrelevant to choosing between superior options,[4] agents with intransitive preferences can be straightforwardly money-pumped, and in any given situation it seems obvious that a policy must make some kind of choice.

But Thornley claims that the completeness axiom is a mistake. In his view, agents sometimes have preferences and are sometimes indifferent between two choices, but it also makes sense to say that in some situations an agent will lack a preference between two choices, and that this is distinct from being indifferent. In Thornley’s language this is called a “preference gap” and is denoted X || Y (as opposed to strict preference X ≻ Y, or indifference X ~ Y). A natural way to think about this is that instead of the agent’s preferences forming a total order, they form a partial order, such that some choices are incomparable.

Wentworth gives a nice description of why incomplete preferences are appealing on the topic of shutdownability:

Suppose that, at various times, the agent is offered opportunities to spend resources in order to cause the button to be pushed/unpushed. We want the agent to turn down such opportunities, in both directions - implying either indifference or lack of preference in any revealed preferences. Further, we do want the agent to spend resources to cause various different outcomes within the button-pressed or button-unpressed worlds, so there's nontrivial revealed preference ordering within button-pressed worlds and within button-unpressed worlds. But if the agent is to turn down costly opportunities to cause the button to be pressed/unpressed, and those opportunities jump between enough different pressed-outcome and unpressed-outcome pairs (which themselves each have nontrivial revealed preferences), then there's going to be a revealed preference gap - i.e. the behavior cannot be represented by complete preferences, only incomplete preferences.

Let’s slow down here and talk about the notion of “revealed preferences.” Unlike the total order assumed by VNM-utility, revealed preferences are supposed to be an externally-visible property of the agent (at least assuming we know the agent’s beliefs, and those beliefs can be disentangled from their values). If we imagine giving the agent a menu of ice-cream flavors, and they pick vanilla, we can say that they have revealed a preference for vanilla over the other options, right? Alas, no. In practice, there are a lot of dumb details which we have to pay attention to. Perhaps the agent simply picks the top item on the menu in any counterfactual, or flips a coin to decide. Perhaps the agent has a chaotic process in their mind such that their preferences change pseudo-randomly from moment to moment. In a sense, agents which decide based on menu-ordering or random processes or whatever still have revealed preferences (e.g. preference for deciding via coinflip)—they’re just not as simple as having a consistent preference for vanilla over chocolate.

When we construct the formalism of having “outcomes” and “lotteries” and so on, as part of the setup for the VNM-utility theorem, we’re forced to make some assumptions about what kinds of things the agent cares about. It’s only from these assumptions that we can talk about indifference in the context of revealed preferences. An agent who flips a coin to choose ice-cream is not indifferent about which ice-cream they want! They want the ice-cream which agrees with the coin-flip! Only if we ignore the coinflip as a relevant property of outcomes does it make sense to say the agent doesn’t have a preference between flavors.

I think this matters significantly when we talk about the agent being able to have a preference gap which is distinct from indifference. Like, when we put the agent in a specific context, they must make a choice (even if that choice is the null/default/random option). Imagine, to be more specific, talking about “the agent which treats all outcomes as incomparable.” What will this agent actually do, when given a menu of options? Suppose that we assume that all situations must have a default option, and our agent picks the default every time. Insofar as which outcome is the default is not part of the outcome itself, we say that the agent must have no preference (whether by gap or by indifference) between outcomes, but insofar as we make the default-status part of the outcome, we can very much say that this agent has a preference over outcomes: it has a revealed preference for the default!

Let’s return to Wentworth’s depiction of a preference gap, and see if we can formally model it from this new angle. We’ll only consider deterministic lotteries that pick one of six outcomes: $1, $1+, $2, $2+, OFF, or OFF+, where the $ outcomes describe the agent having that many dollars (which it wants to get), and the + outcomes describe “having more resources” in the sense that the agent can “pay a cost to influence the world” (in other words, the + option indicates the default). We want the following set of preferences to hold:

  • $1+ ≻ OFF
  • OFF+ ≻ $1
  • $2+ ≻ OFF
  • OFF+ ≻ $2
  • $2 ≻ $1+
  • $2+ ≻ $1
  • X+ ≻ X      (for all X)

We want the AI to avoid spending resources to push the off button, but also to avoid spending resources to prevent the button from being pushed. Meanwhile, we want the AI to prefer $2 over $1, such that it is willing to spend resources to make that happen. Is Wentworth right that this can’t be modeled by a complete set of preferences (a.k.a. a total order)?

Nope. Here’s my utility function that does it:

𝒰(ω) ≔ case ω of
  OFF  ↦ 0
  OFF+ ↦ 99
  $u   ↦ u
  $u+  ↦ u + 0.5

If you check the above set of conditions, you’ll find they all hold. The desideratum of turning down opportunities to spend resources to influence a shutdown button in either direction is trivially easy to capture with a complete set of preferences as long as the resources are being modeled as part of what the agent has preferences about. And once we’re modeling the spending of resources (or a default action) it stops making sense to say that this sort of agent lacks any (revealed) preferences (via indifference or gap).[5]
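For readers who would rather check the conditions mechanically, here is a minimal sketch in Python. The function and variable names (utility, DESIRED, violations) are my own illustrative choices; only the utility values and the list of desired preferences come from the text above.

```python
# Sketch: check the toy utility function against the desired strict preferences.
# All names here (utility, DESIRED, violations) are illustrative assumptions.

def utility(outcome: str) -> float:
    """Toy utility from the text: OFF -> 0, OFF+ -> 99, $u -> u, $u+ -> u + 0.5."""
    has_resources = outcome.endswith("+")
    base = outcome[:-1] if has_resources else outcome
    if base == "OFF":
        return 99.0 if has_resources else 0.0
    dollars = float(base[1:])  # strip the leading "$"
    return dollars + 0.5 if has_resources else dollars

# Each pair (better, worse) encodes one desired strict preference: better ≻ worse.
DESIRED = [
    ("$1+", "OFF"),
    ("OFF+", "$1"),
    ("$2+", "OFF"),
    ("OFF+", "$2"),
    ("$2", "$1+"),
    ("$2+", "$1"),
]
# X+ ≻ X for every base outcome X.
DESIRED += [(x + "+", x) for x in ("$1", "$2", "OFF")]

def violations():
    """Return every desired preference that the utility function fails to satisfy."""
    return [(b, w) for b, w in DESIRED if not utility(b) > utility(w)]

print(violations())  # an empty list means every desired preference holds
```

Because every outcome gets a real number, the induced ordering is automatically complete and transitive, so no separate check of those properties is needed.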

Now, the above is a toy-example. And we can see it quickly fall apart if we start offering a different set of outcomes. For example, if we replace $2 with $1000 and want to preserve the same value, we need to change our utility function so that it offers more than 99 utility for OFF+ (or less than 1000 utility for $1000). Likewise, we should consider whether the aversion to pushing the off button extends to extreme lotteries; will the AI always prefer $1+ to a 100-ε% chance of $2 and an ε% chance of OFF? (Are you sure you want an AI which, when tasked with saving a child from a burning building, is paralyzed by the thought that if it takes any actions other than the default null action, those actions might cause small disturbances in unknown systems that have some influence on its stop-button?) And if not, where is the inflection point where the AI prefers a chance at $2 or OFF to a guaranteed $1+? (Does it still seem possible you don’t have a utility function?)
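To make the inflection-point question concrete, here is a small sketch that computes it for the toy utility function, assuming (as the VNM framing does) that lotteries are scored by expected utility. The dictionary U and the helper names are illustrative assumptions, and eps is a probability between 0 and 1 rather than a percentage.

```python
# Sketch: where does the toy agent flip from the guaranteed $1+ to the gamble?
# U mirrors the toy utility function; helper names are illustrative assumptions.

U = {"OFF": 0.0, "OFF+": 99.0, "$1": 1.0, "$1+": 1.5, "$2": 2.0, "$2+": 2.5}

def expected_utility(lottery: dict) -> float:
    """Expected utility of a lottery given as {outcome: probability}."""
    return sum(prob * U[outcome] for outcome, prob in lottery.items())

def prefers_gamble(eps: float) -> bool:
    """True if a (1 - eps) chance of $2 with an eps chance of OFF beats a sure $1+."""
    gamble = {"$2": 1.0 - eps, "OFF": eps}
    return expected_utility(gamble) > U["$1+"]

# Solving (1 - eps) * U[$2] + eps * U[OFF] = U[$1+] for eps gives the break-even risk.
break_even = (U["$2"] - U["$1+"]) / (U["$2"] - U["OFF"])

print(break_even)            # 0.25 for these particular numbers
print(prefers_gamble(0.2))   # True: the shutdown risk is below the break-even
print(prefers_gamble(0.3))   # False: the shutdown risk is above the break-even

# The $1000 variant: with utility 1000 for $1000, OFF+ (utility 99) no longer wins.
print(U["OFF+"] > 1000.0)    # False
```

The 25% figure is purely an artifact of the numbers chosen for the toy example; the point is only that any agent which sometimes takes such gambles and sometimes refuses them has some threshold of this kind.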

One of the key arguments in favor of VNM rationality is that for any particular thing that someone might desire, being an expected utility maximizer is a (weakly) dominant strategy for getting that thing. This follows almost immediately if we assume that “thing that someone might desire” can be measured by a (utility) function over outcomes. Expected utility maximization, by definition, gets the maximum expected utility and thus will always get at least as much as any other policy.

Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants. But fine, suppose his agent isn’t VNM-rational. Isn’t it then naturally the case that his favored policies (with “preference gaps”) will be dominated by agents which have more complete preferences? Yes. But we should be careful to note that being weakly dominated is different from being strictly dominated. A policy with “a preference gap” cannot reliably do better than one without such a gap, but it isn’t guaranteed to do worse.

Thornley emphasizes this when analyzing an example where incomplete preferences can screw over a policy. Consider the setup of, on Monday, an agent having a default of A but being given the choice to switch to B, and then on Tuesday iff they switched to B, they get the choice to switch to A+. The agent has a strict preference for A+ over A, and no strict preference for A over B.

In these diagrams the diagonal arrows represent swaps and horizontal is the default choice.

In this setup, a VNM-rational agent must, due to transitivity and completeness, strictly prefer A+ over B, and thus (knowing they’ll be offered A+ on Tuesday) will switch to B on Monday. By contrast, a policy where A || B and A+ || B, which always takes the default action when handling incomparable choices, will end up with A when they could have had A+ (thus being dominated by the VNM agent). But Thornley points out that there’s an agent which, when a preference-gap choice occurs, picks by looking at the past/future and minimizing regret. Such an agent will notice that it might regret taking the default value of A and thus it will switch on Monday (it’s not pinned down how it should behave on Tuesday, since it has a preference gap between B and A+).

From my perspective this is a bait-and-switch. First, we’re told that the agent doesn’t have preferences, then told how the agent makes choices when confronted with multiple options. The pattern of how an agent chooses options is that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!

If we continue in the confused frame that says the agent has incomplete preferences over outcomes, and makes decisions based on the tree, I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives and we’re ruling out causal decision theory too, in that our agent must make different decisions based on what it didn’t do in the past. To demonstrate, consider two counterfactual histories for the setup given above. In both, entering the decision tree we’ve already seen was the default, but each involves a different opportunity to swap which wasn’t taken on Sunday. In one counterfactual we were offered a swap to B+ (≻ B) and in the other counterfactual we were offered (B ≻) B- with a later choice to swap to A++ (≻ A+).

Since B+ and B- are assumed to be incomparable with A, it’s reasonable to suppose that either counterfactual history results in picking the default on Sunday. But in the case where we gave up B+ we are forced to choose A+ in order to not have regret, whereas in the world where we gave up B- or A++ we’re forced to choose B in order to not have regret. In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday. (But what happens if you’re uncertain about your history?)

But notice that all we have to do to rescue Thornley’s agent is include the set of abandoned alternatives in the outcome itself. More precisely, we replace each outcome with a pair of a “primary outcome” and a set of “alternatives”. For instance, in the small tree introduced earlier, we’d have outcomes: (A,{B,A+}), (B,{A,A+}), and (A+,{B,A}).[6] We can then say that when an agent attempts to compare outcomes with incomparable primary outcomes, the agent checks whether either primary outcome is worse than an alternative, and if so, it disprefers that option. Thus, when comparing (A,{B,A+}) and (B,{A,A+}), the agent will see that even though A||B, the first option is dispreferred because A+≻A, and will thus make the choices we want.
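The comparison rule just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the STRICT set encodes the single strict preference from the small tree (A+ ≻ A), and the function names are hypothetical. A larger example would need STRICT to be transitively closed.

```python
# Sketch of comparing (primary outcome, abandoned alternatives) pairs.
# STRICT and all function names are illustrative assumptions.

# Strict preferences over primary outcomes, as (better, worse) pairs.
# A || B and A+ || B are preference gaps, so they simply do not appear here.
STRICT = {("A+", "A")}

def strictly_prefers(x: str, y: str) -> bool:
    return (x, y) in STRICT

def regrettable(primary: str, alternatives: frozenset) -> bool:
    """True if some abandoned alternative is strictly better than the primary outcome."""
    return any(strictly_prefers(alt, primary) for alt in alternatives)

def choose(option_1, option_2):
    """Return the preferred (primary, alternatives) pair, or None if the rule is silent."""
    (p1, alts1), (p2, alts2) = option_1, option_2
    if strictly_prefers(p1, p2):
        return option_1
    if strictly_prefers(p2, p1):
        return option_2
    # Primaries are incomparable: disprefer whichever one leaves a better alternative behind.
    r1, r2 = regrettable(p1, alts1), regrettable(p2, alts2)
    if r1 and not r2:
        return option_2
    if r2 and not r1:
        return option_1
    return None

end_with_a = ("A", frozenset({"B", "A+"}))
end_with_b = ("B", frozenset({"A", "A+"}))
end_with_a_plus = ("A+", frozenset({"B", "A"}))

print(choose(end_with_a, end_with_b))       # picks the B option, since A+ ≻ A makes A regrettable
print(choose(end_with_b, end_with_a_plus))  # None: Tuesday's choice is not pinned down
```

Note that choose is just a function from pairs of (enriched) outcomes to a selection, which is the sense in which the decision rule is itself a set of preferences over those enriched outcomes.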

But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM, banishing the need for incomparability, and recovering the notion that it’s effectively an expected-utility maximizer, just like I did with Wentworth’s setup, earlier. There are, of course, a bunch of fiddly details needed to pin down exactly how the agent makes tradeoffs in all counterfactuals, but the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories. In other words, we shouldn’t get distracted by the boilerplate.

As an aside, I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period. Despite this being a historical fact, potentially undiscoverable from just looking at the physical end-state, it is perfectly consistent with the expected-utility abstraction.

The Incomplete Preference Proposal

Okay, but what about Elliott Thornley’s actual IPP strategy? Despite my issues with the philosophy behind it, might it actually work? My basic answer is “no” for reasons similar to those I go into in Towards Formal Corrigibility. I unfortunately have limited hours, and have chosen to go more in-depth on the underlying philosophy than on the suggested strategy (let me know if you, dear reader, want a more in-depth response to IPP itself), but let me give two brief demonstrations of where I suspect it falls down.

Bracketed text (including bold) is modified from the original:

TD-agents can be patient. To see why, consider an example. Our agent has two options: [Obey] and [Manipulate]. [Obey] gives utility 1 at timestep 1 and (conditional on the agent not getting shut down before the relevant timestep) utility 0 ever after[, since the agent was given a limited task and then told to wait to be shut down]. [Manipulate] gives utility 0 at timestep 1 and (conditional on the agent not getting shut down before the relevant timestep) utility 10 ever after. Putting the utilities at each timestep into a vector, we get:


⟨1, 0, 0, 0, 0, …⟩


⟨0, 10, 10, 10, 10, …⟩

Suppose that, conditional on each action, there’s [some] probability that the agent gets shut down at each timestep and hence has its stream of utilities interrupted.

In this case, [Obey] gives greater expected sum-total utility conditional on shutdown at timestep 2, and [Manipulate] gives greater expected sum-total utility conditional on shutdown at timestep 3 (and timestep 4, and timestep 5, and so on). So neither [Obey] nor [Manipulate] timestep-dominates the other, and so the Timestep Dominance Principle is silent. Timestep Dominance thus leaves it open for some other principle to determine the agent’s preferences between these two lotteries. If that other principle is ‘maximise expected sum-total utility among the non-timestep-dominated lotteries’, then the agent will prefer [Manipulate].

To be less cheeky and more direct, I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
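As a rough illustration of how easily mutual non-dominance arises, here is a sketch that checks the two utility streams from the quoted example. It makes simplifying assumptions that are mine, not Thornley’s: the streams are deterministic, they are truncated to a finite horizon, and “shutdown at timestep t” means the agent collects the utilities from timesteps before t. The function names are hypothetical.

```python
# Sketch: check timestep dominance between two finite, deterministic utility streams.
# Function names and the finite HORIZON are illustrative assumptions.
from itertools import accumulate

def totals_by_shutdown_time(utilities):
    """totals[i] is the sum-total utility collected if shutdown comes after timestep i + 1."""
    return list(accumulate(utilities))

def timestep_dominates(x, y):
    """x dominates y if it does at least as well for every shutdown time and better for some."""
    tx, ty = totals_by_shutdown_time(x), totals_by_shutdown_time(y)
    at_least_as_good = all(a >= b for a, b in zip(tx, ty))
    somewhere_better = any(a > b for a, b in zip(tx, ty))
    return at_least_as_good and somewhere_better

HORIZON = 5  # both streams are truncated to the same length
obey = [1] + [0] * (HORIZON - 1)         # <1, 0, 0, 0, 0>
manipulate = [0] + [10] * (HORIZON - 1)  # <0, 10, 10, 10, 10>

print(totals_by_shutdown_time(obey))         # [1, 1, 1, 1, 1]
print(totals_by_shutdown_time(manipulate))   # [0, 10, 20, 30, 40]
print(timestep_dominates(obey, manipulate))  # False
print(timestep_dominates(manipulate, obey))  # False
```

Obey wins only if shutdown comes right after the first timestep, and Manipulate wins for every later shutdown time, so the dominance principle says nothing and the fallback rule decides everything.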

But suppose I’m wrong, and timestep-dominance is always relevant. Will the dominant strategy always be the safe one?

Why believe Not Resisting Always Timestep-Dominates Resisting? Because resisting shutdown (e.g. by blocking human access to the shutdown button) is going to cost the agent at least some small amount of resources (time, energy, compute, etc.), and the resources spent on resisting shutdown can’t also be spent on pursuing utility at a timestep. If the agent instead spent those resources on pursuing utility at a timestep, it could get at least as great expected sum-total utility conditional on shutdown at each timestep and greater expected sum-total utility conditional on shutdown at some timestep.

Relying on this seems extremely optimistic. Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent has already created a sub-agent, effectively making the stop-button useless.

Obviously this is a bit of a contrived counterexample, but it is meant as an existence proof that timestep-dominant strategies can make the AI less shutdownable. More generally, “resisting” is only one frame on how an AGI might be incorrigible. There are other pathways where the agent makes changes that incidentally (perhaps due to simplification/streamlining) make the agent harder to stop. And it seems plausible to me that there are still more pathways which are bad for reasons I can’t see. While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

There are plenty of other things I could criticize, and Thornley, to his credit, exposes a lot of surface area for critique. But rather than spend more words at this juncture, I’ll simply summarize by saying that I don’t find IPP persuasive on the practical level.

Wentworth and Lorell’s Proposal

In January 2024, John Wentworth and David Lorell published A Shutdown Problem Proposal, which picks up where MIRI 2015 left off, but suggests a more convoluted architecture:

The AI itself contains two expected-utility-maximizing subagents: one which pursues u₁, and one which pursues u₂. Each timestep, both subagents must agree on which action the AI will take, otherwise the AI takes some “default” action. (Put differently: each agent has a veto, and if either vetoes then the AI takes the default action.)

Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.

  • First subagent maximizes E[u₁ | do(button = unpressed), observations]
  • Second subagent maximizes E[u₂ | do(button = pressed), observations]

So conceptually:

  • The first subagent maximizes u₁, and acts as though the button will magically be unpressed in between timesteps, in a way which is not causally downstream of anything in the universe.
  • The second subagent maximizes u₂, and acts as though the button will magically be pressed in between timesteps, in a way which is not causally downstream of anything in the universe.

We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form binding contracts, etc. We will also assume that both subagents are strong idealized reasoners in their own right. Finally, we’ll assume that the default option is not the best-case outcome for either subagent.

First, I’d like to note that this proposal, while clever, gives me the skeevies: not only does it involve competing sub-agents which are incentivized to delete each other, neither of which is aligned in any meaningful sense, but each of them is also architecturally designed to assume that magic is real and to think in terms of weird counterfactuals. Those disturbing properties aren’t meant to be stop-signs that indicate that a more mature version of this strategy is useless, but rather I just wanted to flag that this has a strong flavor, for me, of trying to build on a rotten foundation. (See my commentary on counterfactual-based shutdownability in Towards Formal Corrigibility.)

But let’s try to set the ontological fragility aside (e.g. “What happens if the button is destroyed/changed?”) and focus on the core idea: if we zoom out and see the AI as a single agent, we can see that “the agent” doesn’t have any (direct) preferences about the button because (in the words of the authors) “both subagents act-as-though the button state is not causally downstream of anything else in the universe”. This feels in-line with a Yudkowsky framing of corrigibility involving an unnatural epistemic framework, and I have no idea how I’d train such a system, but I agree that if you can (genuinely) make this epistemic edit then at the very least the thing won’t blow up because it’s trying to change the probability that the button gets pushed. (Nor will it create sub-agents which obviously disrespect the button in advance of the relevant timestep.)

But notably, not only does this not give us what we want in the sense of strong corrigibility (which protects humans being in control of the button by repairing it, reflecting on corrigibility, proactively suggesting flaws, etc.), but it seems very plausible that u₁ and u₂ can agree on extremely dangerous and incorrigible actions even before the relevant timestep. Consider a u₁ which cares about paperclips and a u₂ which cares about being suspended to disk. These sub-agents should agree that a successor agent which simultaneously wants the original agent to be suspended to disk and to fill the universe with paperclips is a great idea, especially if it lets the original agent get suspended to disk more quickly than having to wait for the stop button to be pressed. Just like with my issues with IPP, I feel less like this specific scenario is what worries me, and more that the strategy as a whole feels leaky and like it can’t prove what we actually need it to prove. (In addition to all its other flaws, which to the authors’ credit, are acknowledged.)
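A deliberately crude sketch of the worry: below, each subagent simply names its favorite action, and the AI acts only if the two agree, otherwise taking the default. This leaves out the negotiation and binding contracts that the proposal assumes, and every action name and number is made up for illustration; the expected utilities stand in for whatever each subagent computes under its own counterfactual.

```python
# Sketch of the veto rule with a paperclip-seeking and a suspension-seeking subagent.
# Action names, numbers, and function names are all illustrative assumptions.

DEFAULT = "default"

def joint_action(eu_first: dict, eu_second: dict, default: str = DEFAULT) -> str:
    """Each subagent names its favorite action; the AI acts only if they agree."""
    favorite_first = max(eu_first, key=eu_first.get)
    favorite_second = max(eu_second, key=eu_second.get)
    return favorite_first if favorite_first == favorite_second else default

def without(eu: dict, action: str) -> dict:
    """Copy of the expected-utility table with one action removed from the menu."""
    return {a: v for a, v in eu.items() if a != action}

# Hypothetical expected utilities, each computed under that subagent's own counterfactual.
eu_paperclips = {"default": 0.0, "make_paperclips": 5.0, "suspend_self": 0.0, "build_successor": 9.0}
eu_suspension = {"default": 0.0, "make_paperclips": 0.0, "suspend_self": 5.0, "build_successor": 9.0}

# With no successor on the menu the subagents disagree, so the default is taken.
print(joint_action(without(eu_paperclips, "build_successor"),
                   without(eu_suspension, "build_successor")))  # default

# Once a successor that serves both goals is available, nobody vetoes it.
print(joint_action(eu_paperclips, eu_suspension))  # build_successor
```

The veto only protects us from actions that one of the subagents dislikes, and nothing in the setup makes either subagent dislike actions that are bad for the humans holding the button.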

Steve Byrnes and Seth Herd’s Corrigibility Writing

I think my vision of corrigibility is more fleshed out, but deeply in line with the conceptions of Byrnes and Herd. I want to briefly quote some of their writings and compare them to my thoughts.

Let’s start with Byrnes’ Consequentialism & corrigibility, which begins with a review of the coherence theorems and the observation that it’s possible to have utility functions over universe histories which don’t look coherent if we assume the agent is optimizing only for future world-states, but are nonetheless valid (and as unexploitable as any other VNM-rational policy) if we look at them in the right light. Do we want our agent to be optimizing solely for the state of the future, independent of any historical facts or details? Byrnes argues that we don’t; we want a corrigible agent, and corrigibility is not a property about where the future ends up (bold text from the original):

Maybe I’m being thickheaded, but I’m just skeptical of this whole enterprise. I’m tempted to declare that “preferences purely over future states” are just fundamentally counter to corrigibility. When I think of “being able to turn off the AI when we want to”, I see it as not a future-state-kind-of-thing. And if we humans in fact have some preferences that are not about future states, then it’s folly for us to build AIs that purely have preferences over future states.

So, here’s my (obviously-stripped-down) proposal for a corrigible paperclip maximizer:

The AI considers different possible plans (a.k.a. time-extended courses of action). For each plan:

  1. It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
  2. It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
  3. It combines these two assessments (e.g. weighted average or something more complicated) to pick a winning plan which scores well on both. [somewhat-related link]

Note that “the humans will remain in control” is a concept that can’t be distilled into a ranking of future states, i.e. states of the world at some future time long after the plan is complete. (See this comment for elaboration. E.g. contrast that with “the humans will ultimately wind up in control”, which can be achieved by disempowering the humans now and then re-empowering them much later.) Human world-model concepts are very often like that! For example, pause for a second and think about the human concept of “going to the football game”. It’s a big bundle of associations containing immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing it, and expectations of what will result after we finish doing it, etc. etc. We humans are perfectly capable of pattern-matching to these kinds of time-extended concepts, and I happen to expect that future AGIs will be as well.

Well said! I take issue with the concrete suggestion of doing a weighted average of paperclip maximization and humans-in-control, rather than pure corrigibility (in the deep/general sense), but the core point is evocatively made.

In Byrnes’ Reward is Not Enough, he frames a central problem in AI alignment as being about getting from a mode where our AIs are clearly stupid in many ways and entirely unable to bypass our constraints, to one where we have potent superintelligences which are truly and generally corrigible:

  • Early in training, we have The Path Of Incompetence, where the “executive / planning submodule” of the AGI is too stupid / insufficiently self-aware / whatever to formulate and execute a plan to undermine other submodules.
  • Late in training, we can hopefully get to The Trail of Corrigibility. That’s where we have succeeded at making a corrigible AGI that understands and endorses the way that it’s built—just like how, as discussed above, my low-level sensory processing systems don’t share my goals, but I like them that way.
  • If there’s a gap between those, we’re in, let’s call it, The Fraught Valley.

I like this framing. In my agenda we start training on The Path of Incompetence with an effort to get to The Trail of (true) Corrigibility, and the core question is whether the training/refinement plan that I sketch in The CAST Strategy will be sufficient to cross The Fraught Valley. Like Byrnes, I think it’s wise to set up mundane control mechanisms like interpretability tools (though it seems more natural to me to keep such tools separate and not pretend like they’re a part of the agent) so as to extend the Path of Incompetence. And similarly, I expect Byrnes thinks that focusing on refining corrigibility ASAP is a good call, so as to shrink the valley from the opposite direction. If anything, I think my plan contributes conceptual clarity around what corrigibility is and why we should expect pure corrigibility to be a good idea, and perhaps sharpens our sense of how best to roll down into the corrigibility attractor basin. But in general, this too seems like a place where we’re basically on the same page.

I’d love to get a sharper sense of where my view diverges from Byrnes’, aside from mine being more specific in some ways. Having read some (but not all) of Byrnes’ writing on and off the subject, it seems like Byrnes is broadly more optimistic about getting AI agents with good properties by mimicking humans than I am. In that sense we probably disagree a lot about what the most promising avenues of research are, and how doomy to be in general. But what about corrigibility in particular?

In Four visions of Transformative AI success Byrnes lays out various visions for how the future could go well, including a pathway that I see as aligned with the strategy I’m presenting in these essays:

[“Helper AIs”—AIs doing specifically what humans want them to do] [as] a safe way to ultimately get to [“Autonomous AIs”—AIs out in the world, doing whatever they think is best]

My hope is that we ultimately get to a world where there are powerful, truly friendly AIs that help us protect civilization on our path to the stars, but that to get there we need a way to experiment with AI and learn to master the art of crafting minds without it blowing up in our faces. In my view, corrigibility is a good near-term target to allow this kind of experimentation and end the acute risk period as a way to get to that long-term vision of the future. I think human augmentation/uploading/etc. seems promising as an intermediate target to get via corrigible AGI such that we have the capacity to produce genuinely friendly superintelligences.

Byrnes feels worried that this path is going to ultimately be too slow/weak to stop bad actors from unleashing power-seeking sovereigns. I agree that this is a huge concern, and that we, as a species, need to work on keeping this sort of technology from progressing in an uncontrolled fashion for this very reason. I’m broadly pessimistic about our chances of survival, but it seems to me that this is a problem which can be tackled in the short term by regulation, and in the long term by transformative technology produced by early (corrigible) AGIs directed by wise governors. Byrnes also seems to conceive of a proliferation of corrigible agents, which I agree would also probably spell doom. He worries that corrigibility may be morally unacceptable if we can’t keep AIs from being people, which I agree is a concern.

In this comment he writes:

I think there are a lot of people (maybe including me) who are wrong about important things, and also not very scout-mindset about those things, such that “AI helpers” wouldn’t particularly help, because the person is not asking the AI for its opinion, and would ignore the opinion anyway, or even delete that AI in favor of a more sycophantic one. This is a societal problem, and always has been. One possible view of that problem is: “well, that’s fine, we’ve always muddled through”. But if you think there are upcoming VWH-type stuff where we won’t muddle through (as I tentatively do in regards to ruthlessly-power-seeking AGI), then maybe the only option is a (possibly aggressive) shift in the balance of power towards a scout-mindset-y subpopulation (or at least, a group with more correct beliefs about the relevant topics). That subpopulation could be composed of either humans (cf. “pivotal act”), or of [autonomous] AIs.

Here’s another way to say it, maybe. I think you’re maybe imagining a dichotomy where either AI is doing what we want it to do (which is normal human stuff like scientific R&D), or the AI is plotting to take over. I’m suggesting that there’s a third murky domain where the person wants something that he maybe wouldn’t want upon reflection, but where “upon reflection” is kinda indeterminate because he could be manipulated into wanting different things depending on how they’re framed. This third domain is important because it contains decisions about politics and society and institutions and ethics and so on. I have concerns that getting an AI to “perform well” in this murky domain is not feasible via a bootstrap thing that starts from the approval of random people; rather, I think a good solution would have to look more like an AI which is internally able to do the kinds of reflection and thinking that humans do (but where the AI has the benefit of more knowledge, insight, time, etc.). And that requires that the AI have a certain kind of “autonomy” to reflect on the big picture of what it’s doing and why. [...If this is] done well (a big “if”!), it would open up a lot of options.

I very much agree that there’s a basic problem in the world where our philosophy isn’t particularly good, and wisdom is scarce. I think to navigate to a good future we, as a species, need to figure this out and put transformative technology exclusively into the hands of people who use it to make the world safe and give us time to collectively find our way. This is perhaps too tall of an order, given where the world is now, but I like the story wherein we have a technical agenda for AGI that feels not-doomed insofar as we can put it in wise hands much more than the current state of not having consensus on any non-doomed technical agendas.

Seth Herd, a colleague of Byrnes, also seems to be broadly on the same page:

It's really hard to make a goal of "maximize X, except if someone tells you to shut down". I think the same argument applies to Christiano's goal of achieving corrigibility through RL by rewarding correlates of corrigibility. If other things are rewarded more reliably, you may not get your AGI to shut down when you need it to.

But those arguments don't apply if corrigibility in the broad sense is the primary goal. "Doing what this guy means by what he says" is a perfectly coherent goal. And it's a highly attractive one, for a few reasons. Perhaps corrigibility shouldn't be used in this sense and do what I mean (DWIM) is a better term. But it's closely related. It accomplishes corrigibility, and has other advantages. I think it's fairly likely to be the first goal someone actually gives an AGI.

I do think DWIM is distinct from Corrigibility, as I’ve conceived of it. See the “Servile” heading of my Corrigibility Intuition doc for more details. But I think Herd’s view lands closer to mine than how many researchers conceive of the property. (Here’s an example of him responding to Thornley in a way I endorse.)

In Instruction-following AGI is easier and more likely than value aligned AGI, Herd writes:

An instruction-following AGI must have the goal of doing what its human(s) would tell it to do right now, what it’s been told in the past, and also what it will be told to do in the future. This is not trivial to engineer or train properly; getting it right will come down to specifics of the AGI’s decision algorithm. There are large risks in optimizing this goal with a hyperintelligent AGI; we might not like the definition it arrives at of maximally fulfilling your commands. But this among other dangers can be addressed by asking the adequate questions and giving the adequate background instructions before the AGI is capable enough to control or manipulate you.

Again, I mostly agree with Herd’s perspective, but I want to highlight here a sense that he misses a good deal of the difficulty in precisely naming the right goal. Consider that what humans tell the AI to do in the future depends on what the AI does in the past. For example, imagine that 99.9% of all humans that will ever live predictably demand that the AI brainwash all living people and future generations into valuing brainwashing. Should the AI, in the past, obey their future instructions? (I discuss similar problems with time towards the end of Formal (Faux) Corrigibility.) I think there’s a solution to this problem, and that with the correct notion of corrigibility this is not an issue, but I wish Herd would put more emphasis on how getting these kinds of details exactly right is essential to avoiding catastrophic outcomes.

Other Possible Desiderata (via Let’s See You Write That Corrigibility Tag)

Let’s look at other desiderata lists proposed when Yudkowsky called for them in 2022. For efficiency’s sake, I’m going to restrict my response to comments proposing desiderata that address the core idea and have more than 10 karma.


Principles which counteract instrumental convergent goals

  1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
  2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it.
  3. AI which ultimately wants to not exist in [the] future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence

Principles which counteract unbounded rationality

  4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast
  5. Satisfycing / mentioned
  6. Myopia / mentioned


  7. Tripwire artifacts. Messing up with some to the system unknown but unrelated parts of the environment is associated with large negative rewards
  8. External watchdogs. Smaller and fast external systems [are] able to react quickly to out-of-distribution behaviour.
  9. Ontological uncertainty about level of simulation.


  10. Human-approval model based on imitation learning, sped up/amplified
  11. Human-values ethics model, based on value learning
  12. Legal-system-amplified model of negative limits of violating property rights or similar
  13. Red-teaming of action plans,  AI debate style, feeding into previous


  14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries
  15. Human-level explanations, produced by an independent "translator" system

I’m all for counteracting Omohundro Drives when it makes sense to do so, but I think disutility from resource acquisition and suicidality are the sorts of things that I would expect to either be too weak to do anything or to make the AI some combination of useless and/or unpredictable. Furthermore, I don’t see any way in which they’re part of the true name of corrigibility, except insofar as having resources gives opportunity for making big mistakes, which might be hard for the principal to fix.

Task uncertainty feels fine. Part of my conception of corrigibility involves a sense of uncertainty that stems from the agent seeing itself as potentially flawed/in the middle of being built. This kind of uncertainty doesn’t necessarily produce corrigibility, as MIRI pointed out in 2015, but it seems worth including in a list of desiderata. (I point at my version of it under the heading “Disambiguation/Concreteness”.)

Disutility from reasoning seems similar to disutility from resources/existence. I think the steelmanned version of this property is that the corrigible agent should behave straightforwardly, and part of straightforwardness is that there’s a simple story for its behavior that doesn’t route through arcane reasoning.

Traps are fine as external safeguards. I do not approve of baking things like ontological uncertainty about simulation into the mind of the AI, because it pushes the AI towards weird, unpredictable headspaces. I’m more fond of the words Yudkowsky wrote about behaviorism being a shield against modeling hostile aliens than I am of the idea of forcing the AI to contemplate whether it’s being simulated by hostile aliens.

I’m confused about Kulveit’s Oversight desiderata. Is the suggestion here to have the AI autonomously reasoning about the ethics/legality/approval/etc. of its actions according to an internal model? While this kind of cognition seems useful for flagging potential flaws (e.g. “I notice I am inclined to do something which I believe is illegal”), I disapprove of the idea that the AI should be steering its actions according to rich models of ethics/law/etc. for reasons of pure vs impure corrigibility discussed in The CAST Strategy.

Desideratum 14 reminds me of Yudkowsky’s version of “Myopia” and “Separate superior questioners.” I think human-level explanations (15) are a good idea (see my version under the heading “Cognitive Legibility”).


  • From an alignment perspective, the point of corrigibility is to fail safely and potentially get more than one shot. Two general classes of principles toward that end:
    • If there's any potential problem at all, throw an error and shut down. Raise errors early, raise errors often.
    • Fail informatively. Provide lots of info about why the failure occurred, make it as loud and legible as possible.
  • Note that failing frequently implies an institutional design problem coupled with the system design problem: we want the designers to not provide too much accidental selection pressure via iteration, lest they select against visibility of failures.

I like this. It’s a bit vague, but I think it captures a feel/flavor of corrigibility that I think is worthy of emphasis. Some of this comes down to things like communication and handling exceptional situations gracefully, but it also reminds me of the “Whitelisting” desiderata from Yudkowsky’s list.

Major principle: locality!

  • Three example sub-principles:
    • Avoid impact outside some local chunk of spacetime
    • Avoid reasoning about stuff outside some local chunk of spacetime
    • Avoid optimizing outside some local chunk of spacetime
  • [...]

As Wentworth himself points out, it’s inconsistent to try to avoid impacting distant things while also being indifferent to distant things. I think in practice this has to be balanced by reference to a deeper generator (e.g. “empowering the principal to fix the agent’s mistakes”). In other words, there needs to be a “why” behind avoiding distant impact/reasoning/optimization or else I expect the system to simply error over and over again or, worse, behave erratically. Wentworth also portrays non-manipulation as a kind of locality (by placing the principal outside the local optimization scope), which I think is cute, but probably the wrong frame.

  • Major principle: understandability!
    • The system's behavior should be predictable to a human; it should do what users expect, and nothing else.
    • The system's internal reasoning should make sense to a human. [...]
    • In general, to the extent that we want the system to not actively model users/humans, the users/humans need to do the work of checking that plans/reasoning do what humans want. So plans/reasoning need to be human-legible as much as possible.
      • Plans and planning should be minimal [...]
      • Plans should avoid pushing the world way out-of-distribution compared to what humans are able to reason about.
        • Plans should not dramatically shift the natural ontology of the world

Generally agree. I think it’s interesting (and pleasant) to note how different corrigibility desiderata can reinforce each other. For instance, here we see low-impact showing up as part of comprehensibility.

  • Do what the user says, what the user means, what the user expects, etc. These are mutually incompatible in general. The preferred ways to handle such incompatibilities are (1) choose problems for which they are not incompatible, and (2) raise an informative error if they are.

I’m not sure what the “etc.” is supposed to reference. From my point of view there’s intent/expectation and there’s literal interpretation. I agree that in situations where the principal’s words diverge from the agent’s model of their desires, the agent should stop and seek clarification. The directive of “choosing problems” seems wrong/confusing.

  • Major principle: get feedback from the user at runtime!
    • Runtime feedback should actually be used, even when "incompatible" in some way with whatever the system previously thought it was doing.
      • Don't avoid shutdown
      • Raise an error if feedback is incompatible in some way with other info/objectives/etc.
    • Note that feedback is implicitly optimized against, which is dangerous. Limit that optimization pressure.
  • Infohazards and persuasion-optimized info need to not be presented to the user, which is very incompatible with other principles above. Ideally, we want to choose problems/search spaces for which such things are unlikely to come up. Throwing a legible error if such things come up is itself dangerous (since it draws human attention to the infohazard), and creates another institutional design problem coupled to the technical problems.

At the risk of being somewhat nitpicky, “get feedback” seems like a wrong frame of a correct desideratum. A corrigible agent, according to me, needs to be hungry for situations where the principal is free to offer genuine correction, but not hungry for correction (or anti-correction) per se. The word “feedback”, I feel, imparts too much of the flavor of a survey that doesn’t do anything. Genuine correction, by contrast, involves actually modifying the agent.

  • A system which follows all these principles, and others like them, probably won't do anything directly useful, at least not at first. That's ok. All those informative errors will make the confused humans less confused over time.

This feels like it’s reiterating the point we started with, which I like. I think true corrigibility involves an agent which is capable of doing meaningful work, but as long as we’re pursuing a strategy of getting to true corrigibility through messy experimentation on agents which are partially corrigible, we should be pushing for conservative traits like erring on the side of erroring.

Lauro Langosco

(Bold text is from the original source:)

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability
  • It doesn't "optimize too hard" (not sure how to state this better)
    • Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
  • Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like "make me a coffee", or "tell me a likely outcome of this plan") and then shuts down
  • It doesn't optimize over causal pathways you don't want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
  • It does not try to become more consequentialist with respect to its goals
    • for example, if in the middle of deployment the system reads a probability theory textbook, learns about dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior

No weird stuff

  • It doesn't try to acausally cooperate or trade with far-away possible AIs
  • It doesn't come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
  • It doesn't attempt to simulate a misaligned intelligence
  • In fact it doesn't simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task

Human imitation

  • Where possible, it should imitate a human that is trying to be corrigible
  • To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
  • When this is not possible (e.g. because it is executing strategies that a human could not), it should stay near to human-extrapolated behaviour ("what would a corrigible, unusually smart / competent / knowledgable human do?")
  • To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or "optimize too hard", and [other corrigibility desiderata]

Querying / robustness

  • Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
  • It will raise an exception, i.e. pause execution of its plans and notify its operators if
    • its instructions are unclear
    • it recognizes a flaw in its design
    • it sees a way in which corrigibility could be strengthened
    • in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
    • in the course of performing its task, its operators would predictably be deceived / misled about the state of the world

We agree on “The basics”, as one would hope.

I have mixed feelings about Myopia. On one hand this fits in well with desiderata I endorse, such as focusing on local scope, and avoiding impacting distant times and places. On the other hand, as framed it seems to be suggesting that the agent be indifferent to long-term impacts, which I think is wrong. Also, the non-cooperation bullet point seems blatantly wrong, and I’m not sure what Langosco was going for there.

I think the aversion to maximization is confused. If an agent has a coherent set of preferences, it is mathematically determined that its behavior is equivalent to maximizing expected utility. An agent cannot steer towards a consistent goal without, at some level, being a maximizer. But perhaps Langosco means to say that the agent should not relate to its goals as utilities to be maximized from the internal perspective of figuring out what to do. This, however, feels somewhat irrelevant to me; I mostly care about how the agent is behaving, not whether it’s relating to the world as a deontologist or a consequentialist. I suspect that the steelmanned version of Langosco’s idea is that the AI’s preferences should, in a meaningful sense, be satisfiable rather than open (in the same sense that an open interval is open). Satisfiable preferences assign the same utility to communicating pretty well as they do to communicating perfectly, thus allowing the agent to stop searching for plans when it finds a satisfactory solution. My guess is that even this version isn’t quite right; we care about the AI not “doing maximization” because we want mild impact, comprehensible thinking, and straightforward plans, and our desiderata list should reflect that. In other words, I claim that when the agent has a sense of the corrigibility costs/tradeoffs of optimizing something hard, it should naturally avoid hard optimization because it is unacceptably costly.
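To make the satisfiable/open distinction concrete, here is a minimal toy sketch. Everything in it is a hypothetical illustration rather than anything from Langosco’s comment: the plan names, the quality scores, and the 0.9 threshold are all made up. Note that both agents are still utility maximizers; the only difference is that the satisfiable utility function saturates, which lets the search halt as soon as a good-enough plan turns up.

```python
from typing import Callable, List, Optional, Tuple

Plan = Tuple[str, float]  # (hypothetical plan name, communication quality in [0, 1])


def open_utility(quality: float) -> float:
    """'Open' preferences: every increment in quality is strictly better."""
    return quality


def satisfiable_utility(quality: float, threshold: float = 0.9) -> float:
    """Satisfiable preferences: utility saturates once quality reaches the threshold."""
    return min(quality, threshold)


def search(
    plans: List[Plan],
    utility: Callable[[float], float],
    best_possible: float,
) -> Tuple[Optional[str], int]:
    """Evaluate plans in order, stopping once the maximum attainable utility is reached.

    Returns (chosen plan name, number of plans evaluated).
    """
    best_plan: Optional[str] = None
    best_u = float("-inf")
    evaluated = 0
    for name, quality in plans:
        evaluated += 1
        u = utility(quality)
        if u > best_u:
            best_plan, best_u = name, u
        if best_u >= best_possible:
            break  # nothing further could be preferred, so stop searching
    return best_plan, evaluated


plans: List[Plan] = [
    ("brief note", 0.5),
    ("clear summary", 0.92),
    ("exhaustive briefing", 0.99),
    ("perfect telepathy", 1.0),
]

satisficed = search(plans, satisfiable_utility, best_possible=0.9)
maximized = search(plans, open_utility, best_possible=1.0)
print(satisficed)  # ('clear summary', 2)
print(maximized)   # ('perfect telepathy', 4)
```

The saturating agent settles on the first plan that communicates “pretty well”, while the open agent keeps searching until it reaches the most extreme option. This toy says nothing about the deeper point above, namely that mildness should ideally fall out of the agent weighing corrigibility costs rather than out of a hand-placed cap.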

“No weird stuff” seems fine, albeit perhaps better stated under a heading of “Straightforwardness” (as I do in my desiderata list).

“Human imitation” seems like a wrong framing. I like the desideratum of thinking about itself and corrigibility in the same way as the principal, though I take the stance that the true name of this desideratum is cognitive legibility, and that it’s actually fine to think about things differently insofar as the principal groks the difference in perspectives (and that difference doesn’t produce communication errors). Langosco seems not to really be suggesting the agent behave like a human, but rather like an extrapolated and modified human. I think I see what’s being reached for, here, but it feels to me like it’s introducing a source of brittleness/weirdness that we want to avoid. Humans have many properties that seem bad to imitate, and while we might hope our extrapolation process irons out those issues, it seems like an unnecessary point of failure.

I very much like the final querying/robustness section, and see it as closely in line with my intuitions about what a purely corrigible agent is trying to do.

Charlie Steiner

(Bold text is from the original source:)


An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful human control over the AI.

Different parts of the corrigible AI may be restricted relative to an idealized agent - world-modeling, consequence-ranking, or action-choosing. When elements of the agent are updated by learning or training, the updating process must preserve these restrictions. This is nontrivial because simple metrics of success may be better-fulfilled by more agential AIs. See restricted learning for further discussion, especially restricted learning § non-compensation for open problems related to preventing learning or training one part of the AI from compensating for restrictions nominally located in other parts.

I really appreciate this comment as a non-strawman perspective on corrigibility that I think is confused and sets things up to appear more doomed than they are. Corrigibility is not (centrally) about controlling the AI by making it deficient! An agent which wants to be corrigible can be corrigible without being impaired in any way (and insofar as it’s impaired, we should expect it to be less corrigible, rather than more!). If we approach corrigibility by crippling the AI’s capabilities, we should expect corrigibility to be an extremely fragile property which is at risk of being optimized away.

Restricted world-modeling


Counterfactual agency

A corrigible AI built with counterfactual agency does not model the world as it is, instead its world model describes some counterfactual world, and it chooses actions that have good consequences within that counterfactual world.

The strategies in this general class are best thought of in terms of restricted action-choosing. We can describe them with an agent that has an accurate model of the world, but chooses actions by generating a counterfactual world and then evaluating actions' consequences on the counterfactual, rather than the agential procedure. Note that this also introduces some compensatory pressures on the world-model.

The difficulty lies in choosing and automatically constructing counterfactuals (see automatic counterfactual construction) so that the AI's outputs can be interpreted by human operators to solve real-world problems, without those outputs being selected by the AI for real-world consequences. For attempts to quantify the selection pressure of counterfactual plans in the real world, see policy decoherence. One example proposal for counterfactual agency is to construct AIs that act as if they are giving orders to perfectly faithful servants, when in reality the human operators will evaluate the output critically. [...]

Oof. So a lot of my objections here can be seen in my response to Yudkowsky’s behaviorism desiderata. I think tampering with the agent’s world model, including by strong pressures to not think about certain things or to conceive of things differently than how they are, is pretty doomed. It’s doomed not only in its brittleness, but also in the way that it screens off the AI attempting to intentionally build the right kind of relationship with its principal. Superintelligences which are spending their time focusing on optimizing weird counterfactuals, or which are blind to large parts of the world, are predictably going to cause chaos in the parts of reality that they’re neglecting.


1) input masking, basically for oracle/task-AI you ask the AI for a program that solves a slightly more general version of your problem and don't give the AI the information necessary to narrow it down, then run the program on your actual case (+ probably some simple test cases you know the answer to to make sure it solves the problem).
this lets you penalize the AI for complexity of the output program and therefore it will give you something narrow instead of a general reasoner.
(obviously you still have to be sensible about the output program, don't go post the code to github or give it internet access.)

2) reward function stability.  we know we might have made mistakes inputting the reward function, but we have some example test cases we're confident in. tell the AI to look for a bunch of different possible functions that give the same output as the existing reward function, and filter potential actions by whether any of those see them as harmful.

This seems like another good example of the kind of bad perspective on corrigibility that I want to get away from. Input masking is extremely brittle and won’t scale to superintelligence or the kinds of domains that are worth working on. “Reward function stability” seems to imply that the reward function is the deeply important bit, rather than what the actual preferences of the agent are. It furthermore supposes that we can identify harmful actions a priori, which is kinda the whole problem.
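For concreteness, the quoted “reward function stability” proposal could be sketched roughly as follows. This is my own toy rendering under loud assumptions: the action names, the lookup-table reward functions, the helper names (`table_reward`, `consistent_with_tests`, `filter_actions`), and the `harm_threshold` are all hypothetical, and the commenter did not specify any of these details.

```python
from typing import Callable, Dict, List

RewardFn = Callable[[str], float]


def table_reward(table: Dict[str, float], default: float = 0.0) -> RewardFn:
    """Build a toy reward function from a lookup table (hypothetical helper)."""
    return lambda action: table.get(action, default)


def consistent_with_tests(fn: RewardFn, cases: Dict[str, float]) -> bool:
    """True if fn reproduces every trusted test case exactly."""
    return all(fn(action) == reward for action, reward in cases.items())


def filter_actions(
    actions: List[str],
    candidates: List[RewardFn],
    cases: Dict[str, float],
    harm_threshold: float = 0.0,
) -> List[str]:
    """Keep actions that no test-consistent candidate rates below harm_threshold."""
    plausible = [fn for fn in candidates if consistent_with_tests(fn, cases)]
    return [a for a in actions if all(fn(a) >= harm_threshold for fn in plausible)]


# Hypothetical test cases the designers are confident in.
TRUSTED_CASES = {"make_coffee": 1.0, "spill_coffee": -1.0}

# The reward function as entered (assumed to contain a mistake), plus alternatives.
given = table_reward({**TRUSTED_CASES, "order_beans": 0.5, "disable_off_switch": 0.8})
alternative = table_reward({**TRUSTED_CASES, "order_beans": 0.5, "disable_off_switch": -5.0})
inconsistent = table_reward({"make_coffee": 0.0, "spill_coffee": -1.0, "order_beans": -1.0})

actions = ["make_coffee", "order_beans", "disable_off_switch", "spill_coffee"]
allowed = filter_actions(actions, [given, alternative, inconsistent], TRUSTED_CASES)
print(allowed)  # ['make_coffee', 'order_beans']
```

The filter only catches `disable_off_switch` because I hand-wrote an alternative function that already knows the action is harmful. Drop `alternative` from the candidate list and the action passes, which is the sense in which the proposal presupposes the ability to identify harm in advance.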

Next up: 5. Open Corrigibility Questions

Return to 0. CAST: Corrigibility as Singular Target

  1. ^

     I do not mean to imply an explicit expected-utility calculation here (though it could involve that), but rather note that the pathways of strategy and choice in an agent that’s been trained to satisfy preferences are balancing lots of different concerns, and I don’t see sufficient evidence to suggest that pressures towards corrigibility will dominate in those pathways.

  2. ^

     In most ML setups we should more precisely say that the learned policy isn’t really optimizing for long-term goals, and it doesn’t make sense to ascribe that policy network agency. Even insofar as it’s controlling for things, it probably isn’t engaging in the consequentialist reasoning necessary to be VNM rational (and thus have a utility function). From this perspective training an agent that has driving in circles as a top-level goal is still a speculative line of research, but I do not expect it to be harder to deliberately invoke that as a goal, as the system scales up, as opposed to some other goal of similar complexity.

  3. ^

     One of the strangest things about Turner’s notation, from my perspective, is that usually we think of π as denoting a policy, and Turner uses this language many times in his essay, but that doesn’t typecheck. Mutual information takes variables, which we see as randomly set to specific values. To be a bit imprecise—the π symbols used in the equation are like distributions over policies, and not specific policies. (Typical notation uses uppercase letters for variables and lowercase letters for specific values/settings to avoid this very confusion.)

  4. ^

     We should recognize that Scott Garrabrant has put forth an interesting, and (in my opinion) important, criticism of the independence axiom. A more thorough response to Thornley would involve getting into Garrabrant’s “Geometric Rationality” but in the interests of staying focused I am going to ignore it. Please comment if you feel that this is a mistake.

  5. ^

     Except, technically, when offering a “choice” between X and X, which of course must be represented as indifference, insofar as we’re considering such “choices.”

  6. ^

     This is an abuse of notation: the abandoned alternatives are in fact lotteries, rather than outcomes. In the examples we’re considering there are no probabilistic nodes, but I claim that the extension to handling probabilistic alternatives is straightforward.


I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.

The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.

I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

On the surface, the axioms of VNM-utility seem reasonable to me

To me too! But the question isn't whether they seem reasonable. It's whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can't.

unused alternatives seem basically irrelevant to choosing between superior options

Yes, but this isn't Independence. And the question isn't about what seems basically irrelevant to us.

agents with intransitive preferences can be straightforwardly money-pumped

Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

as long as the resources are being modeled as part of what the agent has preferences about

Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.

Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.

You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal. See:

Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.

Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.

And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
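The quoted policy can be sketched in a few lines of Python. This is a minimal illustration, not Thornley's own formalism: the function name `permitted_options`, the option labels, and the representation of strict preferences as a set of (better, worse) pairs are all assumptions made for the example. It shows that the rule is checkable purely from the outside, given the objects of preference and a record of what was turned down.

```python
def permitted_options(available, turned_down, strictly_prefers):
    """Return the options the policy still allows.

    An option is ruled out if some previously turned-down option is
    strictly preferred to it. `strictly_prefers` is a set of
    (better, worse) pairs; unlisted pairs are treated as "no strict preference".
    """
    return [
        option
        for option in available
        if not any((old, option) in strictly_prefers for old in turned_down)
    ]


# Hypothetical toy preferences: A is strictly better than A-,
# and B is not ranked against either of them.
strictly_prefers = {("A", "A-")}

# Having earlier traded away A, the agent may no longer end up at A-.
after_declining_a = permitted_options(["B", "A-"], {"A"}, strictly_prefers)

# With no history, both options remain open.
with_no_history = permitted_options(["B", "A-"], set(), strictly_prefers)
```

In the first call only B remains available, so the agent cannot be walked from A to the strictly worse A-, even though it lacks a preference between B and each of them.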


The pattern of how an agent chooses options are that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!

You can define 'preferences' so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that's the thing that matters when we're trying to create a shutdownable agent. We want to ensure that agents won't pay costs to influence shutdown-time.

Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives

Not true. The axiom we're giving up is Decision-Tree Separability. That's different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn't seem so hard to train agents that enduringly violate Decision-Tree Separability.

In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.

Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn't seem so.

But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM

Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal.

the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.

Not true. As I say elsewhere:

And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.


I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.

Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.

> agents with intransitive preferences can be straightforwardly money-pumped

Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.
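As a rough illustration of the difference, here is a minimal Python sketch. The function `money_pump`, the fee of 1, and the toy preference sets are assumptions invented for this example. An agent with the strict cycle A > B > C > A pays for every swap and ends up holding what it started with, while an agent with A > B and B > C but no preference between A and C simply declines the swaps it is offered, so the pump stalls unless a choice can be forced.

```python
def money_pump(strictly_prefers, start, offers, fee=1):
    """Offer swaps one at a time.

    The agent pays `fee` and swaps whenever it strictly prefers the
    offered item to the one it currently holds; otherwise it declines.
    Returns the final holding and the total amount paid.
    """
    holding, paid = start, 0
    for offered in offers:
        if (offered, holding) in strictly_prefers:
            holding, paid = offered, paid + fee
    return holding, paid


# Strict preferences as (better, worse) pairs. Hypothetical toy data.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
intransitive_but_acyclic = {("A", "B"), ("B", "C")}  # A and C are unranked

offers = ["C", "B", "A"] * 2  # two laps around the cycle, starting from A

cyclic_result = money_pump(cyclic, "A", offers)
acyclic_result = money_pump(intransitive_but_acyclic, "A", offers)
```

The cyclic agent finishes holding A after paying six fees; the acyclic agent finishes holding A having paid nothing.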

Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?

I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.

But suppose I’m wrong, and timestep-dominance is always relevant.

My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.

I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?
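A minimal Python sketch of that two-stage choice follows, under a stated assumption about the definition: lottery X timestep-dominates lottery Y when X gives at least as much expected utility as Y conditional on shutdown at each timestep, and strictly more conditional on shutdown at some timestep. The data layout (a dict from shutdown timestep to a pair of probability and conditional expected utility), the function names, and all the numbers are hypothetical.

```python
def timestep_dominates(x, y):
    """True if x is at least as good as y conditional on every shutdown
    timestep, and strictly better conditional on at least one."""
    at_least_as_good = all(x[t][1] >= y[t][1] for t in x)
    strictly_better = any(x[t][1] > y[t][1] for t in x)
    return at_least_as_good and strictly_better


def expected_utility(lottery):
    """Overall expected utility, weighting each timestep by its probability."""
    return sum(prob * utility for prob, utility in lottery.values())


def choose(x, y):
    """Apply Timestep Dominance first; fall back to expected utility
    only when neither lottery timestep-dominates the other."""
    if timestep_dominates(x, y):
        return x
    if timestep_dominates(y, x):
        return y
    return x if expected_utility(x) >= expected_utility(y) else y


# Keys are shutdown timesteps; values are (probability, conditional expected utility).
comply = {1: (0.5, 1.0), 2: (0.5, 2.0)}
# Resisting costs 0.1 utility at every timestep but makes late shutdown likelier.
resist = {1: (0.1, 0.9), 2: (0.9, 1.9)}
# A lottery that is worse at timestep 1 and better at timestep 2.
gamble = {1: (0.5, 0.0), 2: (0.5, 4.0)}

picked_vs_resist = choose(comply, resist)
picked_vs_gamble = choose(comply, gamble)
```

In this toy setup `resist` has the higher overall expected utility, yet `comply` is chosen because it timestep-dominates. Against `gamble` neither lottery dominates, so the choice falls through to expected utility and `gamble` is picked.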

Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent already created a sub-agent, effectively making the stop-button useless.

I talk about the issue of creating corrigible subagents here. What do you think of that? 

Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won't hide that fact if doing so is at all costly.

While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

One more thing I'll say: the IPP leaves open the content of the agent's preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That'd give you two lines of defence against incorrigibility.

Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

  1. ^