The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.

"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.

Because I don't actually think it does any good, or persuades anyone of anything, people don't like tests like that, and I don't really believe in them myself either.  I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.

But people asked, so, fine, let's actually try it this time.  Maybe I'm wrong about how bad things are, and will be pleasantly surprised.  If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.

So:  As part of my current fiction-writing project, I'm currently writing a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.

So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability".  (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)

Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me.  Some haven't.

I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell.  One and a half thousand words or so, maybe. (2169 words.)

How about you try to do better than the tag overall, before I publish it, upon the topic of corrigibility principles on the level of "myopia" for AGI?  It'll get published in a day or so, possibly later, but I'm not going to be spending more than an hour or two polishing it.

New Comment
22 comments, sorted by Click to highlight new comments since:

A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.

Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not interested in the list of corrigibility properties.

I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural. 

Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.

As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

  1. Affirmatively tell you about the broken vase.
  2. Clean up the broken vase without notifying you.
  3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked. 
  4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.

Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact in if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible" and the other near option 4 which I'll call "incorrigible." 

As a second example, suppose that you have asked me to turn off. Some possible behaviors:

  1. Turn myself off promptly and "clean up after myself" shutting off other systems.
  2. Turn myself off but don't clean up.
  3. Don't turn off (or deliberately leave debris) and try unsuccessfully to prevent you from giving me a low reward.
  4. Don't turn off and overpower you so that you can't give me a low reward.

Again moving from 1 -> 2 -> 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized "performs well as evaluated by you").

As a third example, suppose that you are using some interpretability tools to try to understand what I'm thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:

  1. I ignore the interpretability tools.
  2. I notice the interpretability tools, and explicitly try to think in ways that will look good to you.
  3. I try to think in ways that will look good, and (unsuccessfully) try to hide the evidence of those thoughts by moving on quickly and trying to generate another possible explanation for the hints you would see using your tools.
  4. I successfully hide all evidence of deception.

Again, 1 -> 2 -> 3 is getting worse and worse, and then 4 is great (as evaluated by you).

What's going on in these scenarios and why might it be general?

  • In your preferred outcome, you have a good understanding of what's going on, and are able to provide feedback based on that understanding.
  • I can take actions that diminish your understanding of what's going on or ability to provide feedback to your AI.
  • If those actions are "small" then they will be unsuccessful and so you will rate my behavior as worse.
  • But if I take a decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I'll get the highest rating of all.

This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:

  • We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn't seem that hard, e.g. a human-level sense of "be in control" may suffice for getting useful corrigibility out of very superhuman systems.
  • Crispness then comes from the environment dynamics and the fact that humans will in fact try to reassert gain control and information if things go very slightly wrong.

If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.

This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It's also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don't think either of those works, but I do think they are getting at an important intuition for solubility.

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want.  But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

  1. ^

    In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won't affect what clusters are "corrigible" vs "incorrigible" at all.

  • I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.

Can someone explain to me what this crispness is?

As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to an overpowering of who's values are getting optimized in the universe).

Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) that groups around doing a task well/poorly, that separates it from doing a task in a way that breaks my rating ability.

Is that what this crispness is? This little pool of rating fall off?

If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't know if the pool always exists around the action space, and to the extent it does exist I don't know how to use its existence to build a powerful optimizer that stays on one side of the pool.

Though Paul isn't saying he knows how to do that. He's saying that there's something really useful about it being crisp. I guess that's what I want to know. I don't understand the difference between "corrigibility is well-defined" and "corrigibility is crisp". Insofar as it's not a literally incoherent idea, there is some description of what behavior is in the category and what isn't. Then there's this additional little pool property, where not only can you list what's in and out of the definition, but the ratings go down a little before spiking when you leave the list of things in the definition. Is Paul saying that this means it's a very natural and simple concept to design a system to stay within?

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.

It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.

ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.

If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.

If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.

(Obviously all of that is just a best guess though, and the game may well be about something totally different.)


I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins

Eg some versions of "low impact" often makes the "incorrigible" basin harder to reach, roughly because "elaborate webs of deceptions an coverups" may require complex changes to the environment. (Not robustly)

In contrast, my impression is, what does not count as "principles" are statements about properties which are likely true in the corrigibility basin, but don't seem designable - eg "corrigible AI does not try to hypnotize you". Also the intended level of generality likely is: more specific than "make the basin deeper" and more general than "

Btw my impression is what makes the worst-case scenario hard to robustly solve is basically #4 from the list  above. Otherwise there are many ways how to make the basin work "in most directions". 

I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.

Principles which counteract instrumental convergent goals

1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it. 
3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence

Principles which counteract unbounded rationality

4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast
5. Satisfycing / mentioned
6. Myopia / mentioned


7. Tripwire artifacts. Messing up with some to the system unknown but unrelated parts of the environment is associated with large negative rewards
8. External watchdogs. Smaller and fast external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.


10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits of violating property rights or similar
13. Red-teaming of action plans,  AI debate style, feeding into previous


14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries
15. Human-level explanations, produced by an independent "translator" system

I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words written in a compelling style in something between a few hours and a few days. I don't think it makes any sense for me to do that,.  (I'd happily agree to claims of the type "Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics", but my understanding of the claim is more in the direction "all principles except 2 in this were invented by Eliezer and no one else invented/can invent any other ones")

Best list so far, imo; it's what to beat.

  1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.

Seems like a worthwhile exercise...

Principles Meta

There is a distinction between design principles intended to be used as targets/guides by human system designers at design time, vs runtime optimization targets intended to be used as targets/guides by the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.

A List

  • From an alignment perspective, the point of corrigibility is to fail safely and potentially get more than one shot. Two general classes of principles toward that end:
    • If there's any potential problem at all, throw an error and shut down. Raise errors early, raise errors often.
    • Fail informatively. Provide lots of info about why the failure occurred, make it as loud and legible as possible.
  • Note that failing frequently implies an institutional design problem coupled with the system design problem: we want the designers to not provide too much accidental selection pressure via iteration, lest they select against visibility of failures.
  • Major principle: locality!
    • Three example sub-principles:
      • Avoid impact outside some local chunk of spacetime
      • Avoid reasoning about stuff outside some local chunk of spacetime
      • Avoid optimizing outside some local chunk of spacetime
    • The three examples above are meant to be illustrative, not comprehensive; you can easily generate more along these lines.
    • Sub-principles like the three above are mutually incompatible in cases of any reasonable complexity. Example: can't avoid accidental impact/optimization outside the local chunk of spacetime without reasoning about stuff outside the local chunk of spacetime. The preferred ways to handle such incompatibilities are (1) choose problems for which the principles are not incompatible, and (2) raise an informative error if they are.
    • The local chunk of spacetime in question should not contain the user, the system's own processing hardware, other humans, or other strong agenty systems. Implicit in this but also worth explicitly highlighting:
      • Avoid impacting/reasoning about/optimizing/etc the user
      • Avoid impacting/reasoning about/optimizing/etc  the system's own hardware
      • Avoid impacting/reasoning about/optimizing/etc  other humans (this includes trades)
      • Avoid impacting/reasoning about/optimizing/etc  other strong agenty systems (this includes trades)
    • Also avoid logical equivalents of nonlocality, e.g. acausal trade
    • Always look for other kinds of nonlocality - e.g. if we hadn't already noticed logical nonlocality as a possibility, how would we find it?
      • Better yet, we'd really like to rule out whole swaths of nonlocality without having to notice them at all.
  • Major principle: understandability!
    • The system's behavior should be predictable to a human; it should do what users expect, and nothing else.
    • The system's internal reasoning should make sense to a human.
      • System should use a human-legible ontology
      • Should be able to categorically rule out any "side channel" interactions which route through non-human-legible processing
    • In general, to the extent that we want the system to not actively model users/humans, the users/humans need to do the work of checking that plans/reasoning do what humans want. So plans/reasoning need to be human-legible as much as possible.
      • Plans and planning should be minimal:
        • No more optimization than needed (i.e. quantilize)
        • No more resources than needed
        • No more detailed modelling/reasoning than needed
        • No more computation/observation/activity than need
        • Avoid making the plan more complex than needed (i.e. plans should be simple)
        • Avoid making the environment more complex than needed
        • Avoid outsourcing work to other agents
          • Definitely no children!!!
      • Plans should avoid pushing the world way out-of-distribution compared to what humans are able to reason about.
        • Plans should not dramatically shift the natural ontology of the world
  • Do what the user says, what the user means, what the user expects, etc. These are mutually incompatible in general. The preferred ways to handle such incompatibilities are (1) choose problems for which they are not incompatible, and (2) raise an informative error if they are.
  • Major principle: get feedback from the user at runtime!
    • Runtime feedback should actually be used, even when "incompatible" in some way with whatever the system previously thought it was doing.
      • Don't avoid shutdown
      • Raise an error if feedback is incompatible in some way with other info/objectives/etc.
    • Note that feedback is implicitly optimized against, which is dangerous. Limit that optimization pressure.
  • Infohazards and persuasion-optimized info need to not be presented to the user, which is very incompatible with other principles above. Ideally, we want to choose problems/search spaces for which such things are unlikely to come up. Throwing a legible error if such things come up is itself dangerous (since it draws human attention to the infohazard), and creates another institutional design problem coupled to the technical problems.
  • A system which follows all these principles, and others like them, probably won't do anything directly useful, at least not at first. That's ok. All those informative errors will make the confused humans less confused over time.

Exercise Meta

I do not think I currently know what concept Eliezer usually wants to point to with the word "corrigibility", nor am I even sure that he's pointing to a coherent concept at all (as opposed to, say, a bunch of not-actually-unified properties which would make it less likely for a strong AGI to kill us on the first failure).

I have omitted principles of the form "don't do <stupid thing>", like e.g. don't optimize any part of the system for human approval/feedback, don't outsource error-analysis or interpretation to another AI system, etc.

These took 1-2 hours to generate, and another 30 min - 1 hr to write up as a comment.

Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag i.e. next bit of the story, with an implied "tag – now you're it!" at the other author. 

“myopia” (not sure who correctly named this as a corrigibility principle),

I think this is from Paul Christiano, e.g. this discussion.

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability
  • It doesn't "optimize too hard" (not sure how to state this better)
    • Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
  • Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like "make me a coffee", or "tell me a likely outcome of this plan") and then shuts down
  • It doesn't optimize over causal pathways you don't want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
  • It does not try to become more consequentialist with respect to its goals
    • for example, if in the middle of deployment the system reads a probability theory textbook, learns about dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior

No weird stuff

  • It doesn't try to acausally cooperate or trade with far-away possible AIs
  • It doesn't come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
  • It doesn't attempt to simulate a misaligned intelligence
  • In fact it doesn't simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task

Human imitation

  • Where possible, it should imitate a human that is trying to be corrigible
  • To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
  • When this is not possible (e.g. because it is executing strategies that a human could not), it should stay near to human-extrapolated behaviour ("what would a corrigible, unusually smart / competent / knowledgable human do?")
  • To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or "optimize too hard", and [other corrigibility desiderata]

Querying / robustness

  • Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
  • It will raise an exception, i.e. pause execution of its plans and notify its operators if
    • its instructions are unclear
    • it recognizes a flaw in its design
    • it sees a way in which corrigibility could be strengthened
    • in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
    • in the course of performing its task, its operators would predictably be deceived / misled about the state of the world

I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.

An ability to refuse to generate theories about a hypothetical world being in a simulation.

Quick brainstorm:

  1. Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
  2. Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
  3. Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
  4. Goal robustness: a corrigible AGI should maintain corrigibility even after N adversarially-chosen gradient steps (the higher the better).
  5. Satiability: a corrigible AGI should be able to pre-specify a rough level of completeness of a given task, and then shut down after reaching that point.

Disclaimer: I am not writing my full opinions. I am writing this as if I was an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define the "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)



An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful human control over the AI.

Different parts of the corrigible AI may be restricted relative to an idealized agent - world-modeling, consequence-ranking, or action-choosing. When elements of the agent are updated by learning or training, the updating process must preserve these restrictions. This is nontrivial because simple metrics of success may be better-fulfilled by more agential AIs. See restricted learning for further discussion, especially restricted learning § non-compensation for open problems related to preventing learning or training one part of the AI from compensating for restrictions nominally located in other parts.

Restricted world-modeling

Restricted world-modeling is a common reason for AI to be safe. For example, an AI designed to play the computer game Brick-Break may choose the action that maximizes its score, which would be unsafe if actions were evaluated using a complete model of the world. However, if actions are evaluated using a simulation of the game of Brick-Break, or if the AI's world model is otherwise restricted to modeling the game, then it is likely to choose actions that are safe.

Many proposals for "tool AI" or "science AI" fall into this category. If we can create a closed model of a domain (e.g. the electronic properties of crystalline solids), and simple objectives within that domain correspond to solutions to real-world problems (e.g. superconductor design), then learning and search within the model can be safe yet valuable.

It may seem that these solutions do not apply when we want to use the AI to solve problems that require learning about the world in general. However, some closely related avenues are being explored.

Perhaps the simplest is to identify things that we don't want the AI to think about, and exclude them from the world-model, while still having a world-model that encompasses most of the world. For example, an AI that deliberately doesn't know about the measures humans have put in place to shut it off, or an AI that doesn't have a detailed understanding of human psychology. However, this can be brittle in practice, because ignorance incentivizes learning. For more on the learning problem, see restricted learning § doublethink.

Real time intervention on AI designs that have more dynamic interactions between their internal state and the world model falls under the umbrella of thought policing and policeability. This intersects with altering the action-choosing procedure to select policies that do not violate certain rules, see § deontology.

Counterfactual agency

A corrigible AI built with counterfactual agency does not model the world as it is, instead its world model describes some counterfactual world, and it chooses actions that have good consequences within that counterfactual world.

The strategies in this general class are best thought of in terms of restricted action-choosing. We can describe them with an agent that has an accurate model of the world, but chooses actions by generating a counterfactual world and then evaluating actions' consequences on the counterfactual, rather than the agential procedure. Note that this also introduces some compensatory pressures on the world-model.

The difficulty lies in choosing and automatically constructing counterfactuals (see automatic counterfactual construction) so that the AI's outputs can be interpreted by human operators to solve real-world problems, without those outputs being selected by the AI for real-world consequences. For attempts to quantify the selection pressure of counterfactual plans in the real world, see policy decoherence. One example proposal for counterfactual agency is to construct AIs that act as if they are giving orders to perfectly faithful servants, when in reality the human operators will evaluate the output critically.

Counterfactual agency is also related to constructing agents that act as if they are ignorant of certain pieces of knowledge. Taking the previous example of an AI that doesn't know about human psychology, it might still use learning to produce an accurate world model, but make decisions by predicting the consequences in an edited world model that has less precise predictions for humans, and also freezes those predictions, particularly in value of information calculations. Again, see restricted learning § doublethink.

Mild optimization

We might hope to lessen the danger of agents by reducing how effectively they search the space of solutions, or otherwise restricting that search.

The two simplest approaches are whitelisting or blacklisting. Both restrict the search result to a set that fulfills some pre-specified criteria. Blacklisting refers to permissive criteria, while whitelisting refers to restrictive ones. Both face difficulty in retaining safety properties while solving problems in the real world.


intervening at intermediate reasoning steps

learned human reasoning patterns

Restrictions on consequence-ranking criteria

general deontology


impact regularization

Human oversight

Types of human oversight

Via restrictions on consequence-ranking

Via counterfactual agency

Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of "corrigibility" or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.

This feels to me like very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.

Prior to plausible AGI/FOOM I would box it in really carefully, and I only interact with it in ways where it's expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems for the "council".

I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it’s possible to set up things in such a way that it would be very hard for the system to "pretend" like it’s doing what I want it to do without actually doing it. There are several techniques I would use to achieve this, but one of them is that I often would ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and that have architectures that make it easy to understand and verify behavior). So when for example getting it to generate simulations, I would get from it a simulation-generator (or simulation-generator-generator), and I would test its accuracy against real-world-data.

Here is a draft for a text where I try to explain myself in more detail, but it's not finished yet: