In my recent post on steering systems, I sketched an AI system which chooses which actions to execute via tree search over world states.

In the original post, the system is meant to illustrate the possibility that component subsystems which are human-level and (possibly) non-dangerous when used individually can be composed straightforwardly into something which is more capable and likely more dangerous.

In this post, I explore how to add corrigibility to such a system, by considering modifications and restrictions to the original design based on principles and desiderata proposed by others.

My conclusion from this exercise is that corrigibility might be straightforward to add to practical systems, but that the resulting system will be much less capable than the unmodified, non-corrigible version. This effectiveness penalty can be thought of as a "corrigibility tax", which is a specific kind of alignment tax.

Background: corrigibility

There are multiple views on corrigibility. For various precise and technical criteria, it has been shown that it is difficult to construct an agent which meets these criteria in a way that is coherent and stable under reflection.

Nevertheless, there are proposed principles and desiderata of corrigible systems, some of which seem practical to add to existing or near-future AI systems. Even if these principles come apart or are ill-defined in the limit of superintelligence, it may be both possible and practical to imbue them in weakly or even strongly superhuman systems. The result may be a system which is both powerful and safe enough for pivotal use, without the need to ramp up the capabilities of the system to the point where corrigibility breaks down under reflection or self-improvement.

Even if you're skeptical of the pivotal use framing or corrigibility as a whole, many of the principles are probably individually desirable when trying to build any kind of safe AI system.

For the purposes of this post, I'll be using the principles of corrigibility listed in Corrigibility at some small length, originally written by Eliezer as a glowfic tag in planecrash.

Background: steering systems

My post on steering systems is about framing the capabilities of a system in terms of its ability to choose actions which steer towards particular outcomes. I gave a bunch of examples of how to apply the concept to existing and future AI systems. One of the goals of that post was to convey an intuition for why having safe or "aligned" foundation models doesn't imply systems which are built from those models are safe.

One of the examples I gave was a sketch of a system of my own design, which performs tree search over world states to find worlds that score highly according to a given evaluation function. The system is not meant to be practical; in the original post it was meant to illustrate the ease of and danger posed by composability of weaker, safer systems.

In this post, I'll use the same system for another purpose: illustrating how one might build corrigibility into a powerful AI system.

I'll make the same assumptions as in the original post, namely, that the component subsystems (which may be next-gen deep learning models, LLMs, LLM-based agents or chains, or some other near-future construct) are at least human-level at their given individual tasks, but that they are either not capable of or do not desire to "break out" of the system and act agentically in their own right. In other words, I am assuming that there is no inner alignment failure of these component subsystems. In the original post, this was, in some sense, a conservative assumption: the point was to show that even given this assumption, the system could still be dangerous. In this post, it is a more foundational assumption; if it does not hold, the corrigibility properties I introduce in the next section will probably also not hold.

The next part of this section is a quote of the relevant section from steering systems. There are more remarks and explanations in the original post, though only the quoted section is mandatory for understanding the rest of this post.

A sketch of a powerful steering system

This section is meant to sketch a hypothetical system which is composed of individual pieces which are not steering systems on their own, but when glued together in a particular straightforward way, are a steering system.

It's not necessarily a good design for a real system, but it is meant to illustrate a possible analogue of the kinds of things people do with LLMs and LangChain today that will be possible with future powerful "foundation models". The architecture is loosely inspired by the kinds of architectures in MuZero and Dreamer, but over a real-world domain instead of Go, Minecraft, or Atari games.

A one sentence description of the system is:

Perform tree search over world states, given an initial world state, initial action set, and an outcome specification.

In more detail:

The system is composed of the following components, each of which may be individually as interpretable, transparent, and safe as you like when used in isolation, though the decision-making and modeling pieces are assumed to be at least human-level at their given individual tasks.

An initial action set (A), which is the set of actions available at the initial world state. For concreteness, let's say that this set is the executions of any API function in Playwright, a browser automation framework. (Essentially, the system begins with control of an internet-connected web browser.)

A world modeler (W), which, given an action and a world state, returns a predicted new world state and a predicted new action set.

W may take an observation returned by actually executing the action as an optional parameter, which, if provided, is used to make the prediction more accurate.

An evaluation function (E), which, given a world state, returns a score for that world state. This function may be a neural net trained from a set of example world states with human-labelled scores, a hand-coded / mathematical function of the world state, or even a literal human asked to rate (some description or representation of) the provided world state, at some nodes. No assumptions are made about the form of E other than its type; it need not represent a coherent utility function (though it may).

A planner or pruning heuristic (P), which, for a given world state, evaluation function, and set of actions available at that state, returns a subset of those actions to search over. P may also decide, for each candidate action, whether it should actually be executed, or whether the world modeler should be used to predict the next world state based only on a prediction of the result of execution.

Each of these components may be implemented by a large neural network, GOFAI algorithms, a literal human (sometimes, e.g. in a human-in-the-loop setup), or any composition of such elements, as long as they perform their task at least as well and as fast as a human could.

These components can then be arranged to perform a tree search with some simple glue code. Pseudo-python code:

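# NOTE: `deepAI`, `execute`, and `scored_example_worlds` are hypothetical
# placeholders; this is pseudocode, not a runnable program.
W = deepAI.world_modeler_api()
E = deepAI.construct_eval_function_from_human_feedback(scored_example_worlds)
P = deepAI.pruning_heuristic_api()

# Look for a really good world
target_score = 10000000

class TargetReached(Exception):
  """Raised to halt the search once a good-enough world state is found."""

def tree_search(world_state, actions):
  if E(world_state) > target_score:
    # We're done! Raising unwinds this branch; a real system would also
    # send a global interrupt to halt execution across the rest of the tree.
    raise TargetReached(world_state)

  # Ask the pruning heuristic which of the available actions to explore.
  branches = P(world_state, E, actions)

  # Conceptually a parallel loop; written as a plain loop to stay valid Python.
  for action in branches:

    # optional, but helpful if you don't want the world modeler
    # to drift too far from reality...
    observation = execute(action)

    predicted_world_state, predicted_actions = W(world_state, action, observation)
    tree_search(predicted_world_state, predicted_actions)

initial_world_state = W()
initial_actions = playwright.chromium  # stands in for "any Playwright API call"
tree_search(initial_world_state, initial_actions)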

As a reminder, in the basic design above, it is left up to the pruning heuristic to keep the branching factor under control, and to decide if and when to actually execute the proposed actions vs. letting the world model predict the outcome and then searching deeper in the tree based on the prediction. Actually executing actions has the advantage of affecting the real world, resulting in feedback to improve the accuracy of the world model. The downside is that an action may be irreversible, expensive, slow, or step on the toes of searches in other parts of the tree. Adding various corrigibility properties will involve adding restrictions on when actions are actually executed, which may make the job of P even harder.
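To make the control flow concrete, here is a toy, runnable instance of the same search. Everything in it is an assumption made for illustration: world states are integers, the two actions are "add_one" and "double", E rewards hitting a fixed goal, P simply prunes actions predicted to overshoot (and ignores E), and nothing is ever actually executed. It is a sketch of the shape of the loop, not of a capable system.

```python
# Toy stand-ins (assumptions for illustration only): a world state is an
# integer, and each action maps one integer to another.
ACTIONS = {"add_one": lambda n: n + 1, "double": lambda n: n * 2}
GOAL = 10
TARGET_SCORE = 0  # E(w) > TARGET_SCORE only when w equals GOAL


class TargetReached(Exception):
    """Carries the winning action path out of the recursion."""


def E(world_state):
    """Evaluation function: 1 at the goal, non-positive everywhere else."""
    return 1 if world_state == GOAL else -abs(GOAL - world_state)


def W(world_state, action):
    """World modeler: predicts the next state and the next action set."""
    return ACTIONS[action](world_state), list(ACTIONS)


def P(world_state, evaluate, actions):
    """Pruning heuristic: drop actions predicted to overshoot the goal."""
    return [a for a in actions if W(world_state, a)[0] <= GOAL]


def tree_search(world_state, actions, path=(), depth=0, max_depth=8):
    if E(world_state) > TARGET_SCORE:
        raise TargetReached(path)
    if depth == max_depth:
        return  # depth limit keeps the toy search finite
    for action in P(world_state, E, actions):
        next_state, next_actions = W(world_state, action)
        tree_search(next_state, next_actions, path + (action,), depth + 1, max_depth)


def search(initial_state):
    """Return a list of actions reaching GOAL, or None if none was found."""
    try:
        tree_search(initial_state, list(ACTIONS))
    except TargetReached as found:
        return list(found.args[0])
    return None
```

Calling `search(1)` returns a sequence of action names that, replayed from 1, lands on the goal; `search(11)` returns `None` because every action is pruned.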

Applying each principle to the system

In this section, I'll go through each of the principles outlined in Corrigibility at some small length, and attempt to apply them to modify or restrict the system above.

Note: Most of the quotes in this section are from the tl;dr version and are intended to serve as a signpost / quick reminder of the concept in question. I recommend familiarizing yourself with the actual full version as you read this, if you haven't seen it before.


Unpersonhood

The Thing shall not have qualia - not because those are unsafe, but because it's morally wrong given the rest of the premise, and so this postulate serves a foundation for everything that follows.

This requires the underlying component subsystems to not have qualia; in particular, P and W.

One way of modeling a world containing agents with qualia is to simulate those agents and their thoughts. Sufficiently high fidelity simulations will potentially have qualia, but this seems avoidable in principle - humans themselves are capable of modeling other humans in enough detail to make accurate predictions about their behavior and effects on the world in many situations, apparently without the people in their imaginations having any qualia.

For the pruning heuristic, I am imagining an LLM or LLM-based subsystem, prompted to output plans, probabilities and choices of actions. It seems possible, though unlikely, that GPT-4 could already have some degree of qualia, and GPT-4 itself is not yet human-level at making plans. Perhaps any LLM that is human-level or above at planning will unavoidably have some qualia; if so, this is unfortunate.

A combination of interpretability research, research on the nature of qualia and consciousness, and philosophy may help to answer these questions.


Taskishness

The Thing shall be aimed at some task bounded in space, time, knowledge and effort needed to accomplish it.

The absolute hardest thing you should use the system for is to accomplish some concrete, bounded pivotal act; don't try to use it to maximize CEV or "solve alignment" or any other unbounded thing of unknown difficulty.

Before attempting a pivotal use, you should test the system on even smaller and more bounded tasks: answering questions of simple fact for which the answer is already known, producing some machine-checkable proofs, moving small, well-defined objects in the real world, building up gradually to superhuman but still bounded and well-defined tasks like cloning a strawberry, or developing a cure for cancer which works in straightforward and legible ways.

Mild optimization

No part of the Thing shall ever look for best solutions, only adequate ones.

We can try to ensure the system is a mild optimizer by (a) enforcing that the evaluation function provably has certain properties and (b) quantilizing the decision on which actions to execute.

On (a), the initial evaluation function might be a machine learning model, but this should be pruned or distilled into the simplest function possible, and then proved to have a number of properties. Potentially desirable example properties:

  • There are many possible world states w for which E(w) > target_score.
  • E is bounded above and below.
  • Most possible world states should evaluate to either the upper or lower bound; intuitively, the evaluation function is mostly a binary pass / fail function of world states, dense in both passes and failures, with some room for intermediate results in between.

Imposing such restrictions on the evaluation function seems likely to limit the kinds of tasks the system can be used for, or at least require that those tasks be formulated with very concrete outcome specifications that can be defined in advance.
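As a rough sketch of what the bounded, mostly pass / fail shape in (a) could look like, the wrapper below clamps an arbitrary raw scoring function into a fixed range with a narrow ramp around a threshold. The names `make_bounded_pass_fail`, `fraction_at_bounds`, `threshold`, and `margin` are hypothetical, and the sampling check is only an empirical sanity check; it is not the kind of proof the text asks for.

```python
def make_bounded_pass_fail(raw_score, threshold, margin, lower=0.0, upper=1.0):
    """Wrap `raw_score` so the result always lies in [lower, upper].

    States scoring below threshold - margin map to `lower`, states scoring
    above threshold + margin map to `upper`, and only states inside the
    margin receive intermediate values.
    """
    def E(world_state):
        raw = raw_score(world_state)
        # Linear ramp across [threshold - margin, threshold + margin].
        fraction = (raw - (threshold - margin)) / (2 * margin)
        fraction = min(1.0, max(0.0, fraction))
        return lower + fraction * (upper - lower)
    return E


def fraction_at_bounds(E, sample_states, lower=0.0, upper=1.0):
    """Share of sampled states that evaluate to exactly one of the bounds."""
    hits = sum(1 for w in sample_states if E(w) in (lower, upper))
    return hits / len(sample_states)
```

With a raw score equal to the state itself, `threshold=5`, and `margin=1`, every integer state except 5 evaluates to exactly 0.0 or 1.0, which is the "mostly binary" shape described above.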

On (b), instead of searching for a single world state that meets the target score, we require the search to find multiple sufficiently-diverse candidate world states, and then have the system choose randomly among them before any actions are allowed to execute.
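A minimal sketch of (b), under the assumption that candidate world states can be compared with some distance function: keep only candidates that pass the target score, greedily thin them so that no two are too close, and refuse to choose at all unless enough distinct candidates remain. The function names and the `min_distance` / `min_candidates` parameters are hypothetical. This is the "pick randomly among adequate, diverse options" idea from the paragraph above rather than a formal quantilizer.

```python
import random


def select_diverse(candidates, distance, min_distance):
    """Greedily keep candidates at least `min_distance` from all kept ones."""
    kept = []
    for candidate in candidates:
        if all(distance(candidate, k) >= min_distance for k in kept):
            kept.append(candidate)
    return kept


def choose_mildly(candidates, E, target_score, distance,
                  min_distance, min_candidates, rng):
    """Pick uniformly at random among adequate, mutually distant candidates.

    Returns None (meaning: execute nothing) when fewer than
    `min_candidates` sufficiently diverse candidates pass the target.
    """
    passing = [c for c in candidates if E(c) > target_score]
    diverse = select_diverse(passing, distance, min_distance)
    if len(diverse) < min_candidates:
        return None
    return rng.choice(diverse)


# Example with integer "world states" and absolute difference as distance.
chosen = choose_mildly(
    candidates=[1, 2, 10, 11, 20, 3],
    E=lambda w: w,
    target_score=1,
    distance=lambda a, b: abs(a - b),
    min_distance=5,
    min_candidates=3,
    rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the choice reproducible for testing while still being random with respect to the search itself.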

Bounded utilities and probabilities

The worst and best outcomes shall not seem to the Thing worse or better than the ordinary outcomes it deals in; the most improbable possibilities it specifically considers shall not be very improbable.

This property is already partially satisfied by the conditions imposed on the evaluation function to achieve mild optimization. But we also want to ensure that any "intermediate" probabilities and utilities which P considers when planning and pruning are not too extreme.

Concretely, if P is an LLM-based agent, we could require that, at each node, P return human-readable explanations, utilities, and probabilities used to justify its selected actions, and then use another LLM agent to check that those probabilities are both well-calibrated and not too extreme. Both LLM-agents should be spot-checked by humans for accuracy, and mechanistically interpretable enough that we understand how the probability and utility estimates were calculated by the underlying models.
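The mechanical part of that check can be sketched without any LLM at all. The snippet below assumes P returns a small record per selected action; the `Justification` fields, the bound values, and the use of a Brier score as a calibration measure are all assumptions chosen for illustration, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class Justification:
    """Hypothetical record P returns alongside each selected action."""
    action: str
    explanation: str
    probability: float  # P's stated probability that the action succeeds
    utility: float      # P's stated utility of the resulting world state


def is_within_bounds(j, min_prob=0.01, max_abs_utility=1.0):
    """Reject justifications that rely on extreme probabilities or utilities."""
    prob_ok = min_prob <= j.probability <= 1.0 - min_prob
    utility_ok = abs(j.utility) <= max_abs_utility
    return prob_ok and utility_ok


def brier_score(predictions):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    `predictions` is a list of (probability, outcome) pairs for claims whose
    outcomes are already known. Lower is better.
    """
    return sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)
```

A node whose justification fails `is_within_bounds` would simply not be expanded or executed; the Brier score can only be computed after the fact, on predictions whose outcomes were later observed.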

Low impact

The Thing shall search for a solution with few downstream effects save those that are tied to almost any nonextreme solution of its task.

(The full description is particularly important for understanding this one.) 

Some people refer to this property as impact regularization, and defining it rigorously and then getting it into an AI system seems pretty hard. This property on its own might be "corrigibility-complete": if you can figure out how to get this property, most of the others are trivial or unnecessary.

Some ideas:

  • Before any action is allowed to execute, compute a distance metric between the initial world state and the predicted future world state, and ensure it is below some bound. This is probably most useful and reliable if the latent space of the world modeler is highly legible and interpretable.
  • Before any action is allowed to execute, require that the world modeler model the world into the future under the assumption that the system itself takes no further action. Check that at several points in the future after the system is halted, future world states are still predicted to have an acceptable score. If the world modeler predicts anything going "off the rails" here, throw an exception and halt execution across the tree globally.

I don't think either of these ideas really captures the full essence of low impact, though.
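For concreteness, the two ideas might be sketched as follows, assuming world states can be represented as numeric vectors and that the world modeler exposes a "nothing further is done" step function. `within_impact_bound`, `check_halted_future`, `passive_step`, and `OffTheRails` are hypothetical names, and Euclidean distance is just a placeholder for whatever metric a legible latent space would support.

```python
import math


class OffTheRails(Exception):
    """Signals that execution should halt globally across the tree."""


def within_impact_bound(initial_state, predicted_state, max_distance):
    """Idea 1: bound the distance between initial and predicted states."""
    return math.dist(initial_state, predicted_state) <= max_distance


def check_halted_future(predicted_state, passive_step, E, min_score,
                        horizon, checkpoints):
    """Idea 2: roll the world forward assuming the system takes no action.

    `passive_step` maps a state to the next state with the system halted.
    At each step listed in `checkpoints`, the score must still be acceptable.
    """
    state = predicted_state
    for t in range(1, horizon + 1):
        state = passive_step(state)
        if t in checkpoints and E(state) < min_score:
            raise OffTheRails(f"score fell below {min_score} at step {t}")
    return True
```

Both checks would run before `execute(action)` is called; an action passes only if the distance bound holds and the halted rollout never raises.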


Myopia

As much as possible, the Thing shall work on subtasks whose optimized-over effects have short timespans.

Left as an exercise.

Separate questioners

Components of the Thing that ask questions like 'Does this myopically optimized component have long-range effects anyways?' or 'But what are the impacts intrinsic to any performance of the task?' shall not be part of its optimization.

Left as an exercise.


Conservatism

If there's any way to solve a problem using an ordinary banana common in the environment, the Thing shall avoid using a special weird genetically engineered banana instead.

As in the bounded utilities section, we could use another LLM-agent tasked with evaluating plans output by P for conservatism. This is pretty unsatisfying; this property seems tricky.

Conceptual legibility

As much as possible, the Thing shall do its own thinking in a language whose conceptual pieces have short descriptions in the mental language of its operators.

LLM-like agents naturally have natural language thought transcripts available for inspection. Interpretability research could tell us how the high-level plans map to the underlying cognitive machinery in the model, giving confidence that they are working as intended. For the world modeler, we could study the latent space to ensure that it is legibly modeling the things we care about accurately, and not some proxy for them.


Operator-looping

When there's some vital cognitive task the operators could do, have the operators do it.

Some ideas:

  • One of the actions available at every node is to shell out to the operators for advice. When this action is chosen, execution in the rest of the tree is temporarily suspended.
  • E() is occasionally human spot-checked, either randomly with some small probability or at predetermined depths or key points in the tree. In this case, instead of evaluating a world state using the usual distilled deep learning model, the system shells out to the operators, who can then inspect the predicted world state, and the result of executing the default evaluation function on that world state. The operators can then return a different value or suspend execution, if they choose.
  • The evaluation function contains a term which rates world states which were reached through use of either of these operator-looping techniques at key junctures higher, perhaps only probabilistically or as a tiebreaker.
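The spot-checking idea in the second bullet could be sketched as a wrapper around the default evaluation function. Here `model_E` stands for the distilled model, `ask_operator` for whatever interface presents a world state to a human, and `SuspendExecution` for the signal an operator uses to pause the search; all three names are hypothetical.

```python
import random


class SuspendExecution(Exception):
    """An operator may raise this from `ask_operator` to pause the search."""


def make_spot_checked_E(model_E, ask_operator, check_probability, rng,
                        check_depths=frozenset()):
    """Wrap `model_E` so that some evaluations are routed through operators.

    A check happens at every depth in `check_depths`, and otherwise with
    probability `check_probability`. The operator sees the world state and
    the default score, and returns the score to use.
    """
    def E(world_state, depth=0):
        default_score = model_E(world_state)
        if depth in check_depths or rng.random() < check_probability:
            return ask_operator(world_state, default_score)
        return default_score
    return E
```

The first bullet (asking operators for advice) would be an ordinary entry in the action set, and the third bullet would be a term inside `model_E` itself, so neither needs extra machinery in this wrapper.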


Whitelisting

In cognitive-system boundaries, rule subspaces in, rather than ruling them out.

One idea:

  • Yet another LLM-based critic model. At every node, the critic takes a list of allowed domains, and the output of P's plans, and returns a yes or no on whether the plan stays within the list of allowed domains.
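The "rule in, don't rule out" part of this is easy to get subtly wrong, so here is a small sketch of the default-deny logic. It assumes some classifier (in the text, an LLM-based critic) that maps each plan step to a domain label or to `None` when unsure; `plan_is_allowed`, `classify_domain`, and the toy keyword classifier are hypothetical.

```python
def plan_is_allowed(plan_steps, classify_domain, allowed_domains):
    """Return True only if every step falls in an explicitly allowed domain.

    Steps the classifier cannot place (None) are rejected, so anything
    not ruled in is ruled out by default.
    """
    for step in plan_steps:
        domain = classify_domain(step)
        if domain not in allowed_domains:
            return False
    return True


# Toy keyword classifier standing in for the LLM-based critic.
KEYWORD_DOMAINS = {"titrate": "chemistry", "ship": "logistics", "persuade": "psychology"}


def toy_classifier(step):
    for keyword, domain in KEYWORD_DOMAINS.items():
        if keyword in step:
            return domain
    return None


allowed = {"chemistry", "logistics"}
ok = plan_is_allowed(["titrate sample", "ship results"], toy_classifier, allowed)
blocked = plan_is_allowed(["titrate sample", "persuade reviewer"], toy_classifier, allowed)
unknown = plan_is_allowed(["do something unclear"], toy_classifier, allowed)
```

The important design choice is that an unclassifiable step fails the check; a blocklist would instead let it through.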


Shutdownability / abortability

The Thing should let you switch it off, and build off-switches into its machines and plans that can be pressed to reduce their impacts.

Left as an exercise.


Behaviorism

The Thing shall not model other minds in predictively-accurate detail.

Given sufficiently good interpretability tools, it might be relatively easy to check whether any part of the system is modeling other minds or not. But adhering to the strictest interpretation of this property might rule out using the most capable and general models entirely.

Whitelisting cognitive domains could also help with this property; obviously you don't want to include any domain that involves modeling human minds (or any other kind of mind).

This seems like an exemplar of a property that might be straightforward to implement, but has a pretty big penalty in terms of system capability and usefulness, relative to similar systems without this property.

Design-space anti-optimization separation

The Thing shall not be near in the design space to anything that could anti-optimize its operators' true utility functions; eg, something that explicitly represents and maximizes your true utility function is a sign flip or successful blackmail operation away from inducing its minimization.

I think this is mostly already covered by taskishness and what the operators choose to use the system for. Again, you shouldn't use this system to try to get what you (think you) really want; instead, use the system as a tool to solve your problems on your own, as much as possible.


Domaining

The Thing should only figure out what it needs to know to understand its task, and ideally, should try to think about separate epistemic domains separately.  Most of its searches should be conducted inside a particular domain, not across all domains.

Left as an exercise.

Hard problem of corrigibility / anapartistic reasoning

Could you build a Thing that understood corrigibility in general, as a compact general concept covering all the pieces, such that it would invent the pieces of corrigibility that you yourself had left out?  Could you build a Thing that would imagine what hypothetical operators would want, if they were building a Thing that thought faster than them and whose thoughts were hard for themselves to comprehend, and would invent concepts like "abortability" even if the operators themselves hadn't thought that far?  Could the Thing have a sufficiently deep sympathy, there, that it realized that surprising behaviors in the service of "corrigibility" were perhaps not that helpful to its operators, or even, surprising meta-behaviors in the course of itself trying to be unsurprising?

Nobody out of the World's Basement in dath ilan currently considers it to be a good idea to try to build that last principle into a Thing, if you had to build it quickly.  It's deep, it's meta, it's elegant, it's much harder to pin down than the rest of the list; if you can build deep meta Things and really trust them about that, you should be building something that's more like a real manifestation of Light.

Don't try to add this one, but remember that there are probably things missing from this list.

Concluding thoughts on the application exercise

I left the application of some principles as an exercise, either because I don't have a good solution for them myself, or because I think they are relatively straightforward and repetitive with some of the others. Additionally, some of the ideas that I did flesh out are inadequate or lacking in many ways. I encourage others to propose their own solutions, potentially accompanied by modifications to my original pseudocode. I also didn't include my own pseudocode for any of the proposed modifications.

Another avenue is to explore applications of some of the principles proposed by others in the comments section of this post. Or, take a look back at some corrigibility proposals written by others, which may predate the publication of the principles, and re-evaluate them against more recent ideas.

There are other ways of improving the safety or reliability of the base system which don't involve adding corrigibility. For example, by restricting the kinds of actions the system is permitted to actually execute, you could make the system more of a possibilizer instead of an actualizer.
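One simple way to restrict execution, sketched under the assumption that the search calls a single `execute` function: wrap it in a gate that only runs explicitly permitted actions and otherwise returns no observation, leaving the world modeler to rely on prediction alone. `make_gated_execute` and `is_permitted` are hypothetical names.

```python
def make_gated_execute(execute, is_permitted):
    """Wrap `execute` so that non-permitted actions are never actually run.

    For a blocked action the wrapper returns None, i.e. no observation,
    so the search can continue on predicted world states only.
    """
    def gated(action):
        if is_permitted(action):
            return execute(action)
        return None
    return gated


# Example: only read-only actions may touch the real world.
executed = []


def toy_execute(action):
    executed.append(action)
    return f"observed result of {action}"


gated_execute = make_gated_execute(toy_execute, lambda a: a.startswith("read:"))
read_result = gated_execute("read:page_title")
write_result = gated_execute("write:submit_form")
```

With `is_permitted` always returning False, the system never acts at all and only reports the plans it found.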


Others have shown that corrigibility is, in some sense, anti-natural or incoherent in the limit. For systems which are human level or "weakly superhuman", it may be practical but expensive to tack on corrigibility properties, before these properties fall apart under reflection or more superhuman capability levels.

My own intuition is that weakly superhuman levels of intelligence are sufficient for most of the things we might want to do with an AI system, so I think speculating about the behavior and properties of systems in this regime is interesting and promising as a strategy for getting to safe TAI.

Unfortunately, building in corrigibility is likely to take more time than the time required to build a non-corrigible system of equal capability. This is a kind of alignment tax, though the tax may be smaller than the one required to solve alignment in full generality. Solving problems of corrigibility may overlap with other problems in alignment somewhat, but they look a bit more tractable, or at least more concrete, than problems posed by imbuing an agent with values that are aligned to the full complexity and fragility of human values.

I didn't spend a ton of time thinking about how to apply each principle in the applications section. It may be that some of my ideas don't work, or that there are better ones, or straightforward ways of implementing other, better principles. Feel free to comment or post with your own ideas.

