Lauro Langosco

Let's See You Write That Corrigibility Tag

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping them achieve their objectives) with future, past, or concurrent (duplicate) versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability
  • It doesn't "optimize too hard" (not sure how to state this better)
    • Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
  • Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like "make me a coffee", or "tell me a likely outcome of this plan") and then shuts down
  • It doesn't optimize over causal pathways you don't want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
  • It does not try to become more consequentialist with respect to its goals
    • for example, if in the middle of deployment the system reads a probability theory textbook, learns about Dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
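The satisficing idea in the bullets above can be illustrated with a toy sketch (the function and names here are my own illustration, not from the post): rather than picking the plan that maximizes success probability, the agent picks any plan whose estimated success probability clears a threshold.

```python
import random

def satisficing_choice(plans, success_prob, threshold=0.99):
    """Pick any plan that clears the success threshold,
    rather than the plan that maximizes success probability."""
    good_enough = [p for p in plans if success_prob(p) >= threshold]
    if not good_enough:
        # Rather than optimizing harder, defer to the operators.
        return None
    return random.choice(good_enough)
```

The contrast is with `max(plans, key=success_prob)`: the satisficer is indifferent among all plans above the bar, which removes the pressure toward extreme, highly optimized plans.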

No weird stuff

  • It doesn't try to acausally cooperate or trade with far-away possible AIs
  • It doesn't come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
  • It doesn't attempt to simulate a misaligned intelligence
  • In fact it doesn't simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task

Human imitation

  • Where possible, it should imitate a human that is trying to be corrigible
  • To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
  • When this is not possible (e.g. because it is executing strategies that a human could not), it should stay close to human-extrapolated behaviour ("what would a corrigible, unusually smart / competent / knowledgeable human do?")
  • To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or "optimize too hard", and [other corrigibility desiderata]

Querying / robustness

  • Insofar as this is feasible, it presents its plans to humans for approval, including estimates of the consequences of its plans
  • It will raise an exception, i.e., pause execution of its plans and notify its operators, if:
    • its instructions are unclear
    • it recognizes a flaw in its design
    • it sees a way in which corrigibility could be strengthened
    • in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
    • in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
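The exception-raising behaviour above can be sketched as a pre-execution check (the `CorrigibilityException` class, condition names, and plan fields are my own illustration; each boolean stands in for what would in practice be a hard estimation problem):

```python
class CorrigibilityException(Exception):
    """Pause plan execution and notify operators."""

def check_tripwires(plan):
    # Map each desideratum to a (stand-in) predicate on the plan.
    tripwires = {
        "instructions are unclear": plan.get("instructions_unclear", False),
        "recognized a flaw in its design": plan.get("design_flaw", False),
        "corrigibility could be strengthened": plan.get("corrigibility_improvement", False),
        "operators' ability to shut it down or modify it would be limited":
            plan.get("limits_operator_control", False),
        "operators would be deceived or misled": plan.get("misleads_operators", False),
    }
    fired = [name for name, is_fired in tripwires.items() if is_fired]
    if fired:
        raise CorrigibilityException(f"Pausing execution; notifying operators: {fired}")

def execute(plan):
    check_tripwires(plan)  # pause before acting if any condition fires
    return "executed"
```

The design choice mirrors the post: tripping any condition halts execution and surfaces the issue, rather than letting the agent weigh the condition against task completion.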

Relaxed adversarial training for inner alignment

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone encountering the term for the first time. Since this is the current reference post for RAT, I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training' and, within the section, calling it that instead of 'Paul's approach'.