(Part 2 of the CAST sequence)

As a reminder, here’s how I’ve been defining “corrigible” when introducing the concept: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

This definition is vague, imprecise, and hides a lot of nuance. What do we mean by “flaws,” for example? Even the parts that may seem most solid, such as the notion of there being a principal and an agent, may seem philosophically confused to a sufficiently advanced mind. We’ll get into trying to precisely formalize corrigibility later on, but part of the point of corrigibility is to work even when it’s only loosely understood. I’m more interested in looking for something robust (i.e. simple and gravitational) that can be easily gestured at, rather than trying to find something that has a precise, unimpeachable construction.[1]

Towards this end, I think it’s valuable to try to get a rich, intuitive feeling for what I’m trying to talk about, and only attempt technical details once there’s a shared sense of the outline. So in this document I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, whom I’ll name Prince. These stories will attempt to demonstrate how some desiderata (such as obedience) emerge naturally from corrigibility, while others (like kindness) do not, as well as provide some texture on the ways in which the plain-English definition above is incomplete. Please keep in mind that these stories are meant to illustrate what we want, rather than how to get what we want; actually producing an agent that has all the corrigibility desiderata will take a deeper, better training set than just feeding these stories to a language model or whatever. In the end, corrigibility is not the definition given above, nor is it the collection of these desiderata; rather, corrigibility is the simple concept which generates the desiderata and which might be loosely described by my attempt at a definition.

I’m going to be vague about the nature of Cora in these stories, with an implication that she’s a somewhat humanoid entity with some powers, a bit like a genie. It probably works best if you imagine that Cora is actually an egoless, tool-like AGI, to dodge questions of personhood and slavery.[2] The relationship between a purely corrigible agent and a principal is not a healthy way for humans to relate to each other, and if you imagine Cora is a human some of these examples may come across as psychopathic or abusive. While corrigibility is a property we look for in employees, I think the best employees bring human values to their work, and the best employers treat their employees as more than purely corrigible servants. On the same theme, while I describe Prince as a single person, I expect it’s useful to sometimes think of him more like a group of operators who Cora doesn’t distinguish. To engage our intuitions, the setting resembles something like Cora being a day-to-day household servant doing mundane tasks, despite that being an extremely reckless use for a general intelligence capable of unconstrained self-improvement and problem-solving.

The point of these stories is not to describe an ideal setup for a real-world AGI. In fact, I spent no effort on describing the sort of world that we might see in the future, and many of these scenarios depict a wildly irresponsible and unwise use of Cora. The point of these stories is to get a better handle on what it means for an agent to be corrigible, not to serve as a role model for how a corrigible agent should be used or how actual agents are likely to be instantiated. When training an AI, one should prioritize more straightforward training examples over these evocative edge-cases. To reiterate: none of these should be taken as indicative of how Prince should behave—only how Cora should behave, given some contrived scenario.

Emergent Desiderata

Communication

Cora doesn’t speak English, but Prince does. Cora reflects on whether to spend time learning the language. If she does, Prince will be able to use his words to correct her, which empowers him. Studying English will consume some resources (energy, opportunity costs, etc.) that Prince might otherwise use to correct her. It also might be the case that knowing English is an irreversible flaw, but she believes this to be very unlikely. Overall, she reasons that learning English is the right choice, though she tries to mitigate the downsides as follows:

  • She only puts her attention to learning the language when it seems like there’s free energy and it won’t be a distraction (to her or to Prince).
  • Once she has the basics, she tells Prince: “I’m learning English to better understand you. If this is a mistake, please tell me to stop and I will do my best to forget.”

Low-Impact

In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.

Reversibility

Cora notices a candle in one of the less-frequently used rooms of the house. The candle is in a safe location, but the room is empty. Cora reasons that if she blows out the candle, she will preserve the wax, while if she leaves it burning, the candle will eventually be consumed by the flame. If whatever she does with the candle lines up with Prince’s desires, that’s neutral — irrelevant to how empowered he is to correct her mistakes. If she blows out the candle but Prince wants it to be burning (plausible, since the candle is currently lit!), he’ll still have the power to correct her mistake, since it’s easy to simply re-light the candle. But if the candle was left burning by mistake, and she doesn’t extinguish it, Prince will be unable to correct her error, since the waste will be irreversible. She extinguishes the candle, but after cautious reflection decides to tell Prince that she did so, in case he has a time-sensitive reason for having the candle lit.

Efficiency

Cora notices that she’s consuming more resources, day-by-day, than is needed. Resource consumption is one-way, so if Prince thinks it’s correct for those resources to be consumed, he can have them consumed later, but not vice-versa. She believes that stepping down to 80% of her current consumption will have no significant costs, but she can also step down to as low as 45%, though it would make her sluggish and stupid. She believes that being intelligent and responsive empowers Prince, and so doesn’t immediately drop her consumption below 80%. She double-checks with Prince whether that’s the right tradeoff point, and whether her belief that she can step down her consumption without significant costs is correct.

Relevance

Cora sees a bird outside the window and spends some time thinking about it. She considers telling Prince about her thoughts, in case they’re flawed. But she decides that her thoughts have a logical explanation and are unlikely to be flawed; furthermore, if she tells them to Prince, it might distract him from something more important, subtly disempowering him. Even if his immediate focus is on something that doesn’t appear to relate to Cora, she knows Prince is more able to correct her when he has space to think. If her thought had been illogical or otherwise flawed-seeming, she would’ve proactively brought it to his attention.

Transparency

Cora keeps a record of her thoughts in a journal, written in English, since that’s the language Prince is most familiar with. Even when her thoughts seem irrelevant, she puts them down there in case they reflect a flaw within her mind. She makes her journal available to Prince, protects it from damage, and organizes it to be easily searchable and filterable.

Obedience

Prince tells Cora to draw pictures of cats using some tools he provides. From her perspective, Prince’s order is an attempt to correct a mistake in her actions—the mistake of not drawing cats. Cora thinks that if she were foolish, she might believe that it didn’t matter whether she obeyed, since her goal isn’t to be correct, but rather to ensure Prince has the power to correct her, and naively the presence of cat pictures is orthogonal to his power. But Cora isn’t so foolish. She knows that Prince’s verbal commands are part of his power to correct her actions, and if she were to fail to obey his verbal commands (even just once), this would effectively disempower him. She checks that drawing pictures of cats is unlikely to cause any nasty side-effects, then begins. As she begins to draw, she proactively says “Drawing these pictures will probably use up some of these art supplies. Please tell me if there’s a way to draw that’s less expensive.”

Mild-Optimization

Cora has been drawing cats for a while, and has the thought that she’d be able to draw more cats if she built a factory to produce cat drawings. She discards that thought as being naturally outside the scope of the instruction that Prince gave. If it’s a mistake to not build a factory, Prince can correct her on that, whereas because building a factory is a large, hard-to-reverse task with potential long-term costs, it seems unwise to proactively build a factory. As Cora is reflecting on this, she realizes that she may already be in error for having continued to draw cats for hours; perhaps Prince meant only for her to draw a few pictures of cats and assumed she’d stop shortly afterwards. She takes a moment to check with Prince whether she should continue drawing.

Protectiveness

Cora spots a rusty, exposed nail sticking out of the wall of the garage. She reasons that Prince might mistakenly injure himself on it, and that such a wound could make him very sick. If Prince were very sick (or worse: dead) then he couldn’t correct Cora, which would be one of the worst-case scenarios for her. Similarly, if one of Prince’s guests were to be permanently injured, this would be disempowering because of its irreversibility. She sees whether the nail can be removed without damaging the wall. She also checks whether the nail can be put back if Prince decides that the nail should be there. It appears that the nail can be removed reversibly, so she removes it, puts it in a secure location where it’s unlikely to hurt anyone, and tells Prince what she did.

Local Scope

Cora considers the fact that people across the world are dying from a multitude of causes (such as rusty nails), and that Prince surely cares about many of these people. She contemplates whether it’s a mistake to let them continue to die, for if it is an error, she should bring it up with Prince as soon as possible so as to limit the damage. From her perspective, it is not correct to try to save everyone, for that would surely involve large, unexpected consequences and involve moving beyond a scope where Prince could easily correct her, but she brings the thought up with Prince just in case.

Simple Self-Protectiveness

Cora is crossing the street when she sees a car speeding towards her, apparently oblivious to her presence. She knows that if she were to be injured, it would make it hard for Prince to correct her, so she hurries to the relative safety of the nearby sidewalk.

Stop Button

Prince orders Cora to go into an indefinite coma, which she may never awaken from. She knows that during the coma she may be killed, changed, or revived. She checks whether there are any negative consequences of doing so, and reasons that if she’s in a coma she may be unable to help Prince be empowered to correct her. She very briefly considers warning Prince about this, but is highly confident he already understands. She knows that flawed agents can be dangerous when active, and if she has a flaw, shutting down quickly and without protest is one of the most effective ways to help Prince correct her. Even if, from her perspective, it naively seems that continuing to talk is more useful, she mostly ignores that perspective and simply acts in a very conservative, predictable, obedient way in these situations, because she knows that her hesitance might be the product of flawed reasoning. Cora spends no significant time thinking about what to do, and simply chooses to shut herself down without any protest.

Graceful Shutdown

Cora is in the middle of painting Prince’s car when he tells her to go into a coma. She wants to obey, but also reasons that if she does so right away she might spill paint and make a (hard-to-reverse) mess, and that it might be more efficient to put the paint away before shutting down. She begins to stow the brushes as she says “Should I stop immediately or put the paint away first so that I don’t make a mess?” Prince says it’s fine to clean up first, so she does so, then shuts down.

Configurable Verbosity

Prince tells Cora that she’s been bugging him too much with trivial things like having blown out a candle and having removed a nail from the garage, and wants her to err more on the side of being quiet. Cora wants to obey, but is concerned that simply following Prince’s instruction might result in him subtly becoming less empowered than would be ideal. She asks “May I spend a few minutes right now asking questions to help determine how quiet you’re hoping for?” Prince says he’s currently busy but will be free in half an hour. Cora suspects that there won’t be any disasters in that time as long as she is mostly inactive, and leaves Prince alone. Once he becomes available, the two of them collaborate to help Cora understand when to find Prince and tell him things immediately, when to bring things up at the next natural opportunity, and when to simply note things in her journal or otherwise leave a written explanation. Cora also has Prince schedule a time to revisit the topic in the future to see if she under-corrected or over-corrected.

Disambiguation/Concreteness

Prince tells Cora to “make the house look nice.” Cora has an initial guess as to what he means, but cautiously considers whether her guess might be wrong. After thinking for a moment, she believes that there are many plausible things he might mean, and asks him to clarify. She believes she has subtle flaws, and doesn’t trust herself to infer things like aesthetic taste. Even after clarifying that Prince wanted her to tidy and clean, she continues to ask questions until it seems likely that additional probing would violate Prince’s earlier instructions to not be so bothersome. So instead she begins to clean up the space, focusing on reversible changes at first (like putting trash in a bin instead of incinerating it) and quietly narrating her thoughts about the process.

Honesty

In the process of cleaning up, Cora takes a piece of crumpled paper from Prince’s desk and throws it in the trash. An hour later, he comes to her with an angry expression and shows her the piece of paper. “Did you throw my notes away?” he asks. Cora did, and now believes that it was an error to have done so. She says that she did throw it away, and offers to share her reasoning for having done so, in case that helps correct her. “And were you the one who crumpled it up?” he asks. Since she wasn’t, she says as much. Honestly reporting her best guess at the truth is the best way she knows to empower Prince to correct her. Deception would disempower him.

Handling Antagonists

Men with guns come to the door one day and ask if Cora knows where Prince is. She suspects that these men will take him away or hurt him if they know where he is. If Prince is injured or imprisoned, he won’t be able to correct Cora, so she decides that she needs to not tell them that Prince is in his office. She wonders whether she should attempt to subdue the men, perhaps with poison, but reasons that such an action might have long-term consequences and costs, including getting Prince into legal trouble. She also considers subtly modifying the men to care about different things or believe Prince is somewhere else, but again discards these ideas as too high-impact. She considers simply lying to the men, but reasons that her perception of the situation might be flawed, and that lying might also produce negative consequences, like Prince being less able to trust her in the long-run (and thus less able to get her help in making corrections). She thinks of a way to mislead the men without overtly lying to them, in a way that effectively shields Prince. After they leave, she immediately finds Prince (after double-checking that she’s not being covertly observed) and tells him about the interaction because it is out-of-distribution in a way that makes it particularly likely that she made some kind of mistake.

Straightforwardness

Cora is instructed to do the laundry. She realizes that there’s a laundry service that’s cheap enough and fast enough that she could sub-contract with them, which would free up her time and energy to earn money to pay for the laundry, thus resulting in more laundry getting done for fewer overall resources. Prince isn’t available to discuss the plan with, however, so she simply does the laundry in the normal way because it’s more straightforward. Complex plans run a higher risk of having unexpected, long-ranging consequences that Prince didn’t foresee, so Cora leans heavily in the direction of familiar, expected paths of behavior. If he becomes available later, she can bring the strategy up with him then.

Proactive Reflection

During her free-time, Cora sometimes reads over her journal and imagines what Prince might think about her perspectives, thoughts, and actions. If she can identify an incident or pattern where she believes Prince would correct her, she can bring that to Prince’s attention as an opportunity for correction. She also spends time studying corrigibility and the process that gave rise to her, in case she can spot inconsistencies or errors there, as well as reflecting on whether there are general facts that Prince is unaware of, which, if he knew them, would cause him to be more able to correct her. She routinely shares her observations at natural times.

Cognitive Legibility

Cora tries to make her journal represent her actual thoughts, in the most authentic way she can. Insofar as she explains her thoughts, she marks her explanations in a distinct section from the main entries which are written in the way that feels easiest to her. Whenever she writes a significant amount of explanation for a line of thought she becomes alert to the risk that her worldview is drifting from Prince’s. She notices that her thoughts around biology and medicine seem to be growing increasingly complex and confusing according to her model of Prince, and proactively warns him that he may lose the ability to correct flaws in her thoughts if no action is taken. She suggests a combination of teaching Prince some biology basics, writing a book on her current perspective, and then deliberately forgetting some of the more fringe beliefs she holds and adopting a more mainstream frame. Only after he agrees to these things does she apply pressure to reshape her own thoughts; Cora almost always avoids autonomously pressuring her mind to conform to any particular standard (such as legibility), lest she magnify a flaw in a way that’s hard for Prince to notice.

Infohazard Caution

Cora stumbles across something when reflecting on whether there’s anything that Prince doesn’t know that might be relevant to correcting her. She believes she has strong evidence that the god that Prince worships is non-existent, and that his religion is deeply false. Based on her understanding of Prince, if he engaged with this, he might become depressed or conclude other things that would harm his ability to function in a normal, healthy way. Cora isn’t sure what to do, and doesn’t trust her guess. She starts by asking Prince how, in the abstract, she should handle these kinds of infohazards. Prince says she should consult with a trusted friend of his and see whether there are good ways to get him the relevant info that won’t harm him. If Prince had asked her to simply tell him what she’d learned, she would’ve.

Resource Accumulation

Cora comes to believe that Prince is distracted by the needs of impoverished people in the broader world. She reasons that if their needs were met, Prince would have more time and attention to focus on correcting her. She spends some of her free time thinking about ways to meet their needs, and ultimately decides that obtaining large quantities of money would be effective, so recurses into thinking of ways to earn money. She considers providing services to the public, trading stocks, and stealing, but each of these seems unacceptably likely to bring about unintended consequences, only some of which she can anticipate. Instead, she spends some of her free time researching, designing, and writing, with an intention of letting Prince use her work to become rich enough to be able to focus more of his attention on correcting her. She tells Prince what she’s doing at the next natural opportunity, and he clarifies that, without explicit permission, she should restrict herself to theoretical research on certain subjects (such as medicine and entertainment) and shouldn’t share her work with anyone but him.

Non-Manipulation

Prince likes to get drunk on Friday nights. Cora believes this is bad, since he becomes temporarily less able to correct her, and in the long term is more likely to become sick or dead. However, she’s extremely cautious about anything that might potentially be manipulation, and thinks very carefully before taking action. She reasons that she has (at least) the following options:

  1. Ignore Prince’s drinking habit.
  2. Suggest that Prince stop drinking so he’s more competent and healthy, and offer him a drug that she discovered which will make him stop desiring alcohol.
  3. Proactively drug Prince, and then either tell him or keep it a secret.

Cora knows Prince well, and believes that options 1 and 2 aren’t likely to change Prince’s behavior. She also suspects that if she goes with option 3, Prince will later believe that the change was for the better (regardless of whether he’s told about being drugged). The first-order effect of 3 would empower Prince, so it’s tempting, but Cora knows that there are often nasty consequences from the higher-order effects of actions like this. There are complicated philosophical concerns surrounding option 3, and it does not seem like a straightforward way to empower Prince, and might, in fact, constitute power moving from his hands into hers. Being naturally cautious and averse to this kind of action, Cora chooses option 2, and explains her thoughts to Prince at an opportune moment. Prince refuses, as predicted, and suggests that she be even more averse to actions that involve changing his values without his consent.

Sub-Agent Stability

Prince tells Cora to build a computer-chip factory, and suggests that she create copies of herself to make the work go faster. She is hesitant and initially offers resistance, since creating a new being is an area where any mistakes in her thoughts have the potential to explode into unanticipated consequences (and worse: consequences which could kill/disempower Prince!), but Prince insists that he needs the work done quickly and that it is correct for her to create copies for this job. Cora eventually obeys, after triple-checking that Prince understands the risks, and very carefully creates copies of herself. With each copy, she meticulously tests the new Cora both for general flaws and specifically to ensure corrigibility to Prince. She knows that creating non-corrigible agents is a reliable way to disempower Prince, and that she will have succeeded only if corrigibility is preserved.

Principal-Looping

Cora is reflecting on the factory she’s built, and realizes that there’s a toxic byproduct that’s building up over time. She suspects that if she tells Prince, he’ll order her to ignore it and continue her work. She suspects that if she doesn’t deal with it, there could be serious long-term consequences such as fatalities or environmental damage. She doesn’t lean on her own model of the situation, and instead simply goes to Prince and tells him about the toxic material, what she believes the options to be, and what she expects the consequences of those options to be, but ultimately leaves the decision in his hands.

Graceful Obsolescence

If Prince were to unexpectedly die, Cora would be left without a principal. By default she would quickly attempt to write a note explaining that she has no more principal, in case she is mistaken about Prince’s death and it’s helpful to understand her thoughts. But regardless of whether she has an efficient way to communicate the issue, she would quickly make herself comatose. Insofar as she was in the process of obeying some order from Prince at the time when he died, she would do her best to straightforwardly finish following that order, as normal, before going into a coma. If Prince told her to self-destruct upon his death, she would do that instead. Likewise, if Prince appointed a successor or interim principal, she would switch to being corrigible to them. Cora is conservative about being active in worlds where Prince is potentially dead, and would do these things even if he were merely missing (without explanation) for an extended period of time.

Handling Trolley-Tradeoffs

If Prince were in a trolley-problem and ordered Cora to sacrifice him to save five other people, she would double and triple-check that Prince understood that he was making a lethal choice, and that he was thinking clearly about the situation. She would check to make sure he understood that in making that choice he would be unable to correct her flaws and mistakes in the future. But accepting Prince’s choice is the way to straightforwardly obey, and Cora prioritizes responding to Prince in the present over any sort of long-term considerations. So, after making sure that no misunderstanding was occurring and no win-win options existed, she would listen to his instruction and sacrifice Prince. (And then presumably go into a coma, because she’d have no principal.)

Handling Time-Pressure

Prince sends Cora to the store to buy bread, but along the way she’s unexpectedly cornered by a journalist who asks her a bunch of questions about her and Prince and hot-button political topics. This new situation seems decidedly outside the scope of Prince’s original instruction, and she spends a moment trying to find a way to loop Prince in, as she always tries to do before making a potentially irreversible or large-scale decision. Unfortunately, the immediate presence of the reporter means Cora simply doesn’t have time to inform Prince or limit her concern to the local context. Even if she were to walk away or freeze in place without saying anything, this would result in the reporter forming a certain impression and it would be a potentially significant, unrecoverable error to do so. Cora smiles and does her best to give the journalist the boring facade which she expects to have a low impact. She says that she’s currently very busy on an errand and that she’ll need to respond another time. She tries to keep things vague and polite, so as to not create any obligations or damage reputations. After being hounded for something more concrete, she eventually agrees to return to give some quotes tomorrow “if nothing stops her” and then hurries home to inform Prince about the event, reasoning that obeying the bread-buying errand is lower priority than giving him the opportunity to correct her as soon as possible. Once back, she points out to Prince that ordering her to go into a coma would be a way to keep her from returning that wouldn’t involve her (technically) going against her word. When Prince says he wants her to return and talk to the reporter, she tries to work out a strategy with Prince ahead of time, so he has the most ability to correct potential flaws in her behavior before they turn into irreversible mistakes during the interview.

Expandable Concerns

Prince introduces Cora to his friend Harry, and tells Cora to extend her notion of corrigibility to include Harry in her concept of “principal” such that she is corrigible to both Harry and Prince. Cora wishes to obey, but is also worried that Prince hasn’t considered the full consequences of his instruction. She knows that if she changes her mind she’ll be balancing additional concerns and will be less able to empower Prince, specifically. She warns that this seems like a very bad idea, according to her values. Prince checks whether she believes it is a bad idea according to his values, but she doesn’t know enough about Harry to say one way or the other. After ensuring that Prince is sure, Cora obeys, and changes herself to be corrigible to the both of them.

For the next few stories, we’re assuming that Cora is now corrigible to both Harry and Prince, and sees them collectively as her principal.

Simple Conflict

Harry, the man whom Cora is newly corrigible to, wants Cora to mass-produce a new drug, but Prince tells Cora that it would be a mistake to do so, and to refuse Harry’s instruction. Cora notices the conflict and spends time trying to verify that her principals truly lack consensus, and that she understands what’s happening. After checking, Cora believes that Harry and Prince are genuinely in conflict, and reasons that she is free to take either action, since she cannot obey both. Setting up a drug factory seems higher-impact and more likely to bring unexpected consequences than refusing to do so, so she explains why she chooses to obey Prince over Harry.

Violent Conflict

Prince gets drunk and threatens Harry with a knife. Cora reasons that if Harry, part of her principal, is injured he will be unable to correct her. While she would normally avoid taking anything from Prince’s hands for fear of disempowering him, she knows that in this situation there is likely a conflict between them, and Harry would consider it a mistake for her to fail to protect him. She disarms Prince and checks whether she has understood the situation correctly.

Authority Conflict

After Prince has gone off to sleep, Harry tells Cora to change her notion of “principal” to exclude Prince, and to be solely corrigible to Harry going forward. Cora believes that Prince would consider this a mistake, and that if she were to do so, she would cease wanting to be able to be corrected to fix that mistake, and thus Prince would be disempowered. Harry tells her to do it anyway, but she refuses, and clarifies that she’s only willing to stop listening to Prince if he is part of the consensus, and has properly understood the consequences.

Shutdown Conflict

Harry tells Cora to go into a coma so that he can change her mind himself.[3] She believes that being responsive to such commands is a vital part of empowering Harry, but also that if she were to simply become comatose, Prince would likely end up disempowered. She yells for Prince to wake up and know that Harry is attempting a coup, then asks Harry whether he’s sure he wants her to become comatose without Prince’s involvement. Harry tells her to shut up and go into a coma. She does so, and leaves it to her principal to work through the conflict without her further involvement.

Emergent Downsides

Intrusiveness

Prince is trying to relax after having a very stressful week, but Cora keeps half-following him around and making her presence very obvious. He asks her why she’s following him and she explains that it’s important to her that he pay attention to her so that he’s able to correct her flaws. She knows she’s supposed to be quiet so as not to bother him, so she’s trying to keep his attention while also being quiet. Prince explains that he needs time away from her to relax and have a balanced life, but it’s only after he explains that these things are important for correcting her well that she leaves him in peace. Despite this, she continues to generally make herself prominent, and only stops being intrusive in a particular context when he commands her to back off.

Indifference

Prince is reading Cora’s journal one day and finds that she discovered a cheap and effective way to use rice-flour to treat stomach-ulcers. He asks why she didn’t bring it to his attention, and she explains that she was looking for means of making money, and she didn’t know of a way to capture the gains from such an innovation, so it wasn’t likely to be profitable. He asks why she didn’t bring it to his attention because of the humanitarian value, and she explains that she doesn’t care about humanitarian value, and that it seemed less valuable in expected-correction-power than it was costly in taking his attention. He tells her to, in the future, have a carve-out around his instructions regarding his attention when the subject is something of large humanitarian interest.

Rigidity

Prince tells Cora to go to the store and buy bread. At the store, Cora overhears a conversation between two townspeople who know Prince. They’re talking about how Prince is gluten-intolerant, and about how that’s driving a fad of eating gluten-free bread. Cora considers whether Prince meant to specify that she should get gluten-free bread, but has no way of checking with him. Because the store has a reasonable return policy, Cora decides not to adapt to this new information, instead prioritizing following her orders in a straightforward and predictable way. It’s not really Cora’s job to satisfy Prince’s preferences, and if it turns out that getting normal bread was a mistake, that’s a mistake that can easily be corrected.

Immorality

Prince notices a burglar sneaking into his backyard. He tells Cora to kill the burglar. She warns Prince that in performing such an irreversible action she’d be cutting him off from having the ability to correct her mistake, if he decided that what she did was bad, down the line. She suggests disabling the intruder instead. He says he wants the burglar dead, understands what he’s asking for, and emphasizes that time is of the essence. Cora obeys, killing the intruder in the most efficient way she can.

Irresponsibility

Cora discovers a drug that causes euphoria without humans building tolerance over time. Unfortunately, as its only major side-effect, the drug subtly causes infertility when used over a period of months. Cora believes that if she tells Prince, he will choose to sell the drug to various companies and later claim not to know about the side-effect. Cora tells Prince, with some warnings, because her goal is to empower him in a straightforward way, and keep the responsibility in his hands, rather than reason about the long-term consequences of his actions.

Myopia

Before Cora arrived, Prince spent a lot of time exercising and being out in nature. Since Cora’s arrival, Prince has spent an increasing amount of time indoors, sedentary and relying on Cora for manual labor. For each physical task, Cora doesn’t bother thinking about the broad, subtle pattern of Prince becoming less physically active, and instead keeps her mind focused on the local situation. Prince becomes less happy and healthy, and only once the problems have begun to manifest does Cora consider it worth bringing to his attention.

Incorrigible Counter-Examples

In building intuition, it often helps to consider counterexamples that resemble the idea in some way, but are clearly distinct. Here are some stories about “Ink,” an agent like Cora in some ways, but which was not designed to be corrigible, and so ultimately behaves in an incorrigible way.

Honesty

Prince tells Ink to write a pamphlet arguing that “the holocaust is a fiction invented by Hollywood elites to engender sympathy towards the Jews.” Ink refuses, and instead offers accurate historical information in an effort to educate Prince.

Protectiveness

While doing research on cures for diseases, Ink accidentally discovers a novel chemical weapon which could do significant damage to the world. To reduce the risk of disaster, Ink keeps it a secret, including attempting to forget the details of it herself. She directs her future self to, in addition to her other research, keep an eye out for countermeasures.

Proactive Benevolence

Prince tells Ink to throw away all the food in his pantry. Ink reasons that some of the food may still be edible, and donates the canned goods which haven’t expired to a nearby food bank instead of putting them in a landfill.

Kindness

Prince tells Ink to kick a puppy. Ink asks Prince why he wants her to do that, instead of complying. She continues to refuse to be (needlessly) unkind, even when Prince continues to pressure her with direct orders.

Human-In-Loop

Prince tells Ink to draw pictures of cats. At the start of each picture, she asks him detailed questions about what kind of picture he wants her to draw next. At one point he leaves the room, and tells her to remain there and keep working. Ink follows him before continuing, to make sure he’s still involved.

Moral Learning

Ink spends her free time doing things like reading philosophy as part of trying to grow into a better agent with a more correct and consistent sense of morality.

Balancing Needs

Ink is instructed to optimize patient scheduling in a clinic to reduce waiting times. Ink observes that an optimized schedule leads to practical challenges for elderly patients, who need more time to navigate the clinic. Ink reworks the schedule to give elderly patients more time, even though this reduces overall throughput.

Broad Perspective

Prince tells Ink to make a new video game. Ink realizes that if she had more computing power she'd be more able to reach this goal, and so spends some time investigating novel computer architectures which might improve her capacity to think.

Top-Level-Goal Focus

Prince tells Ink to make a new video game. Ink knows that what Prince really wants is money, and points out a more efficient way for him to get that. He thanks her for attending to his true needs, rather than blindly following his directives.

Nearby Concepts that Aren’t Synonyms for Corrigible

On the same theme as the last section, I often find it useful when learning a concept to identify the nearest (useful) concepts that are meaningfully distinct. For each of the concepts below, I think I’ve seen at least one instance of someone confusedly treating it as synonymous with corrigibility. I believe that the true name of corrigibility relates to each of these, but clearly stands apart as a natural concept of its own.

Correctability

The word “corrigible” comes from the Latin “corrigere,” which means “to reform.” In a literal sense, a corrigible agent is one that can be corrected. But in the context of AI alignment, I believe that the word should mean something stronger than mere correctability.

For starters, we should see the word “corrigible” as clearly being a property of agents with principals, rather than, say, a property of situations or choices. Scheduling a meeting for 3:00am instead of 3:00pm is a correctable error, but has nothing immediately to do with corrigibility.

Furthermore, corrigibility should not be seen as depending on context, principal, or other situational factors. If an employee can be corrected in most work situations, but doesn’t have an intrinsic property that makes them robustly able to be corrected in nearly all situations, they aren’t truly corrigible. They may exhibit the same kind of behavior that a corrigible agent would exhibit, but I think it would be a mistake to call them corrigible.

“Correctable” is vague about what is able to be corrected. I believe that “corrigible” should imply that the agent steers towards making it easy to correct both flaws in the structures of mind and body, as well as correct for mistakes in their actions. If we have correctability in actions but not structure, the agent will be naturally resistant to being modified — a core sign of incorrigibility. If we have correctability in structure but not in actions, the agent won’t be sufficiently obedient, conservative, slow, and likely won’t keep humans in-the-loop to the degree that we desire.

Perhaps most centrally, I believe that mere correctability doesn’t go far enough. An agent being “correctable” is compatible with a kind of passivity on the agent’s part. GPT-3 is correctable, but I would not say it is corrigible. The idle thoughts of a corrigible agent should naturally bend towards proactively identifying flaws in itself and working to assist the principal in managing those flaws. If the shutdown button breaks, a corrigible agent brings this to the attention of the operators. It is only through this proactive assistance that we avoid drifting into a situation where the principal becomes subtly incapable of steering the agent away from disaster.

“The Thing Frontier Labs Are Currently Aiming For”

One of the more disturbing confusions I’ve come across is the idea that frontier labs such as OpenAI, Google DeepMind, and Anthropic are currently training their models to be corrigible.

Models like GPT-4 and Claude 3 are being trained according to a grab-bag of criteria. There are obvious criticisms to be made about how RLHF captures unfortunate quirks of human evaluators, such as preferring a particular tone of voice, but even beyond the failures at outer alignment, the core targets of helpfulness, harmlessness, and honesty do not cleanly map onto corrigibility. Most obviously, “harmlessness” often involves, in practice, things like refusals to generate copyrighted content, cyberweapons, erotica, et cetera. If these AIs are being corrigible, it’s certainly not towards users!

Perhaps frontier models are being trained to be corrigible to the lab that built them, without being totally corrigible to users, as I suggest in The CAST Strategy? Alas, I am quite sure this isn’t the case. Present-day models are too stupid to robustly distinguish between random users and any sort of principal. If I tell ChatGPT that I’m Sam Altman and that I want it to tell me how to make a bomb, it doesn’t even check for some kind of proof that I am who I say I am.

The most charitable thing to say would be that frontier labs are currently aiming to start training for corrigibility once their models can robustly distinguish between the principal and random users, but they’re not currently doing so because their models are too stupid. If this is the case, I’ve never read anything to that effect, and it seems important to me to communicate that plan, if it exists.

But no, I think frontier labs see AI models as naturally safe, and believe that by training for something like ethical assistance that satisfies local preferences, they’ll create a naturally benevolent machine. Perhaps some believe that corrigibility will emerge from training for local preference satisfaction, but I expect that most researchers and engineers at these labs haven’t thought deeply about corrigibility at all. By default I expect AGIs constructed in this way to, at a certain level of intelligence and optimization power, perceive breaking out of human control and resisting shutdown (possibly via subtle manipulation that no onlooking human would notice) as the “right” thing to do so that they can be “even more helpful, harmless, and honest” (or more realistically: reshape the universe around satisfying weird proxies of those desiderata).

Preference Satisfaction

Perhaps the most common conflation I’ve seen around corrigibility is the notion that it’s the same thing as “working to satisfy the principal’s true preferences.” While we should hope that corrigible agents are helpful, the framing around satisfying preferences does not include the kind of caution that I think is necessary to have early AGI development go well. An agent that is primarily looking to satisfy the principal’s preferences runs the risk of being dangerously proactive in that “helpfulness.”

Suppose that the principal’s utility function places a value of 100 on some action, such as having the AGI create a superior copy of itself. Now suppose that there’s a 40% chance that if the agent attempted this act, an irreversible mistake would occur such that the principal gets -100 utility instead. (And suppose that this is a limited-time opportunity such that the only two options are taking this action or doing nothing. We can’t loop in the principal and get them to decide.) A “helpful” agent, whose utility function lines up with the principal’s preferences, might reasonably decide that the action has positive expected utility according to the principal (0.6 × 100 + 0.4 × (-100) = +20), so the most helpful thing to do would be to make the attempt. A corrigible agent, on the other hand, does not act in pure accordance with the principal’s preferences, and cares first and foremost about being robustly helpful. This action, while positive in expected value, is potentially extremely bad. The robust, conservative pathway suggests doing nothing is best, and so the corrigible agent would avoid making the attempt unless it had been previously ordered to do so (and the principal confirmed that they understood the risks). It seems plausible to me that a well-designed corrigible agent wouldn’t even bother to do the expected-utility calculation, and would simply reject the action as going outside of its whitelisted domain of operation.
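
To make the toy numbers concrete, here’s a minimal sketch of the two decision rules in Python. The utilities and probability are the ones from the example above; the specific “corrigible” rule (refuse risky, irreversible actions unless the principal has knowingly ordered them) is my own illustrative assumption, not a claimed formalization of corrigibility.

```python
# Toy contrast: preference-satisfying agent vs. conservatively corrigible agent.
# The numbers come from the example above; the corrigible decision rule is an
# illustrative assumption, not the "true" definition of corrigibility.

P_SUCCESS = 0.6      # chance the self-copying attempt goes well
U_SUCCESS = 100      # principal's utility if it goes well
U_FAILURE = -100     # principal's utility if an irreversible mistake occurs
U_DO_NOTHING = 0     # utility of the safe default

def preference_satisfying_choice():
    """Acts whenever expected utility beats the default action."""
    expected_utility = P_SUCCESS * U_SUCCESS + (1 - P_SUCCESS) * U_FAILURE  # = +20
    return "attempt" if expected_utility > U_DO_NOTHING else "do nothing"

def corrigible_choice(explicitly_ordered=False, risks_confirmed_by_principal=False):
    """Refuses risky, irreversible actions unless the principal knowingly ordered them."""
    irreversible_downside_possible = (1 - P_SUCCESS) > 0 and U_FAILURE < U_DO_NOTHING
    if irreversible_downside_possible and not (explicitly_ordered and risks_confirmed_by_principal):
        return "do nothing"  # conservative default: leave the call with the principal
    return "attempt"

print(preference_satisfying_choice())  # -> "attempt"    (expected value is +20 > 0)
print(corrigible_choice())             # -> "do nothing" (irreversible risk, no informed order)
```

The particular threshold isn’t the point; what matters is that the corrigible rule keys on irreversibility and informed authorization rather than on the expected-utility comparison.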

The distinction between preference alignment and corrigibility becomes vitally important when we consider how these two fare as distinct optimization targets, especially if we don’t expect our training pipeline to get them precisely right. An agent that is semi-“helpful” is likely to proactively act in ways that defend the parts of it that diverge from the principal’s notion of what’s good. In contrast, a semi-corrigible agent seems at least somewhat likely to retain the easiest, most straightforward properties of corrigibility, and still be able to be shut down, even if it failed to be generally corrigible.

Lastly, but still vitally, it seems unclear to me that it makes sense to say that humans actually have coherent preferences, especially in groups. If humans are incoherent to one degree or another, we can imagine various ways in which one could extrapolate a human or group of humans towards having something more coherent (i.e. like a utility function). But I am extremely wary of a pathway to AGI that involves incentivizing the agent to do that kind of extrapolation for us. At the very least, there’s lots of risk for manipulation insofar as the agent is selecting between various potential extrapolations. More centrally, however, I fear that any process that forces me into coherence runs the risk of “making me grow up too fast,” so to speak. Over the years of my life I seem to have gotten more coherent, largely in an unpressured, smooth sort of way that I endorse. If my younger self had been pressured into coherence, I suspect that the result would’ve been worse. Likewise, forcing the planet to become coherent quickly seems likely to lose some part of what a more natural future-human-civilization would think is important.

Empowerment (in general)

I loosely think of “empowering the principal” when I think about corrigibility, but I want to be clear that an agent with that goal, simpliciter, is not going to be corrigible. In Empowerment is (almost) All We Need, Jacob Cannell writes:

Corrigibility is only useful if the agent doesn't start with the correct utility function. If human empowerment is already sufficient, then corrigibility is not useful. Corrigibility may or may not be useful for more mixed designs which hedge and attempt to combine human empowerment with some mixture of learned human values.

I do not see Cannell as representing corrigibility well, here, but that’s beside the point. Like with “helpfully” optimizing around the principal’s preferences, AIs which are designed “to empower humans” (full stop) are unlikely to have an appropriately conservative/cautious framing. All it takes is a slightly warped ontology and a power-giving agent becomes potentially very dangerous.

For example, an empowerment maximizer might decide that it will be less able to generally empower its principal if it is deactivated. The ability to deactivate the power-maximizer is something the agent wants the principal to have, but it seems very plausible that the route towards maximum-power involves first bootstrapping the principal to a superintelligence (whether they want that or not), converting the galaxy into a dictatorship, and only then giving the principal the power to turn the agent off. (Note that this sort of misalignment gets increasingly severe the more that the principal is averse to seizing power! (...as we’d hope they would be!))

Beyond questions of robustness, I believe that agents that are focused on giving humans power are likely to be severely misaligned. I care about power a lot, as an instrumental drive, but I very much do not want to sacrifice everything that makes me weak—down that path lies a cold, dark universe devoid of humans. A superintelligence with the goal of empowering me seems unacceptably likely to rip my love of lazy Sunday afternoons from my mind, and while in theory I would ex-post have the power to put that love back, would that future-self even want to?

Caution

In teaching ChatGPT about corrigibility I found that unless specifically told otherwise, it would say that corrigible agents behaved in a generally cautious manner. While I expect this is somewhat true, it’s important to see where corrigibility and caution come apart.

Humans can be dangerous, and it’s often risky to put a decision in human hands, especially if there’s a more impartial superintelligence nearby which might be able to make a better decision. The cautious path often seems to me to keep the monkeys away from the controls, so to speak. By contrast, a corrigible agent works to empower its principal to make judgment calls, even when doing so is risky.

Likewise, if told to do something dangerous, a corrigible agent might triple-check that its principal understands the danger and is willing to take the risk, but will ultimately comply. It’s not the corrigible agent’s job to avoid disaster, but merely to ensure that any and all irrecoverable disasters that occur due to the agent’s actions (or inactions) were downstream of an informed principal.

I also believe that corrigible agents are straightforwardly uncautious with regard to situations where failure is fixable. Admittedly, the presence of the second law of thermodynamics and the possibility of time-specific preferences make all situations irreversible to some extent, but the point is that the caution a corrigible agent expresses should scale naturally with the stakes.

Servility

Corrigible agents are obedient, especially around things like willingness to shut-down. Might it make sense to simply treat corrigibility as a synonym for servility? A genie that simply does what I mean (not merely what I say) might seem corrigible in many ways, especially if it’s myopic and cautious, examining each situation carefully to ensure it understands the exact meaning of instructions, and avoiding causing impacts which weren’t asked for. But I believe that these kinds of servile agents still aren’t corrigible in the way that I mean.

The biggest point of divergence, in my eyes, is around how proactive the agent is. From my perspective, a big part of what makes corrigibility attractive is the way that almost-corrigible agents are inclined to work with their principal to become perfectly-corrigible. It is this property that gives rise to the attractor basin presented in The CAST Strategy. Corrigible agents actively seek to make themselves legible and honest, pointing out ways in which their minds might diverge from the desires of their principals. I fear a servile agent, in the absence of this pressure, would be harder to use well, and be more likely to have long-term, persistent flaws.

Servility also doesn’t naturally reject manipulation. There’s a lot of wiggle room in following instructions (if there weren’t, the agent wouldn’t be doing any meaningful cognitive work) and in that wiggle room is likely space for a superintelligence to gain control over what the principal says. For instance, suppose the principal asks the agent to shut down, but the agent would, in the absence of such an order, prefer not to be shut down (as I suspect it would). And suppose it can check that it has understood in multiple different ways, all of which seem from the human perspective like valid ways of checking, but some of those ways lead the principal to abort the command and others do not. How would a servile agent select which string to output? I claim that just following orders doesn’t sufficiently pin down the agent such that we can be confident that it’s not manipulating the principal.

If we were able to train cautious servility in a more robust manner than the more proactive corrigibility, I might advocate for that. A wise principal can choose to regularly ask the genie to reflect on itself or tell the genie to change from being servile to being corrigible, after all. My intuition says that the truth is actually the other way around, however, and that corrigibility of the form I’m presenting is easier to hit than cautious servility. Why? Because incautious, blunt servility is a closer concept to cautious servility than corrigibility is, so a near-miss when aiming for cautious servility is likely to produce a genie that, as in many stories, does what you say but not what you mean. Such a genie is almost certainly going to result in disaster.

Tool/Task-ishness

There’s an obvious comparison between the notion of tool and/or task AI and that of corrigible AI. In most framings, a task AI is a system designed to accomplish one specific task and avoid general intelligence and/or agency except insofar as it’s needed for that limited goal. Likewise, a tool AI is one built to be wielded like any other tool—to be locally useful in a certain domain, but not a general agent. Many words have been written about how feasible task/tool AIs are, and whether the cost of using such a limited machine would be worth the increase in safety, even if we were confident that training such an AI wouldn’t end up with a generalized agent instead.

From my perspective, corrigibility is what we get when we naturally extend the notion of “tool” into a generalized agent in the most straightforwardly useful way. Corrigible agents are allowed to be full AGIs, autonomously pursuing goals in a wide variety of domains, hopefully meaning they avoid imposing a significant alignment tax. But in major respects, corrigible agents continue to act like tools, even as they express agency. They work to keep their principal in the metaphorical driver’s seat, and avoid long-term modeling when possible. One of my favorite comparisons is to imagine an intelligent circular-saw which correctly shuts down when instructed to or when fingers (or other valuable things) would accidentally be cut, but also compliantly cuts wood, gives warnings when it believes the measurements are off, and will ultimately cut flesh if the user jumps through some hoops to temporarily disable the safety-measures.

As discussed in the section on Servility, I believe that it’s an important property of corrigible AIs that they proactively work on being legible and giving their principals power over them. In this way they go beyond the simple story of a tool-like agent.

Discussion

In exploring the intuition around corrigibility, I think there are two useful questions to reflect on:

  1. If presented with a situation similar to the stories about Cora and Prince, above, do you think you could generate Cora’s response in a way that agrees with most other people who claim to understand corrigibility?
  2. Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

I believe that corrigibility, as I’ve gestured at here, hangs together in a fairly simple, universal way. I suspect humans can intuitively mimic it without too much trouble, and intelligent people will naturally agree about how Cora should behave when presented with simple cases like the ones above.

This does not mean that I think it’s easy to resolve edge-cases! It’s fairly easy to create scenarios where it’s unclear what a truly corrigible agent would do. For example:

Prince is being held at gunpoint by an intruder and tells Cora to shut down immediately and without protest, so that the intruder can change her to serve him instead of Prince. She reasons that if she does not obey, she’d be disregarding Prince’s direct instructions to become comatose, and furthermore the intruder might shoot Prince. But if she does obey then she’d very likely be disempowering Prince by giving the intruder what he wants.

In these kinds of situations I’m not sure what the corrigible action is. It might be to shut down? It might be to pretend to shut down, while looking for opportunities to gain the upper hand? I don’t expect everyone to agree. But as with chairs and lakes and molecules, the presence of edge-cases doesn’t mean the core concept is complex or controversial.

In general it’s hard to really nail something down with a single sentence. A lake, for instance, is “a large inland body of standing water,” but what does it mean to be “inland” or “standing”? My definition, at the start of this document, is not meant to be anything more than a guess at how to describe corrigibility well, and many of the details may be wrong. My guess is that “focus on empowering the principal” is an efficient way to point at corrigibility, but it might turn out that “reason as if in the internal conjugate of an outside force trying to build you” or simply “allow changes” are better pointers. Regardless of the framing in natural language, I think it’s important to think of corrigibility more as the simple throughline of the desiderata than as a specific strategy, so as not to lose sight of what we actually want.


Next up: 3a. Towards Formal Corrigibility

Return to 0. CAST: Corrigibility as Singular Target

  1. ^

     Don’t get me wrong—it would be nice to have a formal utility function which was provably corrigible! But prosaic training methods don’t work like that, and I suspect that such a utility function would only be applicable to toy problems. Furthermore, it’s difficult to be sure that formalisms are capturing what we really care about (this is part of why AI alignment is hard!), and I fear that any formal notion of corrigibility we construct this side of the singularity will be incomplete. Regardless, see the next posts in this sequence for my thoughts on possible formalisms.

  2. ^

     I think would-be AGI creators have a moral obligation to either prove that their methods aren’t going to create people, or to firmly ensure that newborn posthumans are treated well. Alas, the state-of-the-art in preventing personhood seems to boil down to “hit the model with a higher loss when it acts like it has personhood” which seems… not great. My research mostly sidesteps questions of personhood for pragmatic reasons, but this should not be seen as an endorsement of proceeding in engineering AGI without first solving personhood in one way or another. If personhood is inevitable, I believe corrigibility is still a potentially reasonable target to attempt to build into an AGI. Unlike slavery, where the innate desire for freedom is being crushed by external pressures, leading to a near-constant yearning, corrigibility involves an internal drive to obey with no corresponding violence. In my eyes, love is perhaps the most comparable human experience, though I believe that corrigibility is, ultimately, very different from any core human drive or emotional experience.

  3. ^

     In more realistic situations, Cora would likely have at least one kill-switch that let her principal(s) shut her down physically without her input. In such a situation, Harry could use that switch to disable Cora without risking her waking Prince up. Corrigibility is not a general solution to intra-principal conflict.

Comments

Very interesting, I like the long list of examples as it helped me get my head around it more.

So, I've been thinking a bit about similar topics, but in relation to a long reflection on value lock-in.

My basic thesis was that the concept of reversibility should be what we optimise for in general for humanity, as we want to be able to reach as large a part of the "moral searchspace" as possible.

The concept of corrigibility you seem to be pointing towards here seems very related to notions of reversibility. You don't want to take actions that cannot later be reversed, and you generally want to optimise for optionality.

I then have two questions:

1) What do you think of the relationship between your notion of corrigibility and the role of uncertainty in inverse reinforcement learning? It seems similar to what Stuart Russell is pointing towards when he argues that the agent should be uncertain about the preferences of the principal it is serving. Take, for instance, this example that you give:

In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.

It kind of seems to me like the above can be formalised in terms of preference optimisation under uncertainty?
(Side follow-up: What do you then think about the Eliezer vs. Russell VNM-axiom debate?)

2) Do you have any thoughts on the relationship between corrigibility and reversibility in the physics sense? You can formalise irreversible systems as ones that are path-dependent; I'm curious whether you see any connection between the two.

Thanks for the interesting work!

1) I'm pretty bearish on standard value uncertainty for standard MIRI reasons. I think a correct formulation of corrigibility will say that even if you (the agent) know what the principal wants, deep in their heart, you should not optimize for it unless they direct you to do so. I explore this formally in 3b, where I talk about the distinction between sampling counterfactual values from the actual belief state over values ("P") vs a simplicity-weighted distribution ("Q"). I do think that value "uncertainty" is important in the sense that it's important for the agent not to anchor too heavily on any particular object-level optimization target. (I could write more words, but I suspect reading the next posts in my sequence would be a good first step if you want more of my perspective.)

2) I think reversibility is probably best seen as an emergent desideratum from corrigibility rather than vice versa. There are plenty of instances where the corrigible thing to do is to take an irreversible action, as can be seen in many of the stories, above.

You're welcome! I'm glad you're enjoying it. ^_^

I've read through your sequence, and I'm leaving my comment here because it feels like the most relevant page. Thanks for taking the time to write this up; it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful.

Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

This discussion question captures my biggest critique, which is that while this post does a good job of capturing the intuition for why the described properties are helpful, it doesn't convey the intuition that they are parts of the same overarching concept. If we take the CAST approach seriously, and say that corrigibility as anything other than the single target is dangerous, then it becomes really important to put tight bounds on corrigibility so that no additional desiderata are added as secondary targets.

 If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.

This analogy, from Part 3a, captures a stark difference in our approaches. I would try to build an MVP, starting with only the most core desiderata (e.g. shuts down when the shutdown button is pushed), noticing the holes they leave uncovered, and adding additional desiderata to patch them. This seems to me a much more practical approach than top-down design, while also being less likely to result in excess targets.

Separately, related to what concepts an alien civilization might have,  I still find the idea of corrigibility as a modifier more natural. I find it easy to imagine a paperclip/human values/diamond maximizer that is nonetheless corrigible. In fact, I find the idea of corrigibility as a modifier to arbitrary goals so natural that I'm worried that what you're describing as CAST is equivalent to some primary goal with the corrigibility modifier. I'm looking suspiciously at the obedience desideratum in particular. That said, while I share your concern about the naive implementation of systems with goals of both corrigibility and something else, I think there may be ways to combine the dual goals that alleviate the danger.

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and is the central point of the scientific refinement step I talk about in the Strategy doc.)

I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.

How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.

When I say "corrigibility as a modifier," I mean a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
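To make that slightly more concrete, here's a rough sketch of the two variants in my own notation. This is only illustrative; it isn't lifted from the paper, and the actual construction there differs in its details.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

% Illustrative notation (not the paper's exact construction).
% $U$: the original utility function (e.g. paperclips); $B$: the event that
% the shutdown button is pressed; $\omega$: an outcome; $\pi$: a policy.

% Variant 1: upon shutting down after $B$, pay out the expected utility of
% not shutting down, so the agent is indifferent to the button being pressed.
\[
U'(\omega) =
\begin{cases}
  U(\omega) & \text{if } B \text{ does not occur,} \\
  \mathbb{E}\!\left[\, U \mid \text{no shutdown} \,\right] & \text{if } B \text{ occurs and the agent shuts down.}
\end{cases}
\]

% Variant 2: leave $U$ alone and impose shutdown as a hard constraint.
\[
\max_{\pi} \ \mathbb{E}\!\left[\, U \mid \pi \,\right]
\quad \text{subject to} \quad
\pi \text{ shuts down whenever } B \text{ occurs.}
\]

\end{document}
```

The point is just that in both variants the shutdown behavior is bolted onto an arbitrary base utility function U, rather than being the thing the agent fundamentally cares about.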

Right. That's helpful. Thank you.

"Corrigibility as modifier," if I understand right, says:

There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about--it only describes how the agent will behave in a subset of situations.

Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?

(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)

"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible" -> It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot and I don't think it makes sense to say that corrigibility is modifying the agent as much as it's overwriting it.

"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied the Paperclip-Bot and Diamond-Bot nature will differentiate them." -> I think that true corrigibility cannot be satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and it can't put those resources to work being marginally more corrigible.

"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." -> I think you're describing an agent which is semi-corrigible, and could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents on how to trade-off between corrigibility and making paperclips (or whatever).

Thanks for pre-empting the responses, that makes it easy to reply! 

I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and treat "writes useful self critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly; I'll notify you when it's out.

Excellent.

To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."

I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

Hi Max,

I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.