Max Harms


CAST: Corrigibility As Singular Target



Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.


> Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

> I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set Independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferred options out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.
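To make the "fixed opinion about A vs B regardless of C" point concrete, here is a minimal sketch contrasting two choosers. Everything in it is an illustrative assumption rather than part of the original discussion: the utility numbers, the function names, and the "polite" rule (always take the second-best option) are all made up for the example.

```python
# Hypothetical utilities over three lotteries; the numbers are illustrative only.
UTILITY = {"A": 2.0, "B": 1.0, "C": 3.0}


def fixed_preference_choice(menu):
    """Pick the option with the highest fixed utility, whatever else is on the menu."""
    return max(menu, key=lambda option: UTILITY[option])


def polite_choice(menu):
    """A made-up menu-dependent rule: take the second-best option when there is one."""
    ranked = sorted(menu, key=lambda option: UTILITY[option], reverse=True)
    return ranked[1] if len(ranked) > 1 else ranked[0]


small_menu = ["A", "B"]
large_menu = ["A", "B", "C"]

# The fixed-preference chooser never picks B when A is available.
fixed_small = fixed_preference_choice(small_menu)
fixed_large = fixed_preference_choice(large_menu)

# The polite chooser picks B over A on the small menu, but A over B once C is added.
polite_small = polite_choice(small_menu)
polite_large = polite_choice(large_menu)
```

The second chooser's behavior toward A and B flips depending on whether C is available, which is the kind of option-set sensitivity that a single fixed ranking over lotteries cannot represent.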

> agents with intransitive preferences can be straightforwardly money-pumped

> Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.
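As a rough illustration of the difference, here is a small sketch of a naive trading protocol. The protocol, the fee, and all names are assumptions made for the example: the agent pays a fixed fee whenever it is offered something it strictly prefers to what it holds, and it is never forced to choose between options it has no strict preference between.

```python
def run_money_pump(prefers, start, offers, fee=1):
    """Offer a sequence of swaps; the agent pays `fee` for each swap it strictly prefers.

    `prefers(x, y)` returns True when the agent strictly prefers x to y.
    Returns (final_holding, total_paid).
    """
    holding, paid = start, 0
    for offered in offers:
        if prefers(offered, holding):
            holding = offered
            paid += fee
    return holding, paid


def make_prefers(relation):
    """Build a strict-preference test from a set of (better, worse) pairs."""
    return lambda x, y: (x, y) in relation


# Cyclic preferences: A > B, B > C, C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

# Intransitive but acyclic: A > B, B > C, and no strict preference between A and C.
acyclic = {("A", "B"), ("B", "C")}

# Starting from C, repeatedly offer B, then A, then C.
offers = ["B", "A", "C"] * 3

cyclic_result = run_money_pump(make_prefers(cyclic), "C", offers)
acyclic_result = run_money_pump(make_prefers(acyclic), "C", offers)
```

Under these assumptions the cyclic agent ends up holding C again after paying on every offer, while the acyclic agent trades up to A and then declines everything else. Pumping the second agent would need some extra mechanism that forces a choice between A and C and does not reliably land on A.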

That matches my sense of things.

To distinguish corrigibility from DWIM in a similar sort of way:

> Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute--your mind is free to think about a variety of things. You decide to think about ___.

I'm honestly not sure what "DWIM" does here. Perhaps it doesn't think? Perhaps it keeps checking over and over again that it's doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I'll loop in Seth Herd, in case he has a good answer.)

More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on DWIM, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic, producing e.g. agents that subtly manipulate their principals in the process of being obedient.

My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. Building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire building nanomachines.

I do think that the goals of "want what the principal wants" or "help the principal get what they want" are simpler goals than "maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal]." While they point to similar things, training the pointer is easier in the sense that it's up to the fully-intelligent agent to determine the balance and nature of the principal's values, rather than having to load that complexity up-front in the training process. And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.

Is corrigibility simpler or more complex than these kinds of indirect/meta goals? I'm not sure. But both of these indirect goals are fragile, and probably lethal in practice.

An AI that wants to want what the principal wants may wipe out humanity if given the opportunity, as long as the principal's brainstate is saved in the process. That action ensures it is free to accomplish its goal at its leisure (whereas if the humans shut it down, then it will never come to want what the principal wants).

An AI that wants to help the principal get what they want won't (immediately) wipe out humanity, because it might turn out that doing so is against the principal's desires. But such an agent might take actions which manipulate the principal (perhaps physically) into having easy-to-satisfy desires (e.g. paperclips).

So suppose we do a less naive thing and try to train a goal like "help the principal get what they want, but in a natural sort of way that doesn't involve manipulating them to want different things." Well, there are still a few potential issues, such as whether the agent is sufficiently robust and conservative that flaws in the training process don't persist or magnify over time. And as we walk down this path I think we either just get to corrigibility or we get to something significantly more complicated.

I agree that you should be skeptical of a story of "we'll just gradually expose the agent to new environments and therefore it'll be safe/corrigible/etc." CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there's a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the "attractor basin" hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but it's not sufficient.

Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want:

> Alice, the principal, writes on her blog that she loves ice cream. When she's sad, she often eats ice cream and feels better afterwards. On her blog she writes that eating ice cream is what she likes to do to cheer herself up. On Wednesday Alice is sad. She sends you, her agent, to the store to buy groceries (not ice cream, for whatever reason). There's a sale at the store, meaning you unexpectedly have money that had been budgeted for groceries left over. Your sense of Alice is that she would want you to get ice cream with the extra money if she were there. You decide to ___.

What does a corrigibility-centric training process point to as the "correct" completion? Does this differ from a training process that tries to get full alignment?

(I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)


To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."

I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

Right. That's helpful. Thank you.

"Corrigibility as modifier," if I understand right, says:

> There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about--it only describes how the agent will behave in a subset of situations.

Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?

(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)

"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible" -> It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot and I don't think it makes sense to say that corrigibility is modifying the agent as much as it's overwriting it.

"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied the Paperclip-Bot and Diamond-Bot nature will differentiate them." -> I think that true corrigibility cannot be satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and it can't put those resources to work being marginally more corrigible.

"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." -> I think you're describing an agent which is semi-corrigible, and could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents about how to trade off between corrigibility and making diamonds (or whatever).

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at and I'll shoot them back to you that way?

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and the central point of the scientific refinement step I talk about in the Strategy doc.)

I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.

How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.
