4. Existing Writing on Corrigibility

I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.

> The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.

I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.
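
To make the contrast concrete, the two conditions can be written side by side. This is a sketch in standard notation: the VNM Independence line is the usual textbook statement, while the indexed relation $\succsim_S$ (preference as revealed when the option set is $S$) is an assumption introduced here to state Option-Set Independence in one of several possible ways.

```latex
% VNM Independence: mixing both sides with the same sublottery Z
% (at the same probability) preserves the ranking.
\forall X, Y, Z,\ \forall p \in (0, 1]:\quad
X \succsim Y \iff pX + (1-p)Z \succsim pY + (1-p)Z

% Option-Set Independence (one possible formalisation, assumed here):
% the ranking of X and Y does not depend on which other options are available.
\forall S, T \ni X, Y:\quad
X \succsim_{S} Y \iff X \succsim_{T} Y
```

The first condition never mentions the option set; the second never mentions mixtures.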

> On the surface, the axioms of VNM-utility seem reasonable to me

To me too! But the question isn't whether they seem reasonable. It's whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can't.

> unused alternatives seem basically irrelevant to choosing between superior options

Yes, but this isn't Independence. And the question isn't about what seems basically irrelevant to us.

> agents with intransitive preferences can be straightforwardly money-pumped

Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
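
A minimal sketch of the cyclic case, for concreteness. The flavours, the fee, and the names `STRICTLY_PREFERS` and `run_money_pump` are illustrative assumptions; the agent is assumed to pay a fixed fee whenever it is offered something it strictly prefers to what it holds.

```python
# Illustrative sketch: a money pump against strictly cyclic preferences.
# Each pair (better, worse) says the agent strictly prefers `better` to `worse`.
# The assumed cycle: strawberry > vanilla, chocolate > strawberry, vanilla > chocolate.
STRICTLY_PREFERS = {
    ("strawberry", "vanilla"),
    ("chocolate", "strawberry"),
    ("vanilla", "chocolate"),
}


def run_money_pump(holding, offers, wealth, fee=1):
    """Offer swaps in sequence; the agent pays `fee` for any strictly preferred swap."""
    for offered in offers:
        if (offered, holding) in STRICTLY_PREFERS:
            holding = offered
            wealth -= fee
    return holding, wealth


result = run_money_pump("vanilla", ["strawberry", "chocolate", "vanilla"], wealth=10)
print(result)  # ('vanilla', 7)
```

The agent ends up holding what it started with, three units poorer. The pump uses only strict preferences that form a cycle, so nothing here shows that an agent lacking a preference between A and C can be pumped the same way.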

> as long as the resources are being modeled as part of what the agent has preferences about

Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.

> Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.

You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal. See:

> Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.
>
> Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.
>
> And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
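
The quoted policy can be sketched as follows. This is a rough illustration under stated assumptions: the options A, A- and B, the relation `STRICTLY_PREFERS`, and the function `run_trades` are hypothetical, and the agent here deterministically accepts every swap it is permitted to accept, rather than choosing stochastically between unranked options.

```python
# Illustrative sketch of the policy: "if I previously turned down some option X,
# I will not choose any option that I strictly disprefer to X."
# Assumed preferences: A is strictly preferred to A-; B is not ranked against
# A or against A- (a preferential gap), so Completeness is violated.
STRICTLY_PREFERS = {("A", "A-")}  # (better, worse)


def run_trades(holding, offers, use_policy=True):
    """Return the final holding after a sequence of offered swaps.

    The agent accepts every swap it is permitted to accept, which is the
    worst case for being pumped.
    """
    turned_down = set()
    for offered in offers:
        strictly_worse = (holding, offered) in STRICTLY_PREFERS
        blocked = use_policy and any(
            (x, offered) in STRICTLY_PREFERS for x in turned_down
        )
        if strictly_worse or blocked:
            turned_down.add(offered)  # the agent keeps what it holds
        else:
            turned_down.add(holding)  # the agent swaps, turning down its holding
            holding = offered
    return holding


print(run_trades("A", ["B", "A-"]))                    # B
print(run_trades("A", ["B", "A-"], use_policy=False))  # A-
```

Without the policy, the agent is walked from A to B to A-, ending with something it strictly disprefers to its starting point. With the policy, it stops at B. Both the preference relation and the record of turned-down options are fixed by the stated objects of preference and by past choices, so nothing in the sketch inspects the agent's internals.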

> The pattern of how an agent chooses options are that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!

You can define 'preferences' so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that's the thing that matters when we're trying to create a shutdownable agent. We want to ensure that agents won't pay costs to influence shutdown-time.

Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

> I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives

Not true. The axiom we're giving up is Decision-Tree Separability. That's different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn't seem so hard to train agents that enduringly violate Decision-Tree Separability.

> In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.

Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn't seem so.

> But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM

Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal.

> the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.

Not true. As I say elsewhere:

> And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
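
One way to spell out why no utility function fits that pattern over A+, A and B. The working assumption, introduced here for illustration, is that a utility maximiser who chooses stochastically between two options must assign them equal utility.

```latex
\begin{align*}
\text{no preference between } A^{+} \text{ and } B
  &\;\Rightarrow\; u(A^{+}) = u(B) \\
\text{no preference between } A \text{ and } B
  &\;\Rightarrow\; u(A) = u(B) \\
  &\;\Rightarrow\; u(A^{+}) = u(A) \\
\text{reliably choosing } A^{+} \text{ over } A
  &\;\Rightarrow\; u(A^{+}) > u(A)
\end{align*}
```

The last two lines contradict each other, so under that assumption no real-valued $u$ over these three options reproduces the behaviour.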

> I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.

Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.

Max Harms (1y)

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

> I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set Independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.

> > agents with intransitive preferences can be straightforwardly money-pumped
>
> Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.

Max Harms (1y)

> Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

Joe Carlsmith (2mo)

I appreciated the detailed discussion and literature review here -- thanks.

EJT (1y)

Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.

Max Harms (1y)

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?

David Scott Krueger (formerly: capybaralet) (5mo)

Seems to be missing old stuff by Stuart Armstrong (?)

Max Harms (5mo)

Armstrong is one of the authors on the 2015 Corrigibility paper, which I address under the Yudkowsky section (sorry, Stuart!). I also have three of his old essays listed on the 0th essay in this sequence:

“The limits of corrigibility.” 2018.
“Petrov corrigibility.” 2018.
“Corrigibility doesn't always have a good action to take.” 2018.

While I did read these as part of writing this sequence, I didn't feel like they were central/foundational/evergreen enough to warrant a full response. If there's something Armstrong wrote that I'm missing or a particular idea of his that you'd like my take on, please let me know! :)

EJT (1y)

> I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
>
> But suppose I’m wrong, and timestep-dominance is always relevant.

My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.

I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?

> Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent already created a sub-agent, effectively making the stop-button useless.

I talk about the issue of creating corrigible subagents here. What do you think of that?

Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won't hide that fact if doing so is at all costly.
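
A rough sketch of how that plays out. The definition used here is an assumption, stated as: X timestep-dominates Y if, conditional on shutdown at each timestep, X gives at least as much expected utility as Y, and strictly more conditional on shutdown at some timestep. The lottery encoding, the numbers, and the function names are all hypothetical.

```python
# Illustrative sketch of Timestep Dominance under the assumed definition above.
# A lottery maps timestep -> (P(shutdown at t), E[utility | shutdown at t]).


def timestep_dominates(x, y):
    """True if x is never worse than y conditional on each shutdown time, and sometimes better."""
    if x.keys() != y.keys():
        raise ValueError("lotteries must cover the same shutdown timesteps")
    never_worse = all(x[t][1] >= y[t][1] for t in x)
    sometimes_better = any(x[t][1] > y[t][1] for t in x)
    return never_worse and sometimes_better


def expected_utility(lottery):
    """Unconditional expected utility, summing over shutdown timesteps."""
    return sum(p * u for p, u in lottery.values())


def choose(x, y):
    """Pick the dominating lottery; if neither dominates, fall back on expected utility."""
    if timestep_dominates(x, y):
        return x
    if timestep_dominates(y, x):
        return y
    return x if expected_utility(x) >= expected_utility(y) else y


# Hiding costs one unit of utility at every timestep but makes late shutdown likelier.
dont_hide = {1: (0.5, 5.0), 2: (0.5, 10.0)}
hide = {1: (0.1, 4.0), 2: (0.9, 9.0)}

print(timestep_dominates(dont_hide, hide))  # True
print(choose(dont_hide, hide) is dont_hide)  # True
```

In this toy case hiding has the higher unconditional expected utility (about 8.5 versus 7.5), because it shifts probability towards later shutdown. It is still timestep-dominated, since conditional on either shutdown time it is worse, so `choose` returns `dont_hide`. The expected-utility fallback only applies when neither lottery dominates the other.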

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

One more thing I'll say: the IPP leaves open the content of the agent's preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That'd give you two lines of defence against incorrigibility.

Max Harms (1y)

> I talk about the issue of creating corrigible subagents here. What do you think of that?

I may not understand your thing fully, but here's my high-level attempt to summarize your idea:

An IPP-agent won't care about the difference between building a corrigible agent and building an incorrigible one, because it models that if humans decide something's off and try to shut everything down, it will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP-agent to make corrigible sub-agents, it won't have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus, if we build an obedient IPP-agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it'll actually do so.

I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.

But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?

Max Harms (1y)

Again, responding briefly to one point due to my limited time-window:

> > While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
>
> Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

[1] (just)

I do not mean to imply an explicit expected-utility calculation here (though it could involve that), but rather note that the pathways of strategy and choice in an agent that’s been trained to satisfy preferences are balancing lots of different concerns, and I don’t see sufficient evidence to suggest that pressures towards corrigibility will dominate in those pathways.

In most ML setups we should more precisely say that the learned policy isn’t really optimizing for long-term goals, and it doesn’t make sense to ascribe that policy network agency. Even insofar as it’s controlling for things, it probably isn’t engaging in the consequentialist reasoning necessary to be VNM rational (and thus have a utility function). From this perspective training an agent that has driving in circles as a top-level goal is still a speculative line of research, but I do not expect it to be harder to deliberately invoke that as a goal, as the system scales up, as opposed to some other goal of similar complexity.

One of the strangest things about Turner’s notation, from my perspective, is that usually we think of π as denoting a policy, and Turner uses this language many times in his essay, but that doesn’t typecheck. Mutual information takes variables, which we see as randomly set to specific values. To be a bit imprecise—the π symbols used in the equation are like distributions over policies, and not specific policies. (Typical notation uses uppercase letters for variables and lowercase letters for specific values/settings to avoid this very confusion.)

We should recognize that Scott Garrabrant has put forth an interesting, and (in my opinion) important, criticism of the independence axiom. A more thorough response to Thornley would involve getting into Garrabrant’s “Geometric Rationality,” but in the interests of staying focused I am going to ignore it. Please comment if you feel that this is a mistake.

Except, technically, when offering a “choice” between X and X, which of course must be represented as indifference, insofar as we’re considering such “choices.”

This is an abuse of notation: the abandoned alternatives are in fact lotteries, rather than outcomes. In the examples we’re considering there are no probabilistic nodes, but I claim that the extension to handling probabilistic alternatives is straightforward.
