If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms5d10

Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.

It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex^[1] than the gridworld and cannot be solved by brute force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state."

If caring about historical facts is easy and common, why is it important to split this off and distinguish it?

[1] Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.

4. Existing Writing on Corrigibility

Max Harms7d10

I talk about the issue of creating corrigible subagents here. What do you think of that?

I may not understand your thing fully, but here's my high-level attempt to summarize your idea:

IPP-agents won't care about the difference between building a corrigible agent vs an incorrigible agent because they model that if humans decide something's off and try to shut everything down, they will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won't have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it'll actually do so.

I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.

But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms7d30

In the Corrigibility (2015) paper, one of the desiderata is:

> (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

I think you may have made an error in not listing this one in your numbered list for the relevant section.

Additionally, do you think that non-manipulation is a part of corrigibility, do you think it's part of safe exploration, or do you think it's a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms7d10

Follow-up question, assuming anti-natural goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (Manhattan distance) of a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
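To make the gridworld case concrete, here is a minimal sketch of the kind of scoring I have in mind. Everything in it is an assumption for illustration: the names (`trajectory_score`, `flag_penalty`), the choice to count a step at distance exactly 5 as a violation, and the idea that an "end state" records only the agent's final position and which apples remain.

```python
from typing import List, Set, Tuple

Pos = Tuple[int, int]

def manhattan(a: Pos, b: Pos) -> int:
    """Manhattan (taxicab) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def trajectory_score(path: List[Pos], flag: Pos, apples: Set[Pos],
                     radius: int = 5, flag_penalty: float = 100.0) -> float:
    """Score a whole path, not just its last cell (hypothetical scoring rule).

    Each step spent within `radius` of the flag costs `flag_penalty`;
    each distinct apple cell visited is worth 1.
    """
    violations = sum(1 for p in path if manhattan(p, flag) <= radius)
    eaten = len(apples & set(path))
    return eaten - flag_penalty * violations

flag = (0, 0)
apples = {(10, 0)}

# Both paths end on the same cell having eaten the same apple,
# so the final position and remaining apples are identical.
safe_path = [(8, 0), (9, 0), (10, 0)]
risky_path = [(8, 0), (7, 0), (6, 0), (5, 0), (6, 0),
              (7, 0), (8, 0), (9, 0), (10, 0)]

safe_score = trajectory_score(safe_path, flag, apples)
risky_score = trajectory_score(risky_path, flag, apples)
```

Under these assumptions the two paths share an end state but get different scores, which is the sense in which the goal depends on historical facts. It is also only a few lines of plain programming.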

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms7d10

I'm curious what you mean by "anti-natural." You write:

> Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.

My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality as being more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.

If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?

4. Existing Writing on Corrigibility

Max Harms24d20

Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not^[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not^[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not^[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

[1] (just)

4. Existing Writing on Corrigibility

Max Harms24d10

> Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

4. Existing Writing on Corrigibility

Max Harms24d20

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

> I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set Independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.
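As a rough illustration of the distinction (not a formal statement of either axiom), here is a sketch that uses plain options in place of lotteries. The function names and the particular consistency check are my own assumptions: the check flags any chooser whose pick from a larger menu gets rejected from a smaller menu that still contains it.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable

Chooser = Callable[[FrozenSet[str]], str]

def utility_chooser(utility: Dict[str, float]) -> Chooser:
    """Chooser backed by a fixed ranking: always picks the highest-utility option."""
    return lambda menu: max(menu, key=lambda option: utility[option])

def menu_dependent_chooser(menu: FrozenSet[str]) -> str:
    """Hypothetical chooser whose A-versus-B verdict flips when C is available."""
    if "B" in menu and "C" in menu:
        return "B"
    return sorted(menu)[0]

def violates_option_set_independence(choose: Chooser, options: Iterable[str]) -> bool:
    """True if a pick from a big menu is rejected from a sub-menu that contains it."""
    options = list(options)
    menus = [frozenset(c) for r in range(1, len(options) + 1)
             for c in combinations(options, r)]
    for big in menus:
        pick = choose(big)
        for small in menus:
            if small < big and pick in small and choose(small) != pick:
                return True
    return False

fixed = utility_chooser({"A": 3.0, "B": 2.0, "C": 1.0})
fixed_violates = violates_option_set_independence(fixed, "ABC")
menu_violates = violates_option_set_independence(menu_dependent_chooser, "ABC")
```

The fixed-ranking chooser has one opinion about A versus B no matter what else is on offer, while the menu-dependent one picks A from {A, B} but B from {A, B, C}.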

> agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.
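Here is a toy sketch of the difference, under assumptions of my own: the agent pays a fixed fee for any swap to an option it strictly prefers to what it holds, and refuses every other swap. The names (`run_money_pump`, `CYCLIC`, `INCOMPLETE`) are hypothetical.

```python
from typing import List, Set, Tuple

Preference = Tuple[str, str]  # (x, y) means "x is strictly preferred to y"

# Cyclic preferences: A > B, B > C, C > A.
CYCLIC: Set[Preference] = {("A", "B"), ("B", "C"), ("C", "A")}

# Intransitive but acyclic: A > B, B > C, and no opinion on A versus C.
INCOMPLETE: Set[Preference] = {("A", "B"), ("B", "C")}

def run_money_pump(prefers: Set[Preference], start: str,
                   offers: List[str], fee: float) -> Tuple[str, float]:
    """Offer swaps in sequence; the agent pays `fee` for each swap it accepts."""
    held, paid = start, 0.0
    for offered in offers:
        # Assumption: the agent only swaps when it strictly prefers the offer.
        if (offered, held) in prefers:
            held = offered
            paid += fee
    return held, paid

offers = ["C", "B", "A"]
cyclic_result = run_money_pump(CYCLIC, "A", offers, fee=1.0)
incomplete_result = run_money_pump(INCOMPLETE, "A", offers, fee=1.0)
```

The cyclic agent ends up holding A again after paying three fees. The agent with no opinion on A versus C declines the first swap, so this simple pump stalls unless something forces it to choose between A and C.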

1. The CAST Strategy

Max Harms24d40

That matches my sense of things.

To distinguish corrigibility from DWIM in a similar sort of way:

Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute--your mind is free to think about a variety of things. You decide to think about ___.

I'm honestly not sure what "DWIM" does here. Perhaps it doesn't think? Perhaps it keeps checking over and over again that it's doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I'll loop in Seth Herd, in case he has a good answer.)

More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on DWIM, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic by producing e.g. agents that subtly manipulate their principals in the process of being obedient.