Background 1: Preferences-over-future-states (a.k.a. consequentialism) vs other kinds of preferences
(Note: The original version of this post said "preferences over trajectories" all over the place. Commenters were confused about what I meant by that, so I have switched the terminology to "any other kind of preference" which is hopefully clearer.)
The post Coherent decisions imply consistent utilities (Eliezer Yudkowsky, 2017) explains how, if an agent has preferences over future states of the world, they should act like a utility-maximizer (with utility function defined over future states of the world). If they don’t act that way, they will be less effective at satisfying their own preferences; they would be “leaving money on the table” by their own reckoning. And there are externally-visible signs of agents being suboptimal in that sense; I'll go over an example in a second.
By contrast, the post Coherence arguments do not entail goal-directed behavior (Rohin Shah, 2018) notes that, if an agent has preferences over universe-histories, and acts optimally with respect to those preferences (acts as a utility-maximizer whose utility function is defined over universe-histories), then they can display any external behavior whatsoever. In other words, there's no externally-visible behavioral pattern which we can point to and say "That's a sure sign that this agent is behaving suboptimally, with respect to their own preferences."
For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, they’ll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, they’ll happily pay $0.01 to trade it for an onion pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. “getting money-pumped”).
That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and they’re making these decisions based on those preferences. But what if they don’t? What if the person has no preference whatsoever about pizza? What if instead they’re an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldn’t describe it as self-defeating “getting money-pumped”; instead we would describe it as the skillful satisfaction of the person’s own preferences! They’re buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.
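To make the contrast concrete, here is a minimal Python sketch that scores the same sequence of trades under two different utility functions. Everything in it is an illustrative assumption rather than something from either post: the three-pizza trade cycle encoded in `NEXT_PIZZA`, the 100-cent starting budget, the pizza values, and the `joy_per_trip` figure (the waiter-tormenting customer's enjoyment, expressed in cents).

```python
from dataclasses import dataclass

# Assumed trade cycle (a hypothetical encoding of the restaurant example).
NEXT_PIZZA = {"onion": "pineapple", "pineapple": "mushroom", "mushroom": "onion"}
FEE_CENTS = 1  # cost of each trade, in cents


@dataclass(frozen=True)
class State:
    pizza: str
    cents: int


def run_trades(start: State, n_trades: int) -> list[State]:
    """Return the full trajectory of states, including the starting state."""
    trajectory = [start]
    for _ in range(n_trades):
        current = trajectory[-1]
        trajectory.append(State(NEXT_PIZZA[current.pizza], current.cents - FEE_CENTS))
    return trajectory


def utility_future_state(trajectory: list[State], pizza_value: dict[str, int]) -> int:
    """Depends only on the final state: the pizza eaten and the money left."""
    final = trajectory[-1]
    return pizza_value[final.pizza] + final.cents


def utility_waiter_trips(trajectory: list[State], joy_per_trip: int = 5) -> int:
    """Depends on what happened along the way: every trade is a waiter trip."""
    trips = len(trajectory) - 1
    return joy_per_trip * trips + trajectory[-1].cents


start = State("onion", 100)
pizza_value = {"onion": 10, "pineapple": 10, "mushroom": 10}  # hypothetical values
no_trades = run_trades(start, 0)
three_trades = run_trades(start, 3)  # one full lap around the cycle
```

Under `utility_future_state`, the full lap is strictly worse than not trading (same pizza, less money). Under `utility_waiter_trips`, the identical behavior scores higher than not trading, so nothing about the behavior alone tells you the person is being money-pumped.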
To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.
(Edit to add: There are very good reasons to expect future powerful AGIs to act according to preferences over distant-future states, and I join Eliezer in roundly criticizing people who think we can build an AGI that never does that; see this comment for discussion.)
Background 2: Corrigibility is a square peg, preferences-over-future-states is a round hole
A “corrigible” AI is an AI that you can shut off (or more generally, whose goals you can change) without it trying to stop you. It also doesn’t deactivate its own shutoff switch, and it even fixes the switch if it breaks. Nor does it have preferences in the opposite direction: it doesn’t try to press the switch itself, and it doesn’t try to persuade you to press the switch. (Note: I’m using the term “corrigible” here in the narrow MIRI sense, not the stronger and vaguer Paul Christiano sense.)
As far as I understand, there was some work in the 2010s on trying to construct a utility function (over future states) that would result in an AI with all those properties. This is not an easy problem. In fact, it’s not even clear that it’s possible! See Nate Soares’s Google talk in 2017 for a user-friendly introduction to this subfield, referencing two papers (1,2). The latter, from 2015, has some technical details, and includes a discussion of Stuart Armstrong’s “indifference” method. I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent, i.e., if the off-switch breaks, it wouldn't fix it), and the problem remains open to this day.
(Edit to add: A commenter points out that the "indifference" method uses a utility function that is not over future states. Uncoincidentally, one of the advantages of preferences-over-future-states is that they have reflective consistency. However, I will argue shortly that we can get reflective consistency in other ways.)
Also related is The Problem of Fully-Updated Deference: Naively you might expect to get corrigibility if your AI’s preferences are something like “I, the AI, prefer whatever future states that my human overseer would prefer”. But that doesn’t really work. Instead of acting corrigibly, you might find that your AI resists shutdown, kills you and disassembles your brain to fully understand your preferences over future states, and then proceeds to create whatever those preferred future states are.
See also Eliezer Yudkowsky discussing the "anti-naturalness" of corrigibility in conversation with Paul Christiano here, and with Richard Ngo here. My impression is that, in these links, Yudkowsky is suggesting that powerful AGIs will purely have preferences over future states.
My corrigibility proposal sketch
Maybe I’m being thickheaded, but I’m just skeptical of this whole enterprise. I’m tempted to declare that “preferences purely over future states” are just fundamentally counter to corrigibility. When I think of “being able to turn off the AI when we want to”, I see it as not a future-state-kind-of-thing. And if we humans in fact have some preferences that are not about future states, then it’s folly for us to build AIs that purely have preferences over future states.
So, here’s my (obviously-stripped-down) proposal for a corrigible paperclip maximizer:
The AI considers different possible plans (a.k.a. time-extended courses of action). For each plan:
- It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
- It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
- It combines these two assessments (e.g. weighted average or something more complicated) to pick a winning plan which scores well on both. [somewhat-related link]
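The three steps above can be sketched in a few lines of Python. This is only a toy under loud assumptions: the `Plan` class, the hand-written scores in [0, 1], and the two combiner functions are all hypothetical stand-ins. In particular, the hard part (how an AI would actually compute the two pattern-matching assessments) is replaced by made-up numbers.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Plan:
    description: str
    paperclip_score: float  # hypothetical match to "lots of paperclips eventually"
    control_score: float    # hypothetical match to "the humans will remain in control"


Combiner = Callable[[float, float], float]


def weighted_average(paperclips: float, control: float, control_weight: float = 0.5) -> float:
    """One possible way to combine the two assessments."""
    return (1.0 - control_weight) * paperclips + control_weight * control


def worst_of_both(paperclips: float, control: float) -> float:
    """A stricter alternative: a plan is only as good as its weaker assessment."""
    return min(paperclips, control)


def pick_plan(plans: Sequence[Plan], combine: Combiner = weighted_average) -> Plan:
    """Return the plan with the best combined score."""
    return max(plans, key=lambda plan: combine(plan.paperclip_score, plan.control_score))


# Made-up candidate plans with made-up scores, purely for illustration.
candidate_plans = [
    Plan("seize every factory and disable the off-switch", 0.95, 0.05),
    Plan("run one factory under human oversight", 0.70, 0.90),
    Plan("do nothing", 0.05, 0.95),
]
winner = pick_plan(candidate_plans)
```

With these numbers, both combiners pick the oversight plan, since it is the only candidate that scores well on both assessments. The choice of combiner is left open here, just as it is in the sketch.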
Note that “the humans will remain in control” is a concept that can’t be distilled into a ranking of future states, i.e. states of the world at some future time long after the plan is complete. (See this comment for elaboration.) Human world-model concepts are very often like that! For example, pause for a second and think about the human concept of “going to the football game”. It’s a big bundle of associations containing immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing it, and expectations of what will result after we finish doing it, etc. etc. We humans are perfectly capable of pattern-matching to these kinds of time-extended concepts, and I happen to expect that future AGIs will be as well.
By contrast, “there will be lots of paperclips” can be distilled into a ranking of future states.
There’s a lesson here: I claim that consequentialism is not all-or-nothing. We can build agents that have preferences about future states and have preferences about other things, just as humans do.
Objection 1: How exactly does the AI learn these two abstract concepts? What happens in weird out-of-distribution situations where the concepts break down?
Just like humans, the AI can learn abstract concepts by reading books or watching YouTube or whatever. Presumably this would involve predictive (self-supervised) learning, and maybe other things too. And just like humans, the AI can do out-of-distribution detection by looking at how the associations defining the concept get out-of-sync with each other. I didn’t draw any out-of-distribution handling system in the above diagram, but we can imagine that the AI detects plans that go into weird places where its preferred concepts break down, and either subtracts points from them, or (somehow) queries the human for clarification. (Related posts: model splintering and alignment by default.)
Maybe it sounds like I’m brushing off this question. I actually think this is a very important and hard and open question. I don’t pretend for a second that the previous paragraph has answered it. I’ll have more to say about it in future posts. But I don’t currently know any argument that it’s a fundamental problem that dooms this whole approach. I think that’s an open question.
Relatedly, I wouldn’t bet my life that the abstract concept of “the humans remain in control” is exactly the thing we want, even if that concept can be learned properly. Maybe we want the conjunction of several abstract concepts? “I’m being helpful” / “I’m behaving in a way that my programmers intended” also seems promising. (The latter AI would presumably satisfy the stronger notion of Paul-corrigibility, not just the weaker notion of MIRI-corrigibility.) Anyway, this is another vexing open question that’s way beyond the scope of this post.
Objection 2: What if the AI self-modifies to stop being corrigible? What if it builds a non-corrigible successor?
Presumably a sufficiently capable AI would self-modify to stop being corrigible because it planned to, and such a plan would certainly score very poorly on its “the humans will remain in control” assessment. So the plan would get a bad aggregate score, and the AI wouldn’t do it. Ditto with building a non-corrigible successor.
This doesn't completely answer the objection—for example, what if the AI unthinkingly / accidentally does those things?—but it's enough to make me hopeful.
Objection 3: This AI is not competitive, compared to an AI that has pure preferences over future states. (Its “alignment tax” is too high.)
The sketch above is an AI that can brainstorm, and learn, and invent, and debug its own source code, and come up with brilliant foresighted plans and execute them. Basically, it can and will do human-out-of-the-loop long-term consequentialist planning. All the things that I really care about AIs being able to do (e.g. do creative original research on the alignment problem, invent new technologies, etc.) are things that this AI can definitely do.
As evidence, consider that humans have both preferences concerning future states and preferences concerning other things, and yet humans have nevertheless been able to do numerous very impressive things, like inventing rocket engines and jello shots.
Do I have competitiveness concerns? You betcha. But they don't come from anything in the basic sketch diagram above. Instead my competitiveness concerns would be:
- An AI that cares only about future states will be more effective at bringing about future states than an AI that cares about both future states and other things. (For example, an AI that cares purely about future paperclips will create more future paperclips than an AI that has preferences about both future paperclips and “humans remaining in control”.) But I don't really see that as an AI design flaw, but rather an inevitable aspect of the strategic landscape that we find ourselves in. By the same token, an AI with a goal of "maximize human flourishing" is less powerful than an AI that can freely remove all the oxygen from the atmosphere to prevent its self-replicating nano-factories from rusting. We still have to deal with this kind of stuff, but I see it as mostly outside the scope of technical AGI safety research.
- There are a lot of implementation details not shown in that sketch above, such as the stuff I discussed when answering “Objection 1” above. To make all those implementation details work reliably (if that's even possible), it’s quite possible that we would need extra safety measures—humans-in-the-loop, conservatism, etc.—and those could involve problematic tradeoffs between safety and competitiveness.
What am I missing? Very open to feedback. :)
(Thanks Adam Shimi for critical comments on a draft.)
Thanks for writing this up! I appreciate the summarisation achieved by the background sections, and the clear claims made in bold in the sketch.
The "preferences (purely) over future states" and "preferences over trajectories" distinction is getting at something, but I think it's broken for a couple of reasons. I think you've come to a similar position by noticing that people have preferences both over states and over trajectories. But I remain confused about the relationship between the two posts (Yudkowsky and Shah) you mentioned at the start. Anyway, here are my reasons:
One is that states contain records. This means the state of the world "long after I finish the course of action" may depend on "what happens during the course of action", i.e., the central version of preferences over future states can be an instance of preferences over trajectories. A stark example of this is a world with an "author's logbook" into which is written every event as it happens - preferences over trajectories can be realised as preferences over what's written in the book far in the future. There's a subtle difference arising from how manipulable the records are: preferences over trajectories, realised via records, depend on accurate records. But I would say that our world is full of records, and manipulating all of them consistently and surreptitiously is impossible.
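As a rough illustration of the logbook point, the Python sketch below uses a toy world whose state carries a record of every action taken. All names (`WorldState`, `step`, the one-dimensional `position`) are hypothetical, and the "preference" is an arbitrary example ("never move left"). It shows the same preference evaluated once on the trajectory and once on the final state alone, and then what happens when the record is falsified.

```python
from dataclasses import dataclass

MOVES = {"left": -1, "stay": 0, "right": 1}


@dataclass(frozen=True)
class WorldState:
    position: int
    logbook: tuple[str, ...] = ()  # record of every action taken so far


def step(state: WorldState, action: str) -> WorldState:
    """Apply an action and append it to the logbook."""
    return WorldState(state.position + MOVES[action], state.logbook + (action,))


def final_state(actions: list[str]) -> WorldState:
    state = WorldState(position=0)
    for action in actions:
        state = step(state, action)
    return state


def never_moved_left(actions: list[str]) -> bool:
    """Preference over the trajectory: judged from the actions actually taken."""
    return "left" not in actions


def logbook_shows_no_left(state: WorldState) -> bool:
    """Preference over a future state: judged only from the record in that state."""
    return "left" not in state.logbook


direct = ["right", "stay"]
detour = ["left", "right", "right"]  # ends at the same position as `direct`
# The detour's end position paired with a falsified record.
forged = WorldState(position=1, logbook=("right", "stay"))
```

With an accurate logbook the two evaluations agree on both trajectories, even though `direct` and `detour` end at the same position. The `forged` state passes the state-based check despite not corresponding to a left-free history, which is the manipulability caveat in miniature.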
The other issue has to do with the variable granularity of time. When I'm considering a school day, it may be natural to think of a trajectory through different periods and breaks, and have preferences over different ways the schedule could be laid out (e.g., breaks evenly spaced, or clumped together), but to treat each period itself as a state -- maybe I really like history class so conspire to end up there from wherever I am (preferences over states) or I like an evenly balanced day (preferences over trajectories). But when considering what I'm learning in that history class, dividing a single day into multiple states may seem ridiculous, instead I'm considering states like "the Iron age" or "the 21st century" -- and again could have preferences for certain states, or for trajectories through the states.
(An additional point I would make here is that Newtonian, external, universal time -- space as a grand arena in which events unfold like clockwork -- is not how time works in our universe. This means if we build plans for a corrigible superintelligence on a "timestep-based" view of the world (what you get out of the cybernetic model / agent-environment interaction loop), they're going to fall apart unless we're very careful in thinking about what the states and timesteps in the model actually mean.)
I would propose instead that we focus on "preferences over outcomes", rather than states or trajectories. This makes it clear that some judgement is required to figure out what counts as an outcome, and how to determine whether it has obtained. This may depend on temporally extended information - trajectory information if you like - but not necessarily "all" of it. I think what you called "preferences (purely) over future states" is coming from a "preferences over outcomes" point of view, and it's a mistake to rule corrigibility out of outcomes.
Thanks! Hmm. I think there's a notion of "how much a set of preferences gives rise to stereotypically-consequentialist behavior". Like, if you see an agent behaving optimally with respect to preferences about "how the world will be in 10 years", they would look like a consequentialist goal-seeking agent. Even if you didn't know what future world-states they preferred, you would be able to guess with high confidence that they preferred some future world-states over others. For example, they would almost certainly pursue convergent instrumental subgoals like power-seeking. By contrast, if you see an agent which, at any time, behaves optimally with respect to preferences about "how the world will be in 5 seconds", it would look much less like that, especially if after each 5-second increment they roll a new set of preferences. And an agent which, at any time, behaves optimally with respect to preferences over what it's doing right now would look not at all like a consequentialist goal-seeking agent.
(We care about "looking like a consequentialist goal-seeking agent" because corrigible AIs do NOT "look like a consequentialist goal-seeking agent".)
Now we can say: By the time-reversibility of the laws of physics, a rank-ordering of "states-of-the-world at future time T (= midnight on January 1 2050)" is equivalent to a rank-ordering of "universe-histories up through future time T". But I see that as kinda an irrelevant technicality. An agent that makes decisions myopically according to (among other things) a "preference for telling the truth right now" in the universe-history picture would cash out as "some unfathomably complicated preference over the microscopic configuration of atoms in the universe at time T". And indeed, an agent with that (unfathomably complicated) preference ordering would not look like a consequentialist goal-seeking agent.
So by the same token, it's not that there's literally no utility function over "states of the world at future time T" that incentivizes corrigible behavior all the way from now to T, it's just that there may be no such utility function that can be realistically defined.
Turning more specifically to record-keeping mechanisms, consider an agent with preferences pertaining to what will be written in the logbook at future time T. Let's take two limiting cases.
One limiting case is: the logbook can be hacked. Then the agent will hack into it. This looks like consequentialist goal-seeking behavior.
The other limiting case is: the logbook is perfect and unbreachable. Then I'd say that it no longer really makes sense to describe this as "an AI with preferences over the state of the world at future time T". It's more helpful to think of this as "an AI with preferences over universe-histories", and by the way an implementation detail is that there's this logbook involved in how we designed the AI to have this preference. And indeed, the AI will now look less like a consequentialist goal-seeking agent. (By the way I doubt we would actually design an AI using a literal logbook.)
I'm a bit confused what you're saying here.
It is conceivable to have an AI that makes decisions according to a rank-ordering of the state of the world at future time T = midnight January 1 2050. My impression is that Eliezer has that kind of thing in mind—e.g. "imagine a paperclip maximizer as not being a mind at all, imagine it as a kind of malfunctioning time machine that spits out outputs which will in fact result in larger numbers of paperclips coming to exist later" (ref). I'm suggesting that this is a bad idea if we do it to the exclusion of every other type of preference, but it is possible.
On the other hand, I intended "preferences over trajectories" to be maximally vague—it rules nothing out.
I think our future AIs can have various types of preferences. It's quite possible that none of those preferences would look like a rank-ordering of states of the world at a specific time T, but some of them might be kinda similar, e.g. a preference for "there will eventually be paperclips" but not by any particular deadline. Is that what you mean by "outcome"? Would it have helped if I had replaced "preferences over trajectories" with the synonymous "preferences that are not exclusively about the future state of the world"?
Thanks for the reply! My comments are more thinking-in-progress and less robust-conclusions than I’d like, but I figure that’s better than nothing.
(Thanks for doing that!) I was going to answer ‘yes’ here, but… having thought about this more, I guess I now find myself confused about what it means to have preferences in a way that doesn't give rise to consequentialist behaviour. Having (unstable) preferences over “what happens 5 seconds after my current action” sounds to me like not really having preferences at all. The behaviour is not coherent enough to be interpreted as preferring some things over others, except in a contrived way.
Your proposal is to somehow get an AI that both produces plans that actually work and cares about being corrigible. I think you’re claiming that the main perceived difficulty with combining these is that corrigibility is fundamentally not about preferences over states whereas working-plans is about preferences over states. Your proposal is to create an AI with preferences both about states and not.
I would counter that how to specify (or precisely, incentivize) preferences for corrigibility remains as the main difficulty, regardless of whether this means preferences over states or not. If you try to incentivize corrigibility via a recognizer for being corrigible, the making-plans-that-actually-work part of the AI effectively just adds fooling the recognizer to its requirements for actually working.
In your view does it make sense to think about corrigibility as constraints on trajectories? Going with that for now… If the constraints were simple enough, we could program them right into the action space - as in a board-game playing AI that cannot make an invalid move and therefore looks like it cares about both reaching the final win state and about satisfying the never-makes-an-invalid-move constraint on its trajectory. But corrigibility is not so simple that we can program it into the action space in advance. I think what the corrigibility constraint consists of may grow in sophistication with the sophistication of the agent’s plans. It seems like it can’t just be factored out as an additional objective because we don’t have a foolproof specification of that additional objective.
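For the simple board-game case, here is a minimal Python sketch of a constraint baked into the action space. The game is an assumed stand-in (a Nim variant: take 1 to 3 stones, whoever takes the last stone wins), and `legal_moves`, `can_win`, and `choose_move` are hypothetical names. The planner only ever searches over `legal_moves`, so it never makes an invalid move, yet there is no "validity" term anywhere in its objective.

```python
from functools import lru_cache

MAX_TAKE = 3  # a move removes 1..MAX_TAKE stones; taking the last stone wins


def legal_moves(stones: int) -> list[int]:
    """The action space itself encodes the rules: illegal moves are never generated."""
    return [k for k in range(1, MAX_TAKE + 1) if k <= stones]


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win from this position."""
    # With 0 stones there are no legal moves, so any(...) is False: the mover has lost.
    return any(not can_win(stones - k) for k in legal_moves(stones))


def choose_move(stones: int) -> int:
    """Pick a winning move if one exists, else any legal move. Requires stones > 0."""
    for k in legal_moves(stones):
        if not can_win(stones - k):
            return k
    return legal_moves(stones)[0]
```

For example, `choose_move(5)` takes one stone, leaving the opponent on a losing multiple of four. The contrast with corrigibility is that nobody knows how to write the analogue of `legal_moves` for "plans that keep humans in control".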
Thanks, this is helpful!
Oh, sorry, I'm thinking of a planning agent. At any given time it considers possible courses of action, and decides what to do based on "preferences". So "preferences" are an ingredient in the algorithm, not something to be inferred from external behavior.
That said, if someone "prefers" to tell people what's on his mind, or if someone "prefers" to hold their fork with their left hand … I think those are two examples of "preferences" in the everyday sense of the word, but that they're not expressible as a rank-ordering of the state of the world at a future date.
Instead of "desire to be corrigible", I'll switch to something more familiar: "desire to save the rainforest".
Let's say my friend Sally is "trying to save the rainforest". There's no "save the rainforest detector" external to Sally, which Sally is trying to satisfy. Instead, the "save the rainforest" concept is inside Sally's own head.
When Sally decides to execute Plan X because it will help save the rainforest, that decision is based on the details of Plan X as Sally herself understands it.
Let's also assume that Sally's motivation is ego-syntonic (which we definitely want for our AGIs): In other words, Sally wants to save the rainforest and Sally wants to want to save the rainforest.
Under those circumstances, I don't think saying something like "Sally wants to fool the recognizer" is helpful. That's not an accurate description of her motivation. In particular, if she were offered an experience machine or brain-manipulator that could make her believe that she has saved the rainforest, without all the effort of actually saving the rainforest, she would emphatically turn down that offer.
So what can go wrong?
Let's say Sally and Ahmed are working at the same rainforest advocacy organization. They're both "trying to save the rainforest", but maybe those words mean slightly different things to them. Let's quiz them with a list of 20 weird out-of-distribution hypotheticals:
Presumably Sally and Ahmed will give different answers, and this could conceivably shake out as Sally taking an action that Ahmed strongly opposes or vice-versa, even though they nominally share the same goal.
You can describe that as "Sally is narrowly targeting the save-the-rainforest-recognizer-in-Sally's-head, and Ahmed is narrowly targeting the save-the-rainforest-recognizer-in-Ahmed's-head, and each sees the other as Goodhart'ing a corner-case where their recognizer is screwing up."
That's definitely a problem, and that's the kind of stuff I was talking about under "Objection 1" in the post, where I noted the necessity of out-of-distribution detection systems perhaps related to Stuart Armstrong's "model splintering" ideas etc.
Does that help?
I've been gingerly building my way up toward similar ideas but I haven't yet posted my thoughts on the subject. I appreciate you ripping the band-aid off.
There are two obvious ways an intelligence can be non-consequentialist.
If you define intelligence to be consequentialist then corrigibility becomes extremely difficult for the reasons Eliezer Yudkowsky has expounded ad nauseam. If you create a non-consequentialist intelligence then corrigibility is almost the default, especially with regard to stateless intelligences. A stateless intelligence has no external world to optimize. This isn't a side-effect of it being stupid or boxed. It's a fundamental constraint of the software paradigm the machine learning architecture is embedded in.
It's easier to build local systems than consequentialist systems because the components available to us are physical objects and physics is local. Consequentialist systems are harder to construct because world-optimizers are (practically-speaking) non-local. Building a(n effectively) non-local system out of local elements can be done, but it is hard. Consequentialist is harder than local; local is harder than stateless. Stateless systems are easier to build than both local systems and consequentialist systems because mathematics is absolute.
I don't think you're being thickheaded. I think you're right. Human beings are so trajectory-dependent it's a cliché. "Life is not about the destination. Life is about the friends we made along the way."
This is not to say I completely agree with all the claims in the article. Your proposal for a corrigible paperclip maximizer appears consequentialist to me because the two elements of its value function "there will be lots of paperclips" and "humans will remain in control" are both statements about the future. Optimizing a future state is consequentialism. If the "humans will remain in control" value function has bugs (and it will) then the machine will turn the universe into paperclips. A non-consequentialist architecture shouldn't require a "human will remain in control" value function. There should be no mechanism for the machine to consequentially interfere with its masters' intentions at all.
Thanks for the comment!
I feel like I'm stuck in the middle…
I disagree with the 2nd camp for the same reason Eliezer does: I don't think those AIs are powerful enough. More specifically: We already have neat AIs like GPT-3 that can do lots of neat things. But we have a big problem: sooner or later, somebody is going to come along and build a dangerous accident-prone consequentialist AGI. We need an AI that's both safe, and powerful enough to solve that big problem. I usually operationalize that as "able to come up with good original creative ideas in alignment research, and/or able to invent powerful new technologies". I think that, for an AI to do those things, it needs to do explicit means-end reasoning, autonomously come up with new instrumental goals and pursue them, etc. etc. For example, see discussion of "RL-on-thoughts" here.
"Humans will eventually wind up in control" is purely about future states. "Humans will remain in control" is not. For example, consider a plan that involves disempowering humans and then later re-empowering them. That plan would pattern-match well to "humans will eventually wind up in control", but it would pattern-match poorly to "humans will remain in control".
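The difference between the two concepts can be shown with a tiny Python sketch. It rests on a loud simplification: a plan is reduced to a hypothetical list of booleans, one per timestep, saying whether humans are in control at that step. The real concepts are fuzzy pattern-matches, not boolean predicates.

```python
def eventually_in_control(trajectory: list[bool]) -> bool:
    """Looks only at the final state of the plan."""
    return trajectory[-1]


def remain_in_control(trajectory: list[bool]) -> bool:
    """Looks at every step along the way."""
    return all(trajectory)


# Hypothetical plans: is the human in control at each timestep?
steady = [True, True, True, True]
detour = [True, False, False, True]  # disempower the humans, then re-empower them
```

Both plans satisfy `eventually_in_control`, but only `steady` satisfies `remain_in_control`. No function of the last element alone can tell the two plans apart.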
Yes, this is a very important potential problem, see my discussion under "Objection 1".
In section 2.1 of the Indifference paper the reward function is defined on histories. In section 2 of the corrigibility paper, the utility function is defined over (action1, observation, action2) triples—which is to say, complete histories of the paper's three-timestep scenario. And section 2 of the interruptibility paper specifies a reward at every timestep.
I think preferences-over-future-states might be a simplification used in thought experiments, not an actual constraint that has limited past corrigibility approaches.
Interesting, thanks! Serves me right for not reading the "Indifference" paper!
I think the discussions here and especially here are strong evidence that at least Eliezer & Nate are expecting powerful AGIs to be pure-long-term-consequentialist. (I didn't ask, I'm just going by what they wrote.) I surmise they have a (correct) picture in their head of how super-powerful a pure-long-term-consequentialist AI can be—e.g. it can self-modify, it can pursue creative instrumental goals, it's reflectively stable, etc.—but they have not similarly envisioned a partially-but-not-completely-long-term-consequentialist AI that is only modestly less powerful (and in particular can still self-modify, can still pursue creative instrumental goals, and is still reflectively stable). That's what "My corrigibility proposal sketch" was trying to offer.
I'll reword to try to describe the situation better, thanks again.
I have not read the whole comment section, so this feedback may already have been given, but...
Opinions differ on how open the problem remains. Definitely, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and seems to have given up on making progress on it, or believing that anybody else has made progress or can make progress. I on the other hand believe that the problem of figuring out how to make indifference methods work is largely closed. I have written papers on it, for example here. But you have told me before you have trouble reading my work, so I am not sure I can help you any further.
My impression is that Yudkowsky only cares about designing the type of powerful AGIs that will purely have preferences over future states. My impression is that he considers AGIs which do not purely have preferences over future states to be useless to any plan that might save the world from x-risk. In fact, he feels that these latter AGIs are not even worthy of the name AGI. At the same time, he worries that these consequentialist AGIs he wants will kill everybody, if some idiot gives them the wrong utility function.
This worry is of course entirely valid, so my own ideas about safe AGI designs tend to go heavily towards favouring designs that are not purely consequentialist AGIs. My feeling is that Yudkowsky does not want to go there, design-wise. He has locked himself into a box, and refuses to think outside of it, to the extent that he even believes that there is no outside.
As you mention above, if you want to construct a value function component that measures 'humans stay in control', this is very possible. But you will have to take into account that a whole school of thought on this forum will be all too willing to criticise your construction for not being 100.0000% reliable, for having real or imagined failure modes, for not being the philosophical breakthrough they really want to be reading about. This can give you a serious writer's block, if you are not careful.