## AI ALIGNMENT FORUMAF

Caspar Oesterheld

Sorted by New

# Wiki Contributions

Formalizing Objections against Surrogate Goals

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post).

On this view, it doesn't help substantially to understand / analyze SPIs formally.

But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.

Formalizing Objections against Surrogate Goals

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem do matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)

My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X. Here are three instances of this (I think these are the only three), the first one being most significant.

The main objection of the post is that while adopting an SPI, the original players must keep a bunch of things (at least approximately) constant(/analogous to the no-SPI counterfactual) even when they have an incentive to change that thing, and they need to do this credibly (or, rather, make it credible that they aren't making any changes). You argue that this is often unrealistic. Well, the initial reaction of mine was: "Sure, I know these things!" (Relatedly: while I like the bandit v caravan example, this point can also be illustrated with any of the existing examples of SPIs and surrogate goals.) I also don't think the assumption is that unrealistic. It seems that one substantial part of your complaint is that besides instructing the representative/self-modifying the original player/principal can do other things about the threat (like advocating a ban on real or water guns). I agree that this is important. If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway. But it's easy to come up scenarios where this is not a problem. E.g., when an agent considers immediate self-modification, *all* her future decisions will be guided by the modified u.f. Or when the SPI is applied to some isolated interaction. When all is in the representative's hand, we only need to ensure that the *representative* always acts in whatever way the representative acts in the same way it would act in a world where SPIs aren't a thing.

And I don't think it's that difficult to come up with situations in which the latter thing can be comfortably achieved. Here is one scenario. Imagine the two of us play a particular game G with SPI G'. The way in which we play this is that we both send a lawyer to a meeting and then the lawyers play the game in some way. Then we could could mutually commit (by contract) to pay our lawyers in proportion to the utilities they obtain in G' (and to not make any additional payments to them). The lawyers at this point may know exactly what's going on (that we don't really care about water guns, and so on) -- but they are still incentivized to play the SPI game G' to the best of their ability. You might even beg your lawyer to never give in (or the like), but the lawyer is incentivized to ignore such pleas. (Obviously, there could still be various complications. If you hire the lawyer only for this specific interaction and you know how aggressive/hawkish different lawyers are (in terms of how they negotiate), you might be inclined to hire a more aggressive one with the SPI. But you might hire the lawyer you usually hire. And in practice I doubt that it'd be easy to figure out how hawkish different lawyers are.

Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic). I don't remember Tobi's posts very well, but our paper definitely doesn't spend much space on discussing these important questions.

On SPI selection, I think the point from Section 10 of our paper is quite important, especially in the kinds of games that inspired the creation of surrogate goals in the first place. I agree that in some games, the SPI selection problem is no easier than the equilibrium selection problem in the base game. But there are games where it does fundamentally change things because *any* SPI that cannot further be Pareto-improved upon drastically increases your utility from one of the outcomes.

Re the "Bargaining in SPI" section: For one, the proposal in Section 9 of our paper can still be used to eliminate the zeroes!

Also, the "Bargaining in SPI" and "SPI Selection" sections to me don't really seem like "objections". They are limitations. (In a similar way as "the small pox vaccine doesn't cure cancer" is useful info but not an objection to the small pox vaccine.)

Extracting Money from Causal Decision Theorists

>If I win I get $6. If I lose, I get$5.

I assume you meant to write: "If I lose, I lose \$5."

Yes, these are basically equivalent. (I even mention rock-paper-scissors bots in a footnote.)

Predictors exist: CDT going bonkers... forever

Apologies, I only saw your comment just now! Yes, I agree, CDT never strictly prefers randomizing. So there are agents who abide by CDT and never randomize. As our scenarios show, these agents are exploitable. However, there could also be CDT agents who, when indifferent between some set of actions (and when randomization is not associated with any cost), do randomize (and choose the probability according to some additional theory -- for example, you could have the decision procedure: "follow CDT, but when indifferent between multiple actions, choose a distribution over these actions that is ratifiable".). The updated version of our paper -- which has now been published Open Access in The Philosophical Quarterly -- actually contains some extra discussion of this in Section IV.1, starting with the paragraph "Nonetheless, what happens if we grant the buyer in Adversarial Offer access to a randomisation device...".

In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy

Sorry for taking an eternity to reply (again).

On the first point: Good point! I've now finally fixed the SSA probabilities so that they sum up to 1, which really they should, to really have a version of EDT.

>prevents coordination between agents making different observations.

Yeah, coordination between different observations is definitely not optimal in this case. But I don't see an EDT way of doing it well. After all, there are cases where given one observation, you prefer one policy and given another observation you favor another policy. So I think you need the ex ante perspective to get consistent preferences over entire policies.

>(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)

The only significance is to get a version of EDT, which we would traditionally assume to have self-locating beliefs. From a purely mathematical point of view, I think it's nonsense.

Predictors exist: CDT going bonkers... forever

>Caspar Oesterheld and Vince Conitzer are also doing something like this

That paper can be found at https://users.cs.duke.edu/~ocaspar/CDTMoneyPump.pdf . And yes, it is structurally essentially the same as the problem in the post.

Pavlov Generalizes

Not super important but maybe worth mentioning in the context of generalizing Pavlov: the strategy Pavlov for the iterated PD can be seen as an extremely shortsighted version of the law of effect, which basically says: repeat actions that have worked well in the past (in similar situations). Of course, the LoE can be applied in a wide range of settings. For example, in their reinforcement learning textbook, Sutton and Barto write that LoE underlies all of (model-free) RL.

CDT=EDT=UDT

> I tried to understand Caspar’s EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?

Sorry about that! I'll try to explain it some more. Let's take the original AMD. Here, the agent only faces a single type of choice -- whether to EXIT or CONTINUE. Hence, in place of a policy we can just condition on when computing our SSA probabilities. Now, when using EDT+SSA, we assign probabilities to being a specific instance in a specific possible history of the world. For example, we assign probabilities of the form , which denotes the probability that given I choose to CONTINUE with probability , history (a.k.a. CONTINUE, EXIT) is actual and that I am the instance intersection (i.e., the first intersection). Since we're using SSA, these probabilities are computed as follows:

That is, we first compute the probability that the history itself is actual (given ). Then we multiply it by the probability that within that history I am the instance at , which is just 1 divided by the number of instances of myself in that history, i.e. 2.

Now, the expected value according to EDT + SSA given can be computed by just summing over all possible situations, i.e. over all combinations of a history and a position within that history and multiplying the probability of that situation with the utility given that situation:

And that's exactly the ex ante expected value (or UDT-expected value, I suppose) of continuing with probability . Hence, EDT+SSA's recommendation in AMD is the ex ante optimal policy (or UDT's recommendation, I suppose). This realization is not original to myself (though I came up with it independently in collaboration with Johannes Treutlein) -- the following papers make the same point:

• Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
• Wolfgang Schwarz (2015): Lost memories and useless coins: revisiting the absentminded driver. In: Synthese. https://www.umsu.de/papers/driver-2011.pdf

My comment generalizes these results a bit to include cases in which the agent faces multiple different decisions.

Dutch-Booking CDT
Caspar Oesterheld is working on similar ideas.

For anyone who's interested, Abram here refers to my work with Vincent Conitzer which we write about here.

ETA: This work has now been published in The Philosophical Quarterly.