All of Caspar42's Comments + Replies

Nice post!

What would happen in your GPT-N fusion reactor story if you asked it a broader question about whether it is a good idea to share the plans?

Perhaps relatedly:

>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned...

3 johnswentworth 5mo
My response to the "get the AI to tell us what questions we need to ask" is that it fails for multiple reasons, any one of which is sufficient for failure. One of them is the verifiability issue. Another is the Gell-Mann Amnesia thing (which you could view as just another frame on the verifiability issue, but up a meta level). Another is the "get what we measure" problem. Another failure mode which this post did not discuss is the Godzilla Problem [https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/DwqgLXn5qYC7GqExF]. In the frame of this post: in order to work in practice the iterative design loop needs to be able to self-correct; if we make a mistake at one iteration it must be fixable at later iterations. "Get the AI to tell us what questions we need to ask" fails that test; just one iteration of acting on malicious advice from an AI can permanently break the design loop.

Sounds interesting! Are you going to post the reading list somewhere once it is completed?

(Sorry for self-promotion in the below!)

I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.

Here's a pitch in the language of incentivizing AI systems -- the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:
1) It makes predictions about the world.
2) It takes actions that influence the world. (In the pape...

Cool that this is (hopefully) being done! I have had this on my reading list for a while and since this is about the kind of problems I also spend much time thinking about, I definitely have to understand it better at some point. I guess I can snooze it for a bit now. :P Some suggestions:

Maybe someone could write an FAQ page? Also, a somewhat generic idea is to write something that is more example based, perhaps even something that just solely gives examples. Part of why I suggest these two is that I think they can be written relatively mechanically and th...

I now have a draft for a paper that gives this result and others.

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs wor...

3 Ofer 1y
Regarding the following part of the view that you commented on: Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race [https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem] in logical time. (I think this is a special case of a more general argument made in Multiverse-wide Cooperation via Correlated Decision Making [https://longtermrisk.org/multiverse-wide-cooperation-via-correlated-decision-making/] about how it may be important to make certain commitments before discovering certain crucial considerations.)

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that ...

1 Vojtech Kovarik 1y
I definitely don't think (C) and the "any" variant of (B). Less sure about the "most" variant of (B), but I wouldn't bet on that either. I do believe (D), mostly because I don't think that humans will be able to make the necessary commitments (in the sense mentioned in the thread with Rohin). I am not super sure about (A). My bet is that to the extent that SPI can work for humans, we are already using it (or something equivalent) in most situations. But perhaps some exceptions will work, like the lawyer example? (Although I suspect that our skill at picking hawkish lawyers is stronger than we realize. Or there might be existing incentives where lawyers are being selected for hawkishness, because we are already using them for something-like-SPI? Overall, I guess that the more one-time-only an event is, the higher is the chance that the pre-existing selection pressures will be weak, and (A) might work.) That is a good point. I will try to expand on it, perhaps at least in a comment here once I have time, or so :-).
2 Vojtech Kovarik 1y
Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.

>If I win I get $6. If I lose, I get $5.

I assume you meant to write: "If I lose, I lose $5."

Yes, these are basically equivalent. (I even mention rock-paper-scissors bots in a footnote.)

Apologies, I only saw your comment just now! Yes, I agree, CDT never strictly prefers randomizing. So there are agents who abide by CDT and never randomize. As our scenarios show, these agents are exploitable. However, there could also be CDT agents who, when indifferent between some set of actions (and when randomization is not associated with any cost), do randomize (and choose the probability according to some additional theory -- for example, you could have the decision procedure: "follow CDT, but when indifferent between multiple actions, choose a dis...

Sorry for taking an eternity to reply (again).

On the first point: Good point! I've now finally fixed the SSA probabilities so that they sum to 1, which they need to do for this to really be a version of EDT.

>prevents coordination between agents making different observations.

Yeah, coordination between different observations is definitely not optimal in this case. But I don't see an EDT way of doing it well. After all, there are cases where given one observation, you prefer one policy and given another observation you favor another policy. So I ...

>Caspar Oesterheld and Vince Conitzer are also doing something like this

That paper can be found at https://users.cs.duke.edu/~ocaspar/CDTMoneyPump.pdf . And yes, it is structurally essentially the same as the problem in the post.

2 Stuart Armstrong 3y
Cool! I notice that you assumed there were no independent randomising devices available. But why would the CDT agent ever opt to use a randomising device? Why would it see that as having value?

Not super important but maybe worth mentioning in the context of generalizing Pavlov: the strategy Pavlov for the iterated PD can be seen as an extremely shortsighted version of the law of effect, which basically says: repeat actions that have worked well in the past (in similar situations). Of course, the LoE can be applied in a wide range of settings. For example, in their reinforcement learning textbook, Sutton and Barto write that LoE underlies all of (model-free) RL.

2 Abram Demski 4y
Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner's Dilemma, so I think looking at it that way is missing something important about what's going on.
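
As a rough illustration of the "shortsighted law of effect" reading of Pavlov discussed above, here is a minimal sketch of win-stay, lose-shift in the iterated prisoner's dilemma. The payoff numbers and the aspiration level are assumptions chosen for illustration (the usual textbook values), and the names `PAYOFF`, `ASPIRATION`, `pavlov`, and `play` are hypothetical, not taken from the thread.

```python
# Assumed standard Prisoner's Dilemma payoffs for the first player (not from the thread).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Hypothetical aspiration level: payoffs at or above it count as "worked well".
ASPIRATION = 2


def pavlov(my_last, opp_last):
    """Win-stay, lose-shift: repeat the last action if it paid off, else switch."""
    if PAYOFF[(my_last, opp_last)] >= ASPIRATION:
        return my_last
    return "D" if my_last == "C" else "C"


def play(first_a, first_b, rounds):
    """Run two Pavlov players from given opening moves; return the move history."""
    history = [(first_a, first_b)]
    for _ in range(rounds - 1):
        a, b = history[-1]
        history.append((pavlov(a, b), pavlov(b, a)))
    return history


if __name__ == "__main__":
    # One player opens with a defection; the pair recovers mutual cooperation.
    print(play("D", "C", 4))
```

The rule only looks at the single most recent payoff and compares it to a fixed threshold, which is the sense in which it is an extremely shortsighted instance of "repeat what worked". A generic RL learner averages over many past rewards, so it need not reproduce this behavior.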

> I tried to understand Caspar’s EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?

Sorry about that! I'll try to explain it some more. Let's take the original AMD. Here, the agent only faces a single type of choice -- whether to EXIT or CONTINUE. Hence, in place of a policy we can just condition on the action when computing our SSA probabilities. Now, when using EDT+SSA, we assign probabilities to being a specific instance in a specific possible history of the world. For example, ...

3 Wei Dai 4y
Thanks, I think I understand now, and made some observations about EDT+SSA [https://www.greaterwrong.com/posts/5bd75cc58225bf06703751b2/in-memoryless-cartesian-environments-every-udt-policy-is-a-cdt-sia-policy/comment/kuY5LagQKgnuPTPYZ] at the old thread. At this point I'd say this quote from the OP is clearly wrong: In fact UDT1.0 > EDT+SSA > CDT+SIA, because CDT+SIA is not even able to coordinate agents making the same observation [https://www.lesswrong.com/posts/WkPf6XCzfJLCm2pbK/cdt-edt-udt#Ya8msDGzRdR8yw4br], while EDT+SSA can do that but not coordinate agents making different observations [https://www.greaterwrong.com/posts/5bd75cc58225bf06703751b2/in-memoryless-cartesian-environments-every-udt-policy-is-a-cdt-sia-policy/comment/kuY5LagQKgnuPTPYZ], and UDT1.0 can (probably) coordinate agents making different observations (but seemingly at least some of them require UDT1.1 [https://www.lesswrong.com/posts/g8xh9R7RaNitKtkaa/explicit-optimization-of-global-strategy-fixing-a-bug-in] to coordinate).

>Caspar Oesterheld is working on similar ideas.

For anyone who's interested, Abram here refers to my work with Vincent Conitzer which we write about here.

ETA: This work has now been published in The Philosophical Quarterly.

My paper "Robust program equilibrium" (published in Theory and Decision) discusses essentially NicerBot (under the name ϵGroundedFairBot) and mentions Jessica's comment in footnote 3. More generally, the paper takes strategies from iterated games and transfers them into programs for the corresponding program game. As one example, tit for tat in the iterated prisoner's dilemma gives rise to NicerBot in the "open-source prisoner's dilemma".

1 Nisan 2y
See also this comment [https://www.lesswrong.com/posts/BY8kvyuLzMZJkwTHL/prisoner-s-dilemma-with-visible-source-code-tournament?commentId=yYqhG6SiuKf3W8idk] from 2013 that has the computable version of NicerBot.
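
For readers unfamiliar with the construction, here is a minimal sketch of the idea behind ϵGroundedFairBot as described above: with a small probability cooperate outright, and otherwise play whatever the opponent plays against you. It assumes programs are Python callables that receive the opponent as a callable rather than as source code, which is a simplification of the program-game setting. All names (`make_grounded_fair_bot`, `defect_bot`, `cooperate_bot`) are hypothetical.

```python
import random


def make_grounded_fair_bot(epsilon, rng):
    """Build a bot that cooperates with probability epsilon, else mirrors the opponent."""

    def bot(opponent):
        # Grounding step: with probability epsilon, cooperate without simulating.
        if rng.random() < epsilon:
            return "C"
        # Otherwise simulate the opponent playing against this bot and copy its move.
        return opponent(bot)

    return bot


def defect_bot(opponent):
    """Always defects, regardless of the opponent."""
    return "D"


def cooperate_bot(opponent):
    """Always cooperates, regardless of the opponent."""
    return "C"


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo reproducible
    a = make_grounded_fair_bot(0.2, rng)
    b = make_grounded_fair_bot(0.2, rng)
    print("vs. copy of itself:", a(b))
    print("vs. defect_bot:", [a(defect_bot) for _ in range(10)])
```

Two such bots simulate each other in a chain that ends as soon as either side hits the ϵ branch, so the mutual simulation halts with probability 1 and always ends in cooperation. Against an unconditional defector the bot defects except in the ϵ fraction of cases, which parallels tit for tat opening with cooperation and then copying the opponent.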

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.

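Given some existing policy $\pi$, EDT+SSA recommends that upon receiving observation $o$ we should choose an action from $\arg\max_a \sum_{s_1,\dots,s_n} \sum_{i=1}^{n} SSA(s_i \text{ in } s_1,\dots,s_n \mid o, \pi_{o \to a})\, U(s_n)$. (For notational simplicity, I'll assume that poli...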

2 Caspar Oesterheld 1y
I now have a draft [https://www.andrew.cmu.edu/user/coesterh/DeSeVsExAnte.pdf] for a paper that gives this result and others.
3 Wei Dai 4y
I noticed that the sum inside $\arg\max_a \sum_{s_1,\dots,s_n} \sum_{i=1}^{n} SSA(s_i \text{ in } s_1,\dots,s_n \mid o, \pi_{o \to a})\, U(s_n)$ is not actually an expected utility, because the SSA probabilities do not add up to 1 when there is more than one possible observation. The issue is that conditional on making an observation, the probabilities for the trajectories not containing that observation become 0, but the other probabilities are not renormalized. So this seems to be part way between "real" EDT and UDT (which does not set those probabilities to 0 and of course also does not renormalize). This zeroing of probabilities of trajectories not containing the current observation (and renormalizing, if one was to do that) seems at best useless busywork, and at worst prevents coordination between agents making different observations. In this formulation of EDT, such coordination is ruled out in another way, namely by specifying that conditional on o→a, the agent is still sure the rest of π is unchanged (i.e., copies of itself receiving other observations keep following π). If we remove the zeroing/renormalizing and say that the agent ought to have more realistic beliefs conditional on o→a, I think we end up with something close to UDT1.0 (modulo differences in the environment model from the original UDT). (Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)
1 Caspar Oesterheld 4y
Elsewhere [https://www.lesswrong.com/posts/WkPf6XCzfJLCm2pbK/cdt-edt-udt#rnRrXZrTzReM93PdH], I illustrate this result for the absent-minded driver.
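
To make the absent-minded driver concrete, here is a minimal sketch that computes only the ex ante optimal policy, i.e. the benchmark that the result above is about. It does not implement the EDT+SSA calculation itself. The payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for never exiting) are an assumption: they are the commonly used textbook values and are not stated in this thread. The function names are hypothetical.

```python
from fractions import Fraction


def ex_ante_utility(p, exit_first=0, exit_second=4, never_exit=1):
    """Expected utility of continuing with probability p at every intersection.

    The driver cannot tell the intersections apart, so one p governs both.
    Payoff defaults are assumed textbook values, not taken from the thread.
    """
    return (
        (1 - p) * exit_first         # exit at the first intersection
        + p * (1 - p) * exit_second  # continue once, then exit
        + p * p * never_exit         # continue twice
    )


def best_continue_probability(grid_size=300):
    """Search a grid of exact fractions for the p with highest ex ante utility."""
    candidates = [Fraction(k, grid_size) for k in range(grid_size + 1)]
    return max(candidates, key=ex_ante_utility)


if __name__ == "__main__":
    p_star = best_continue_probability()
    print("optimal continue probability:", p_star)
    print("ex ante expected utility:", ex_ante_utility(p_star))
```

Under these assumed payoffs the expected utility is 4p - 3p^2, so setting the derivative 4 - 6p to zero gives p = 2/3 with value 4/3, which the grid search recovers exactly because 2/3 lies on the grid.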

Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely "isomorphic"/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution

...
