Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.
Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the Middle East, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem to matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)
I definitely don't think (C) and the "any" variant of (B). Less sure about the "most" variant of (B), but I wouldn't bet on that either.
I do believe (D), mostly because I don't think that humans will be able to make the necessary commitments (in the sense mentioned in the thread with Rohin). I am not super sure about (A). My bet is that to the extent that SPI can work for humans, we are already using it (or something equivalent) in most situations. But perhaps some exceptions will work, like the lawyer example? (Although I suspect that our skill at picking hawkish lawyers is stronger than we realize. Or there might be existing incentives where lawyers are being selected for hawkishness, because we are already using them for something-like-SPI? Overall, I guess that the more one-time-only an event is, the higher the chance that the pre-existing selection pressures will be weak, and (A) might work.)
Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic).
That is a good point. I will try to expand on it, perhaps at least in a comment here once I have time, or so :-).
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.
Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)
But I agree that with DSA-AGIs, the second type of commitment becomes more realistic. (Although, the potential line of thinking mentioned by Caspar applies here: Perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)
That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).
You are in a world where most (1-Ɛ) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (ie, their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in relative frequency. And you have a committed caravan. If you instruct it to always resist, you get payoff 9(1-Ɛ) - 2Ɛ (using the payoff matrix from "G'; extension of G"). If you instruct it to always give in, you get payoff 4(1-Ɛ) + 3Ɛ. So it is better to instruct it to always resist.
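To make the comparison above concrete, here is a minimal Python sketch that evaluates the two expected payoffs as functions of Ɛ. The function names are hypothetical, and the numbers 9, -2, 4 and 3 are taken only from the two expressions quoted above, not from the full payoff matrix in the report.

```python
def payoff_always_resist(eps: float) -> float:
    """Expected caravan payoff 9(1-eps) - 2*eps when instructed to always resist."""
    return 9 * (1 - eps) - 2 * eps


def payoff_always_give_in(eps: float) -> float:
    """Expected caravan payoff 4(1-eps) + 3*eps when instructed to always give in."""
    return 4 * (1 - eps) + 3 * eps


def resist_advantage(eps: float) -> float:
    """Resist minus give-in; algebraically this is 5 - 10*eps."""
    return payoff_always_resist(eps) - payoff_always_give_in(eps)


if __name__ == "__main__":
    # eps is the fraction of bandits that still demand armed.
    for eps in (0.0, 0.05, 0.25, 0.5):
        print(eps, payoff_always_resist(eps), payoff_always_give_in(eps))
```

Under these numbers the difference is 5 - 10Ɛ, so "always resist" comes out ahead whenever Ɛ < 1/2, which covers the small-Ɛ case described here.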
*The only issue that comes to mind is my [responding to demand unarmed the same as it responds to demand armed] vs your [treating demand unarmed as they would have treated demand armed]? If you think there is a difference between the two, then I endorse the former and I am confused about what the latter would mean.
I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.
I agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me.
(2') Whether or not it works-as-intended depends on the larger setting and there are many settings -- more than you might initially think -- where SPI will not work-as-intended.
The reason I think (4) is a big deal is that I don't think it relies on you being unable to distinguish the SPI and non-SPI caravans. What exactly do I mean by this? I view the caravans as having two parameters: (A) do they agree to using SPI? and, independently of this, (B) if bandits ambush & threaten them (using SPI or not), do they resist or do they give in? When I talk about incentives to be more aggressive, I mean (B), not (A). That is, in the evolutionary setting, "you" (as the caravan-dispatcher / caravan-parameter-setter) will always want to tell the caravans to use SPI. But if most of the bandits also use SPI, you will want to set the (B) parameter to "always resist".
I would say that (4) relies on the bandits not being able to distinguish whether your caravan uses a different (B)-parameter from the one it would use in a world where nobody invented SPI. But this assumption seems pretty realistic? (At least if humans are involved. I agree this might be less of an issue in the AGI-with-DSA scenario.)
I think I agree with your claims about committing, AI designs, and self-modifying into two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:
(1) I am not claiming that SPI will never work as intended (ie, get adopted, don't change players' strategies, don't change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others.
(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some examples of settings are described in the "Embedding in a Larger Setting" section.) Importantly, this is much more about which real-world setting you apply SPI to than about the prediction being too sensitive to how you formalize things.
(3) Because of (2), I think this is a really tricky topic to talk about informally. I think it might be best to separately ask (a) Given some specific formalization of the meta-setting, what will the introduction of SPI do? and (b) Is it formalizing a reasonable real-world situation (and is the formalization appropriate)?
(4) An IMO important real-world situation is when you -- a human or an institution -- employ AI (or multiple AIs) which repeatedly interact on your behalf, in a world where a bunch of other humans or institutions are doing the same. The details really matter here but I personally expect this to behave similarly to the "Evolutionary dynamics" example described in the report. Informally speaking, once SPI is getting widely adopted, you will have incentives to modify your AIs to be more aggressive / lean towards using AIs that are more aggressive. And even if you resist those incentives, you/your institution will get outcompeted by those who use more aggressive AIs. This will result in SPI not working as intended.
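As a rough illustration of the "outcompeted" step in (4), here is a toy Python sketch. It is not the model from the report: it assumes a fixed bandit population in which a fraction Ɛ still demands armed, reuses the caravan payoffs 9(1-Ɛ) - 2Ɛ (always resist) and 4(1-Ɛ) + 3Ɛ (always give in) quoted elsewhere in this thread, and applies a standard discrete replicator update to the share of resisting caravans. All names are hypothetical.

```python
def caravan_payoffs(eps: float) -> tuple[float, float]:
    """Return (always-resist payoff, always-give-in payoff) against fixed bandits."""
    resist = 9 * (1 - eps) - 2 * eps
    give_in = 4 * (1 - eps) + 3 * eps
    return resist, give_in


def replicator_step(share_resist: float, eps: float) -> float:
    """One discrete replicator update; assumes both payoffs are positive."""
    f_resist, f_give_in = caravan_payoffs(eps)
    mean_payoff = share_resist * f_resist + (1 - share_resist) * f_give_in
    return share_resist * f_resist / mean_payoff


def simulate(share_resist: float = 0.1, eps: float = 0.05, steps: int = 50) -> list[float]:
    """Track the share of resisting caravans over time (bandits held fixed)."""
    history = [share_resist]
    for _ in range(steps):
        share_resist = replicator_step(share_resist, eps)
        history.append(share_resist)
    return history


if __name__ == "__main__":
    print(simulate()[-1])
```

With small Ɛ the resisting share grows toward 1 in this toy model. It deliberately leaves out the bandits' reaction to that shift, which is the part of the argument where SPI stops working as intended.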
I originally thought the section "Illustrating our Main Objection: Unrealistic Framing" should suffice to explain all this, but apparently it isn't as clear as I thought it was. Nevertheless, it might perhaps be helpful to read it with this example in mind?
Humans and human institutions can't easily make credible commitments.
That seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of state who threatens suicidal war unless X happens, and they are stubborn enough to follow through on it. Then if X doesn't happen, you might get a coup instead, thus avoiding the war.
It seems like your main objection arises because you view SPI as an agreement between the two players.
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).
In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular guns. And suppose this incentivizes the bandits to avoid regular guns. Then you are incentivized to self-modify to start resisting more. (EG, if you both use CDT and the "logical time" is "self modify?" --> "credibly commit?" --> "use nerf?" .) However, if the bandits realize this --- i.e., if the "logical time" is "use nerf?" --> "self modify?" --> "credibly commit?" --- then the bandits will want to not use nerf guns, forcing you to not self-modify. And if you each think that you are "logically before" the other party, you will make incompatible commitments (use regular guns & self-modify to resist) and people get shot with regular guns.
So, I agree that credible unilateral commitments can be useful and they can lead to guaranteed Pareto improvements. It's just that I don't think it addresses my main objection against the proposal.
So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.
Yup, I fully agree.
Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.
the tl;dr version of the full report:
The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purpose of this proposal is not to become more resistant to threats. Rather, we hope that if the agent and the threatener misjudge each other (underestimating the commitment to ignore/carry out the threat), the outcome (Ignore threat, Carry out threat) will be replaced by something harmless.
Safe Pareto improvements (SPIs) [OC] is a formalization and a generalization of this idea. In the straightforward interpretation, the approach applies to a situation where an agent delegates a problem to their representative --- but we could also imagine “delegating” to a future self-modified version of oneself. The idea is to give instructions like “if other agents you encounter also have this same instruction, replace ‘real’ conflicts with them with ‘mock’ conflicts, in which you all act like you would in a real conflict, but bad outcomes are replaced by less harmful variants”. Some potential examples would be fighting (should a fight arise) to the first blood instead of to the death or threatening diplomatic conflict instead of a military one. In the SG setting, SPI could correspond to a joint commitment to only use water-gun threats while behaving as if the water gun was real.
First, I have one “methodological” observation: If we are to understand whether these approaches “work”, it is not sufficient to look just at the threat situation itself. Instead, we need to embed it in a larger context. (Is the interaction one-time-only, or will there be repetitions? Did the agents have a chance to update on the fact that SG/SPI are being used? What do the agents know about each other?) In the report, I show that different settings can lead to (very) different outcomes.
Incidentally, I think this observation applies to most technical research. We should specify the larger context, at least informally, such that we can answer questions like “but will it help?” and “wouldn’t this other approach be even better?”.
Second, SG and SPI seem to have one limitation that seems hard to overcome. Both SG and SPI hope to make agents treat conflict as seriously as before while in fact making conflict outcomes less costly. However, if SG/SPI starts “being a thing”, agents will have incentives to treat conflict as less costly. (EG, ignoring threats they might have taken seriously before or making threats they wouldn’t dare to make otherwise.) The report gives an example of an “evolutionary” setting that illustrates this well, imo.
Finally: (1) I believe that the hypothesis that “we ‘only’ need to work out the details of SG/SPI, and then we will avoid the realization of most threats” is false. (2) Instead, I would expect that the approach “adds up to normality” --- that with SG/SPI, we can do mostly the same things that we could do with “just” the ability to make contracts and being somewhat transparent to each other. (3) However, I still think that studying the approach is useful --- it is a way of formalizing things, it will work in some settings, and even when it doesn’t, we might get ideas for which other things could work instead.