Caspar Oesterheld


>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it.

Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument.

So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor's conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don't need to specify your utility function.

One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome -- the one obtained after choosing the recommended action -- and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn't this much harder? (If you're doing expected utility maximization -- doesn't your approach entail assigning probabilities to all hypothetical outcomes?) In realistic settings, these outcome distributions are huge objects!

The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome distributions.

Let's say that you have some arbitrary (potentially wacky discontinuous) function V that maps a distributions over outcomes onto a real value representing how much you like the distribution over outcomes. Then you can do zero-sum competition as normal and select the action for which V is highest (as usual with "optimism bias", i.e., if the two predictors make different predictions for an action a, then take the maximum of the Vs of the two actions). This should still be incentive compatible and result in taking the action that is best in terms of V applied to the predictors' belief.

(Of course, one could have even crazier preferences. For example, one's preferences could just be a function that takes as input a set of distributions and selects one distribution as its favorite. But I think if this preference function is intransitive, doesn't satisfy independence of irrelevant alternatives and the like, it's not so clear whether the proposed approach still works. For example, you might be able to slightly misreport some option that will not be taken anyway in such a way as to ensure that the decision maker ends up taking a different action. I don't think this is ever strictly incentivized. But it's not strictly disincentivized to do this.)

Interestingly, if V is a strictly convex function over outcome distributions (why would it be? I don't know!), then you can strictly incentivize a single predictor to report the best action and honestly report the full distribution over outcomes for that action! Simply use the scoring rule , where  is the reported distribution for the recommended action,  is the true distribution of the recommended action and  is a subderivative of . Because a proper scoring rule is used, the expert will be incentivized to report  and thus gets a score of , where  is the distribution of the recommended action. So it will recommend the action  whose associate distribution maximizes . It's easy to show that if  -- the function saying how much you like different distribution -- is not strictly convex, then you can't construct such a scoring rule. If I recall correctly, these facts are also pointed out in one of the papers by Chen et al. on this topic.

I don't find this very important, because I find expected utility maximization w.r.t. the predictors' beliefs much more plausible than anything else. But if nothing else, this difference further shows that the proposed method is fundamentally different and more capable in some ways than other methods (like the one I proposed in my comment).

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like that the one proposed by Hanson, with I think two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsidizing would mean scoring relative to some initial prediction $p_0$ provided by the market. Because the initial prediction might differ in how bad it is for different actions, the predictors might prefer a particular action to be taken. Conversely, the predictors might have no incentive to correct an overly optimistic prediction for one of the actions if doing so causes that action not to be taken. The examples in Section 3.2 of the Othman and Sandholm paper show these things.
- The second is "optimism bias" (a good thing in this context): "If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one." (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don't have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant's post "Two Major Obstacles for Logical Inductor Decision Theory": One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you're still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).

My work on decision scoring rules describes the best you can get out of a single predictor. Basically you can incentivize a single predictor to tell you what the best action is and what the expected utility of that action is, but nothing more (aside from some degenerate cases).

Your result shows that if you have two predictors with the same information, then you can get slightly more: you can incentivize them to tell you what the best action is and what the full distribution over outcomes will be if you take the action.

You also get some other stuff (as you describe starting from the sentence, "Additionally, there is a bound on how inaccurate..."). But these other things seem much less important. (You also say: "while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them." But the same is true of decision scoring rules for example.)

Here's one thing that is a bit unclear to me, though.

If you have two predictors that have the same information, there's other, more obvious stuff you can do. For example, here's one:
- Ask Predictor 1 for a recommendation for what to do.
- Ask Predictor 2 for a prediction over outcomes conditional on Predictor 1's recommendation.
- Take the action recommended by Predictor 1.
- Observe an outcome o with a utility u(o).
- Pay Predictor 1 in proportion to u(o).
- Pay Predictor 2 according to a proper scoring rule.

In essence, this is just splitting the task into two: There's the issue of making the best possible choice and there's the issue of predicting what will happen. We assign Predictor 1 to the first and Predictor 2 to the second problem. For each of these problems separately, we know what to do (use proper (decision) scoring rules). So we can solve the overall problem.

So this mechanism also gets you an honest prediction and an honest recommendation for what to do. In fact, one advantage of this approach is that honesty is maintained even if the Predictors 1 and 2 have _different_ information/beliefs! (You don't get any information aggregation with this (though see below). But your approach doesn't have any information aggregation either.)

As per the decision scoring rules paper, you could additionally ask Predictor 1 for an estimate of the expected utility you will obtain. You can also let the Predictor 2 look at Predictor 1's prediction (or perhaps even score Predictor 2 relative to Predictor 1's prediction). (This way you'd get some information aggregation.) (You can also let Predictor 1 look at Predictor 2's predictions if Predictor 2 starts out by making conditional predictions before Predictor 1 gives a recommendation. This gets more tricky because now Predictor 2 will want to mislead Predictor 1.)

I think your proposal for what to do instead of the above is very interesting and I'm glad that we now know that this method exists that that it works. It seems fundamentally different and it seems plausible that this insight will be very useful. But is there some concrete advantage of zero-sum conditional prediction over the above method?

Minor bibliographical note: A related academic paper is Arif Ahmed's unpublished paper, "Sequential Choice and the Agent's Perspective". (This is from memory -- I read that paper a few years ago.)

Nice post!

What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans? 

Perhaps relatedly:

>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned into a bomb?" and the AI says, "No, it's safe for such and such reasons", then if you can't evaluate these reasons, it doesn't help you that you have asked the right question.

Sounds interesting! Are you going to post the reading list somewhere once it is completed?

(Sorry for self-promotion in the below!)

I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.

Here's a pitch in the language of incentivizing AI systems -- the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:
1) It makes predictions about the world.
2) It takes actions that influence the world. (In the paper, we specifically imagine that the agent makes recommendations to a principal who then takes the recommended action.) Note that if the predictions are seen by humanity, they themselves influence the world. So even a pure oracle AI might satisfy 2, as has been discussed before (see end of this comment).
We want to design a reward system for this agent such the agent maximizes its reward by making accurate predictions and taking actions that maximize our, the principals', utility.

The challenge is that if we reward the accuracy of the agent's predictions, we may set an incentive on the agent to make the world more predictable, which will generally not be aligned without mazimizing our utility.

So how can we properly incentivize the agent? The paper provides a full and very simple characterization of such incentive schemes, which we call proper decision scoring rules:

We show that proper decision scoring rules cannot give the [agent] strict incentives to report any properties of the outcome distribution [...] other than its expected utility. Intuitively, rewarding the [agent] for getting anything else about the distribution right will make him [take] actions whose outcome is easy to predict as opposed to actions with high expected utility [for the principal]. Hence, the [agent's] reward can depend only on the reported expected utility for the recommended action. [...] we then obtain four characterizations of proper decision scoring rules, two of which are analogous to existing results on proper affine scoring [...]. One of the [...] characterizations [...] has an especially intuitive interpretation in economic contexts: the principal offers shares in her project to the [agent] at some pricing schedule. The price schedule does not depend on the action chosen. Thus, given the chosen action, the [agent] is incentivized to buy shares up to the point where the price of a share exceeds the expected value of the share, thereby revealing the principal's expected utility. Moreover, once the [agent] has some positive share in the principal's utility, it will be (strictly) incentivized to [take] an optimal action.

Also see Johannes Treutlein's post on "Training goals for large language models", which also discusses some of the above results among other things that seem like they might be a good fit for the reading group, e.g., Armstrong and O'Rourke's work.

My motivation for working on this was to address issues of decision making under logical uncertainty. For this I drew inspiration from the fact that Garrabrant et al.'s work on logical induction is also inspired by market design ideas (specifically prediction markets).

Cool that this is (hopefully) being done! I have had this on my reading list for a while and since this is about the kind of problems I also spend much time thinking about, I definitely have to understand it better at some point. I guess I can snooze it for a bit now. :P Some suggestions:

Maybe someone could write an FAQ page? Also, a somewhat generic idea is to write something that is more example based, perhaps even something that just solely gives examples. Part of why I suggest these two is that I think they can be written relatively mechanically and therefore wouldn't take that much time and insight to write. Also, maybe Vanessa or Alex could also record a talk? (Typically one explains things differently in talks/on a whiteboard and some people claim that one generally does so better than in writing.)

I think for me the kind of writeup that would have been most helpful (and maybe still is) would be some relatively short (5-15 pages), clean, self-contained article that communicates the main insight(s), perhaps at the cost of losing generality and leaving some things informal. So somewhere in between the original intro post / the content in the AXRP episode / Rohin's summary (all of which explain the main idea but are very informal) and the actual sequence (which seems to require wading through a lot of intrinsically not that interesting things before getting to the juicy bits). I don't know to what extent this is feasible, given that I haven't read any of the technical parts yet. (Of course, a lot of projects have this presentation problem, but I think usually there's some way to address this. E.g., compare the logical induction paper, which probably has a number of important technical aspects that I still don't understand or forgot at this point. But where by making a lot of things a bit informal, the main idea can be grasped from the short version, or from a talk.)

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post).

On this view, it doesn't help substantially to understand / analyze SPIs formally.

But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem do matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)

My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X. Here are three instances of this (I think these are the only three), the first one being most significant.

The main objection of the post is that while adopting an SPI, the original players must keep a bunch of things (at least approximately) constant(/analogous to the no-SPI counterfactual) even when they have an incentive to change that thing, and they need to do this credibly (or, rather, make it credible that they aren't making any changes). You argue that this is often unrealistic. Well, the initial reaction of mine was: "Sure, I know these things!" (Relatedly: while I like the bandit v caravan example, this point can also be illustrated with any of the existing examples of SPIs and surrogate goals.) I also don't think the assumption is that unrealistic. It seems that one substantial part of your complaint is that besides instructing the representative/self-modifying the original player/principal can do other things about the threat (like advocating a ban on real or water guns). I agree that this is important. If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway. But it's easy to come up scenarios where this is not a problem. E.g., when an agent considers immediate self-modification, *all* her future decisions will be guided by the modified u.f. Or when the SPI is applied to some isolated interaction. When all is in the representative's hand, we only need to ensure that the *representative* always acts in whatever way the representative acts in the same way it would act in a world where SPIs aren't a thing.

And I don't think it's that difficult to come up with situations in which the latter thing can be comfortably achieved. Here is one scenario. Imagine the two of us play a particular game G with SPI G'. The way in which we play this is that we both send a lawyer to a meeting and then the lawyers play the game in some way. Then we could could mutually commit (by contract) to pay our lawyers in proportion to the utilities they obtain in G' (and to not make any additional payments to them). The lawyers at this point may know exactly what's going on (that we don't really care about water guns, and so on) -- but they are still incentivized to play the SPI game G' to the best of their ability. You might even beg your lawyer to never give in (or the like), but the lawyer is incentivized to ignore such pleas. (Obviously, there could still be various complications. If you hire the lawyer only for this specific interaction and you know how aggressive/hawkish different lawyers are (in terms of how they negotiate), you might be inclined to hire a more aggressive one with the SPI. But you might hire the lawyer you usually hire. And in practice I doubt that it'd be easy to figure out how hawkish different lawyers are.

Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic). I don't remember Tobi's posts very well, but our paper definitely doesn't spend much space on discussing these important questions.

On SPI selection, I think the point from Section 10 of our paper is quite important, especially in the kinds of games that inspired the creation of surrogate goals in the first place. I agree that in some games, the SPI selection problem is no easier than the equilibrium selection problem in the base game. But there are games where it does fundamentally change things because *any* SPI that cannot further be Pareto-improved upon drastically increases your utility from one of the outcomes.

Re the "Bargaining in SPI" section: For one, the proposal in Section 9 of our paper can still be used to eliminate the zeroes!

Also, the "Bargaining in SPI" and "SPI Selection" sections to me don't really seem like "objections". They are limitations. (In a similar way as "the small pox vaccine doesn't cure cancer" is useful info but not an objection to the small pox vaccine.)

Load More