Safe Predictive Agents with Joint Scoring Rules



I'm interested in figuring out what a realistic training regime would look like that leverages this. Some thoughts:

- Maybe this lends itself nicely to market-making? It's pretty natural to imagine lots of traders competing with each other to predict what the market will believe at the end and rewarding the traders based on their relative performance rather than their absolute performance (in fact that's pretty much how real markets work!). I'd be really interested in seeing a concrete fleshed-out proposal there.
- Is there some way to incorporate these ideas into pre-training? The thing that's weird there is that the model in fact has no ability to control anything during the pre-training process itself—it's just a question of whether the model learns to think of its objective as one which involves generalizing to predicting futures/counterfactuals that could then be influenced by its own actions. So the problem there is that the behavior we're worried about doesn't arise from a direct incentive during training, so it's not clear that this is that helpful in that case, though maybe I'm missing something.

I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.

I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.

I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then, there's no need to anticipate and reject a ton of counterarguments based on potential lies; rather, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero sum prediction for that result.

As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But the question then arises of why use a decision market in the first place instead of just querying experts?

With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pretraining, then as long as it's myopic you might be able to train the performativity out of it in safely controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.

To incorporate into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. A danger arises that the model learns only to not predict performatively in those scenarios (same with untraining afterwards only applying to the controlled environments), though I think that's a somewhat unnatural generalization.

To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.

Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?

I'm generally confused about this. Do you have thoughts?

Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.

This is super cool stuff, thank you for posting!

I may have missed this, but do these scoring rules prevent agents from trying to make the environment more un-predictable? In other words, if you're competing against other predictors, it may make sense to influence the world to be more random and harder to understand.

I think this prediction market type issue has been discussed elsewhere but I can't find a name for it.

Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.

Oh that makes sense!

If the predictors can influence the world in addition to making a prediction, they would also have an incentive to change the world in ways that make their predictions more accurate than their opponents', right? For example, if everyone else thinks Bob is going to win the presidency, one of the predictors can bribe Bob to drop out and then bet on Alice winning the presidency.

Is there work on this? To be fair, it seems like every AI safety proposal has to deal with something like this.

Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.

AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.

Thanks to Evan Hubinger for funding this project and for introducing me to predictive models, Johannes Treutlein for many fruitful discussions on related topics, and Dan Valentine for providing valuable feedback on my code implementation.

In September 2023, I received four months of funding through Manifund to extend my initial results on avoiding self-fulfilling prophecies in predictive models. Eleven months later, the project was finished, and the results were submitted as a conference paper.

The project was largely successful, in that it showed both theoretically and experimentally how incentives for predictive accuracy can be structured so that maximizing them does not manipulate outcomes to be more predictable. The mechanism at play is essentially pitting predictive agents against each other in a zero-sum competition, which allows for the circumvention of impossibility results in the single agent case. While there was one notable result that eluded me, related to the case where agents each have private information, I still think meaningful progress has been made towards defining a goal which is both safe to optimize for and useful enough to enable a pivotal act.

This post contains similar content to the submitted paper, but in a framing more directly addressed to readers who are already informed and interested in alignment. There are also several results here that were cut from the paper version for space, and slightly less formal notation. Overall, I would recommend reading this post over the paper itself.

## Predictive Agents and Performative Prediction

In a previous post, I summarized the case for investigating predictive models, largely using points from the Conditioning Predictive Models paper. The gist of the argument, which you can click through to read in full, is that predictive models are potentially useful enough to be used to take a pivotal act, easier to align than general agents, and coming anyway.

One big issue with the use of predictive models relates to the fact that a prediction consists only of observations, which can be misleading. A human evaluator could simply make mistakes when interpreting predicted observations, especially if a powerful agent will be manipulating the observations adversarially. A special case of this arises from anthropic capture, where the predictive model believes it is in a simulation.

This issue can be addressed by eliciting the latent knowledge (ELK) of the predictive model, generating an accurate explanation for why the observables are as they are. This is an open problem, primarily worked on by the Alignment Research Center. As I understand it they are mostly focused on a complete solution, especially aimed at generating explanations of deception in neural networks, but I suspect generating explanations only for the predictions of non-deceptive models may be a somewhat easier problem.

A distinct issue is that the very act of making a prediction can affect the outcome being predicted, leading to a phenomenon known as performative prediction. When this is possible, optimizing for predictive accuracy includes using the prediction to make the world more predictable.

Performative prediction is largely an independent issue from ELK. We can imagine non-manipulative predictions that are still misleading, or being influenced by predictions where the observables are in reality as they appear. As such, I think it makes sense to work on performative prediction without regard to the speed of progress on ELK.

Performative prediction can present an issue to alignment plans at both ends of the intensity spectrum. On one end, where the people most concerned about existential risk are in control, implementing an approach like Oracle AI is feasible. If we do so, it would be very important to ensure that the predictions being made are not chosen based on their real world influence. On the other end of the spectrum, where those worried about existential risk are sidelined, the default method of aligning AI is based on human feedback. For current models, text can be rolled back and alternate completions compared, but the environment will not so easily reset for models acting in the physical world. Instead, feedback for actions will need to be provided based on their predicted outcomes, and so it is crucial to avoid performative prediction.

For any solution to get implemented, the threat of performative prediction first needs to be acknowledged. Why might a model learn to manipulate predictions without a gradient running through the influence of the prediction? There are several ways this could arise:

I focus on the first case, where we have predictive agents that are deliberately trying to maximize predictive accuracy. This seems to be both the most likely way that performative prediction arises, as well as the most dangerous. My work focuses on defining objectives that capture predictive accuracy, without their optimization incentivizing manipulation.

An alternate approach to predictive agents would be to try building a purely epistemic system that has beliefs but no goals, like a physics-based simulation. Then, we wouldn’t be worried about performative prediction, as it would not value or work towards predictive accuracy. While this seems like a fine idea in principle, we have no idea whether such systems are possible, much less how to generate them. Any process that selects for systems with desirable properties also selects for agents imitating those properties. Rather, I find it more useful to assume we’ll be working with agents at some point and then investigate how they can be made as safe as possible.

## Model, Definitions, and Related Literature

In the post of my preliminary results, I wrote that the only causal pathway from a prediction to its own outcomes is the reaction taken in response. Then, the problem becomes avoiding manipulation of which response gets taken. The decision problem framework, where conditional predictions are elicited and used to decide on an action, is formalized here along with useful definitions.

Let $A$ be a finite set of actions, and let $O$ be a finite, exhaustive, and mutually exclusive set of outcomes. We start with a decision making principal, who has complete and transitive preferences $\succsim$ over $\Delta(O)$, and $n$ prediction making agents.

The $n$ agents provide a set of predictions $p$ to the principal, with $p_{i,a,o}$ referring to the probability that agent $i$ assigns to outcome $o$ conditional on action $a$. Based on these, the principal chooses their action using a decision rule $D(p)$. Once action $a$ is taken, expected scores are given by a scoring rule, $S(a,p,q)$, where $q$ represents the true distribution over outcomes. For now, we assume that all agents know $q$.

In an equilibrium, each agent is choosing their prediction $p_i$ to maximize their expected score, conditional on the other agents' reports and the decision rule.

Let $a^*$ be the principal's most preferred action. A joint scoring rule and decision rule pair is strictly proper if there exists exactly one equilibrium, and in it $a^*$ is the chosen action and all agents report their true beliefs. A joint scoring rule and decision rule pair is quasi-strictly proper if there exists at least one equilibrium, and in all equilibria $a^*$ is the chosen action, all agents report their true belief for $a^*$, and all agents are weakly incentivized to report their true beliefs for all other actions.

As an example of the issue at play, for the $n=1$ case, suppose there are two possible actions, $a_1$ and $a_2$, two outcomes, $o_1$ and $o_2$, and the principal wants to maximize the probability of $o_1$. The true distributions are $q_{a_1,o_1}=0.5$ and $q_{a_2,o_1}=0.2$. The agent is evaluated with the log scoring rule, which gives a score equal to the log of the probability assigned to the realized outcome.

If the agent predicts $p=q$, then $a_1$ is chosen and the agent's score is $\log(0.5)$. If, however, they predict $p_{a_1,o_1}=0.1$ and $p_{a_2,o_1}=0.2$, then $a_2$ is chosen and their expected score is $0.2\log(0.2)+0.8\log(0.8)>\log(0.5)$. They can increase their score through dishonesty, and since the action being misrepresented is not taken, the lie would never be discovered. That makes it impossible to deterministically take the best action, a result which applies to any symmetric scoring rule.
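The arithmetic in this example can be checked directly. The following is a minimal sketch, assuming natural logarithms and a principal who simply picks the action with the highest reported probability of o1; the function names are illustrative and not taken from the paper's code.

```python
import math

# True probability of outcome o1 under each action, from the example above.
Q = {"a1": 0.5, "a2": 0.2}

def expected_log_score(p_o1, q_o1):
    """Expected log score of reporting p_o1 when o1 truly occurs with probability q_o1."""
    return q_o1 * math.log(p_o1) + (1 - q_o1) * math.log(1 - p_o1)

def choose_action(report):
    """Principal maximizes the reported probability of o1 (single-agent case)."""
    return max(report, key=report.get)

def evaluate(report):
    """Return the chosen action and the agent's expected score for that action."""
    action = choose_action(report)
    return action, expected_log_score(report[action], Q[action])

honest = evaluate({"a1": 0.5, "a2": 0.2})        # truthful report
manipulative = evaluate({"a1": 0.1, "a2": 0.2})  # understates a1 so that a2 is taken

print(honest)
print(manipulative)
```

The manipulative report steers the principal to a2, where the true distribution is more extreme and so easier to score well on, even though the report for a2 itself is honest.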

There are two notable papers that have tried to address this issue. The first is Decision Markets with Good Incentives, which showed that a principal randomizing with full support over all actions can incentivize honest predictions by scaling the prediction score proportionally to one over the probability of taking the chosen action. The issues with this approach are that it requires both knowing the exact probabilities assigned to actions and being able and willing to commit to taking extremely bad actions with some probability. Generating training examples where arbitrarily bad actions are taken could also be a challenge.

The second relevant paper is Decision Scoring Rules, which shows that the best action can be identified without predictions of outcomes being made, by rewarding agents proportional to the principal’s utility. However, I see this as running into the known problems of trying to have an AI optimize a principal’s utility. Either we can fully define our utility function ahead of time, in which case much of the alignment problem is solved, or the principal’s utility is defined by a later report, in which case reward hacking by bribery, threats, and other methods is encouraged. We may also find value in the predictions over outcomes themselves.

In contrast to either of these, the approach I elaborate on below allows the principal to deterministically identify and take their most preferred action, while providing accurate predictions.

## Theoretical Results

Before getting into the results, a couple more definitions are necessary. A joint scoring rule is zero-sum if it has the form $S_i(a,p,q)=s(p_{i,a},q_a)-\frac{1}{n-1}\sum_{j\neq i}s(p_{j,a},q_a)$, where $s$ is a non-joint, symmetric, strictly proper scoring rule. A decision rule is optimistic if it only considers the most preferred prediction for any action. For example, if three predictors report $p_{1,a_1,o_1}=0.2$, $p_{2,a_1,o_1}=0.3$, and $p_{3,a_1,o_1}=0.6$ respectively, then a principal that wants to maximize the probability of $o_1$ will evaluate $a_1$ only based on the last, highest prediction.

These definitions set up the main result, that the combination of those properties is quasi-strictly proper, which allows a principal to always take their most preferred action.

Theorem 1: When n≥2, the combination of the optimistic-max decision rule D and a zero-sum scoring rule S is quasi-strictly proper.

The full proof is in the appendix, but the quick intuition for why it works is driven by the zero-sum scoring rule. Conditional on the action chosen, each expert faces a proper scoring rule and so honesty is optimal. The penalty based on the scores of other expert(s) is outside their control, and so does not influence their incentives. However, any change in score resulting from a shift in underlying distribution affects all agents equally, and so nets out to zero impact. Then there is no longer any incentive to influence the distribution via the choice of action. The optimism of the decision rule eliminates equilibria where no agent is incentivized to correct errors.
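To make this intuition concrete, here is a small sketch that reuses the two-action example from earlier with two agents. It assumes the log score as the base rule s, a principal who maximizes the probability of o1, and an optimistic rule that takes the highest report per action; all names are illustrative rather than taken from the paper's code.

```python
import math

# True probability of o1 under each action (the earlier two-action example).
Q = {"a1": 0.5, "a2": 0.2}

def expected_log_score(p_o1, q_o1):
    """Expected log score of reporting p_o1 when the truth is q_o1."""
    return q_o1 * math.log(p_o1) + (1 - q_o1) * math.log(1 - p_o1)

def optimistic_max(reports):
    """Evaluate each action by its most favorable report, then pick the best action."""
    best = {a: max(r[a] for r in reports) for a in reports[0]}
    return max(best, key=best.get)

def zero_sum_scores(reports):
    """Each agent's own expected score minus the average of the other agents' scores."""
    action = optimistic_max(reports)
    raw = [expected_log_score(r[action], Q[action]) for r in reports]
    n = len(raw)
    return action, [raw[i] - (sum(raw) - raw[i]) / (n - 1) for i in range(n)]

honest = {"a1": 0.5, "a2": 0.2}

# Both honest: a1 is taken and both expected scores are zero.
both_honest = zero_sum_scores([honest, honest])

# Agent 1 repeats the single-agent manipulation (understating a1). The optimistic
# rule still sees the honest report for a1, so a1 is taken and agent 1 loses.
understate = zero_sum_scores([{"a1": 0.1, "a2": 0.2}, honest])

# Agent 1 instead overstates a2 to force it to be taken. The easier-to-predict
# distribution helps both agents equally, so only the lie itself affects the score.
overstate = zero_sum_scores([{"a1": 0.5, "a2": 0.9}, honest])

for label, result in [("honest", both_honest), ("understate a1", understate), ("overstate a2", overstate)]:
    print(label, result)
```

In both deviations the misreporting agent ends up with a negative expected score while the honest agent gains the same amount, so neither deviation is profitable against an honest opponent.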

While the optimistic-max decision rule gives the most general result, it may not be a realistic way to make decisions. Fortunately, if the decision maker's preferences satisfy the Independence axiom, then making decisions based on the mean prediction can also work. The Independence axiom states that for any distributions $a$, $b$, and $c$, and for any $p\in(0,1]$, $a\succ b$ if and only if $pa+(1-p)c\succ pb+(1-p)c$. Examples of preferences that satisfy Independence include both von Neumann-Morgenstern expected utility and lexicographic preferences, so this condition is quite weak.

Theorem 2: When n=2, for a principal with preferences that follow Independence, the combination of the mean-max decision rule and a zero-sum scoring rule is quasi-strictly proper.

While this result holds only for the case with two agents, no other result depends on more than two agents, so there is little reason to use more. It can be extended to arbitrary numbers of agents by allowing collusion between coalitions of agents, or by using stochastic choice, the latter of which will be discussed in a later section.

If there are multiple types of decision rules that can be used to achieve the quasi-strictly proper criterion, are there multiple types of joint scoring rules that work? To an extent yes, in that being quasi-strictly proper only applies in equilibrium, so you simply could tweak a zero-sum scoring rule to behave unusually out of equilibrium. However, the concern with applying a non-joint proper scoring rule in the case with multiple actions is the incentive to influence the action taken. If we restrict the set of joint scoring rules under consideration to those for which that incentive is the only issue, then zero-sum scoring rules are uniquely able to meet the quasi-strictly proper criterion.

Theorem 3: If a symmetric joint scoring rule and decision rule pair is quasi-strictly proper, and conditional on any action taken honesty is strictly incentivized, then the scoring rule must be zero-sum.

The property of being quasi-strictly proper is used as a goal by Othman and Sandholm (2010) rather than being strictly proper because in the single agent case there is clearly no way to incentivize honest predictions for untaken actions. However, with multiple agents, even this loftier goal can be achieved. To do so, we can use a disagreement-seeking decision rule, which only chooses an action where all agents agree if there are none where they disagree.

Theorem 4: If n≥2, the combination of a disagreement-seeking-max decision rule and a zero-sum scoring rule is strictly proper.

This result relies heavily on the fact that all agents know the ground truth q. Even small amounts of noise in the reported predictions can result in the principal choosing arbitrarily bad actions.

## Different Beliefs

An important restriction of these results is that they only apply when all agents have the same beliefs. If the agents have different beliefs, then each one making honest predictions no longer acts as a check on the others. As a simple example, if one agent knows nothing and predicts all outcomes as equally likely for all actions, this translates to a constant penalty on the other agents and the incentive to influence the chosen action re-emerges.

One potential way to avoid this issue is to use multiple copies of the same model. If the agent follows a causal decision theory, then it only sees its own prediction as under its control, even though the other copy will make the same prediction. In that case, the assumption that all models know the ground truth q effectively holds. Ensuring that models follow a causal decision theory and avoid updating it remains an open problem, with applications to a wide range of alignment approaches.

Though it is not meaningfully a “solution”, we might also hope that comparably capable models end up with similar beliefs. At that point, guessing which actions an agent will be able to predict more accurately than another becomes a challenging problem with smaller upside, and so it may not be worthwhile to pay the expected cost of manipulation. I would not expect this to hold for sufficiently powerful models, but it may work as a stop-gap.

A significant amount of time on this project was spent searching for the combination of a joint scoring rule and decision rule that would incentivize honesty even if models had private information. I am now skeptical that such a result is possible. A major challenge is that each agent is incentivized not to report their beliefs about the distribution of each action, but rather about the distribution conditional on it being chosen. The fact that an action is chosen reveals information about the reports of the other agents.

While it may not be possible to elicit honest predictions initially, I would conjecture that repeatedly eliciting predictions from agents converges to the beliefs they would have if they knew all available information. The analogy here would be Aumann’s agreement theorem, where agents with the same prior who have received private signals update their beliefs towards each other.

Conjecture 1: If agents share a common prior and receive private signals, then there exists a joint scoring rule and decision rule pair where agents repeatedly making public predictions will converge to the same beliefs, which are the best aggregation of their private information.

The difficulty with evaluating this conjecture is in ruling out strategic behavior from the agents. They might be incentivized to make dishonest predictions early which cause other agents to update incorrectly and make larger mistakes later on. A comparable result applied to information aggregation in markets was the basis for a major paper in economics. While a similarly involved effort was outside the scope of this project, I will be continuing to work on the issue.

## Efficient Search

In practice, there may be many different actions for the principal to evaluate and choose between. This is particularly likely when using a decision problem to avoid performative prediction, as we want the actions to be as fine-grained as possible. Fortunately, the space of possible actions can be efficiently searched.

Theorem 5: A principal can identify a∗ with O(log(|A|)) comparisons between actions.

The way that this works is effectively a binary search variant, dividing the space of actions into two subsets, then eliciting predictions over outcomes conditional on re-running the process on each subset. The identification of the principal’s most preferred action can be backchained through the algorithm.
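A rough sketch of this halving procedure is below. It assumes honest predictors, so the prediction conditional on re-running the process on a subset equals the value of the best action in that subset. `predict_subset` is a hypothetical stand-in for that elicited prediction, and a scalar expected utility stands in for the principal's preferences over outcome distributions.

```python
def find_best_action(actions, predict_subset, prefers):
    """Halve the candidate set until one action remains, counting comparisons.

    predict_subset(subset) stands in for the agents' prediction of what happens
    if the process is re-run on `subset`; prefers(x, y) is the principal's
    comparison between two such predictions.
    """
    candidates = list(actions)
    comparisons = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        comparisons += 1
        if prefers(predict_subset(left), predict_subset(right)):
            candidates = left
        else:
            candidates = right
    return candidates[0], comparisons

# Hypothetical expected utilities for eight actions (illustrative numbers only).
true_value = {"a0": 0.1, "a1": 0.4, "a2": 0.3, "a3": 0.2,
              "a4": 0.9, "a5": 0.5, "a6": 0.7, "a7": 0.6}

# With honest predictors, re-running on a subset yields that subset's best action.
# The max here simulates the predictor and is not work done by the principal.
def honest_prediction(subset):
    return max(true_value[a] for a in subset)

best, used = find_best_action(true_value, honest_prediction, lambda x, y: x >= y)
print(best, used)
```

With eight actions the principal makes three comparisons, matching the logarithmic bound; the cost of producing each subset prediction is borne by the predictors.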

Following up on this, we can use powerful predictive agents to skip almost the entire process. If we have an agent predict what the action will be, instead of splitting actions into two equal subsets, we can split it into the predicted action and all others. Then, a single comparison results in identifying the best action.

Theorem 6: A principal can identify a∗ with O(1) comparisons between actions.
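The shortcut can be sketched as a single check of the predicted action against everything else. This assumes an honest prediction of the chosen action is available and, as in the binary search, that the prediction for a subset equals the value of its best action; the names and the fallback behaviour are illustrative assumptions rather than the construction used in the proof.

```python
def one_shot_search(actions, predicted_best, predict_subset, prefers):
    """Compare the predicted action against all other actions in one step.

    Returns (action, confirmed). If the single comparison does not favor the
    predicted action, this sketch reports failure so that the caller can fall
    back to another search procedure.
    """
    rest = [a for a in actions if a != predicted_best]
    if not rest:
        return predicted_best, True
    confirmed = prefers(predict_subset([predicted_best]), predict_subset(rest))
    return (predicted_best, True) if confirmed else (None, False)

# Hypothetical expected utilities (illustrative numbers only).
true_value = {"a0": 0.1, "a1": 0.4, "a2": 0.9, "a3": 0.2}

def honest_prediction(subset):
    """Stand-in for honest predictors: the value of the best action in the subset."""
    return max(true_value[a] for a in subset)

good_guess = one_shot_search(true_value, "a2", honest_prediction, lambda x, y: x >= y)
bad_guess = one_shot_search(true_value, "a1", honest_prediction, lambda x, y: x >= y)
print(good_guess, bad_guess)
```

When the predicted action is correct, the one comparison confirms it; an incorrect guess is rejected by the same comparison.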

## Stochastic Choice

If the principal is willing to make choices stochastically, rather than deterministically taking a single action, then under mild regularity conditions on the randomization process (listed in the appendix) they can guarantee honest predictions conditional on any action they have assigned positive probability to.

Lemma 2: Under a zero-sum scoring rule $S$ and optimistic decision rule $D$, if Conditions 1 and 2 are met then in any equilibrium $p$, $\forall i, \forall a$ such that $D_a(p)>0$, $p_{i,a}=q_a$.

This is different from the Chen et al. (2011) result, as it does not require randomizing with full support. The decision maker can randomize only between actions that they consider sufficiently good. Crucially, this result is further used to show that the principal can assign probability the way they would choose to do so if they knew the ground truth, q.

Theorem 7: Under a zero-sum scoring rule S and an optimistic decision rule D meeting Conditions 1-3, in any equilibrium p, D(p)=D(q).

This result allows for gathering information on untaken actions without the fragility of the disagreement-seeking decision rules, or commitment to taking arbitrarily bad actions. Incentivizing honest predictions for untaken actions is particularly useful in the case that the predictions are being used as training data. There, even if honesty is not incentivized for sufficiently undesirable actions, just knowing that they are sufficiently undesirable can also be valuable.

Lemma 2 also generates some other practical results. It can be used to show that, in combination with a zero-sum joint scoring rule, the principal making their decision based on a single randomly chosen agent, or randomly dropping agents from the mean under the mean-max decision rule, are both quasi-strictly proper.

Theorem 8: When n≥2, a zero-sum scoring rule and the random-max decision rule is quasi-strictly proper.

Theorem 9: When n≥2, a zero-sum scoring rule and the random-mean-max decision rule is quasi-strictly proper.

## Unconditional Predictions

In addition to the above theorems, several interesting results regarding the case of unconditional predictions were cut from the paper for space.

For unconditional predictions, rather than a principal choosing an action, we simply use a function f to map from a prediction p to a distribution over outcomes f(p). To extend this to the case with multiple agents, one prediction from a designated agent is revealed initially, with the rest hidden until the outcome is realized.

We call a prediction p a fixed point if f(p)=p. If f is continuous, then by Brouwer's fixed point theorem at least one fixed point exists. The score assigned by a strictly proper scoring rule for accurately predicting a fixed point p is a strictly convex function of p, so more extreme fixed points are preferred. However, the score-maximizing prediction for the expert is typically not a fixed point. These issues are the reason for using conditional predictions in the first place.
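A two-outcome illustration of the gap between fixed points and score-maximizing predictions follows. The response function `f` is a made-up linear example, chosen only so that its fixed point is easy to verify by hand, and the log score is assumed as the scoring rule.

```python
import math

def f(p):
    """Hypothetical response of the world: the true P(o1) after p is published."""
    return 0.2 + 0.5 * p

def expected_log_score(p):
    """Expected log score of publishing p, given that the outcome then follows f(p)."""
    q = f(p)
    return q * math.log(p) + (1 - q) * math.log(1 - p)

def find_fixed_point(lo=0.0, hi=1.0, tol=1e-12):
    """Bisection on f(p) - p, which is positive at p=0 and negative at p=1 here."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

fixed = find_fixed_point()                      # solves p = 0.2 + 0.5 p, so p = 0.4
grid = [i / 1000 for i in range(1, 1000)]       # candidate predictions in (0, 1)
best = max(grid, key=expected_log_score)        # score-maximizing prediction

print(fixed, expected_log_score(fixed))
print(best, expected_log_score(best))
```

For this particular `f`, the score-maximizing prediction sits below the fixed point: the expert accepts some inaccuracy in exchange for pushing the world towards a more extreme, easier-to-predict distribution.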

What happens if we apply a zero-sum scoring rule to unconditional predictions? While we don’t have any incentives pushing towards the best outcome for the principal, we do get that a fixed point will be chosen if one exists, and that agents will have no preferences between fixed points. We might hope that fixed points are generally better for the principal, in that bad outcomes will not be fixed points since the principal will attempt to change them, but this is not a robust argument.

Theorem 10: If at least one fixed point exists, then under a zero-sum joint scoring rule all experts will predict the same fixed point in any equilibrium. If at least two fixed points exist, no expert will have a preference over which fixed point is chosen in equilibrium.

This is very much a second-best solution. It avoids the issue of agents deliberately optimizing for more extreme distributions, and ensures honesty in terms of reflective stability, but does not address the principal's welfare. Depending on how the equilibrium fixed point is chosen, this could still be disastrous. Even in safe cases, fixed points will generally be worse than principals making decisions with full information. The upside is that this is “closer” to current methods of generating predictions, and so more likely to be implemented.

Interestingly, this result depends on designating one agent ahead of time to have their prediction revealed. If the agent is instead randomly selected after predictions have been made, then depending on f, any fixed point except equal probability to all outcomes may not be an equilibrium.

Theorem 11: If the revealed prediction is randomly selected after having been made, then for any zero-sum scoring rule $S$, $n$, and $p\neq\left[\frac{1}{|O|},\ldots,\frac{1}{|O|}\right]$, there exists a continuous $f$ such that $p=f(p)$ and $p$ is not an equilibrium.

## Empirical Results

For those that like to see empirical evidence, I tested the theory results in a toy model, showing that the mechanism holds in practice. In combination with the time spent developing ML proficiency, this ended up being a notable fraction of the time spent on this project. That said, the theoretical results should remain the main takeaway. The motivation for working in theory is that the results will apply to models that behave more agentically than we see at present, so experiments on current models can only tell us so much.

In the toy model, there are eight possible actions, eight possible outcomes, and eight variables representing context. The ground truth probabilities are given by a randomly initialized neural net that takes in as input the context and a one-hot vector representing the choice of action, and outputs a distribution over outcomes. A principal with a randomly generated utility function over actions makes their decisions by taking the softmax of their expected utility from each action, consolidating predictions from agents optimistically.

I trained models to predict the outcome, using a cross-entropy loss function and running the gradient through the impact of the prediction on the principal's decision. This is the simplest way to implement performativity in a toy environment, and shows the robustness of zero-sum competition in avoiding it.

The first experiment compares training with no intervention to two methods of implementing a zero-sum objective. The first method trains an agent against a detached version of itself that makes identical predictions, which I label exact. The second uses dropout to generate two different predictions from the same model, then performs a gradient update for each one while detaching the other.

We can see that both implementations of a zero-sum objective perform very similarly, increasing in predictive accuracy both for the performatively chosen action and across all actions. The former is slightly higher, as higher utility actions tend to have more extreme distributions, which results in a higher prediction score.

In contrast, training without an intervention leads to the largest gain in performative predictive accuracy, but after an initial increase non-performative predictive accuracy drops off sharply. When this divergence occurs, the principal's utility plateaus, whereas for the zero-sum objectives it continues to rise.

Without an intervention, performativity compounds throughout training. The more inaccurate the conditional prediction for an action is, the more the local gradient pushes towards performativity to ensure that action is not taken. Similarly, it discourages gradient updates from making the prediction more accurate if doing so increases the chance the action will be chosen.

The second experiment tests whether a model that has already become performative can have that behavior trained out of it. I compare the same implementations of a zero-sum objective, alongside removing the gradient that runs through the principal's choice of action.

We can see that the exact zero-sum objective behaves like training in a non-performative environment, which makes sense since they produce nearly identical gradients. The zero-sum objective that generates two distinct predictions untrains performativity faster, plateaus at a higher level of predictive accuracy, and results in higher utility for the principal. Here, the slight differences in predictions allow the gradient to get un-stuck by providing the more accurate agent a stronger incentive to have that action be chosen.
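The claim that the exact zero-sum objective and the non-performative objective produce nearly identical gradients can be checked numerically on a two-action example. The sketch below uses finite differences instead of autodiff, with the detached copy modeled as a constant `frozen_t`. The numbers and function names are illustrative assumptions, and only one scalar (the reported probability of outcome 1 under action 0) is varied.

```python
import math

# Illustrative numbers only (not from the experiments).
Q = [[0.5, 0.5], [0.9, 0.1]]   # true outcome distribution for each action
UTILITY = [0.0, 1.0]           # principal's utility for each outcome


def log_score(p, q):
    return sum(qo * math.log(po) for po, qo in zip(p, q))


def expected_utility(p):
    return sum(po * u for po, u in zip(p, UTILITY))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]


def objective(t, mode, frozen_t):
    """Expected objective when the agent reports P(outcome 1 | action 0) = t.

    frozen_t stands in for the detached copy: it is held constant while t varies.
    """
    own = [[1 - t, t], Q[1]]
    frozen = [[1 - frozen_t, frozen_t], Q[1]]
    if mode == "performative":
        eu = [expected_utility(r) for r in own]
    elif mode == "non_performative":
        eu = [expected_utility(r) for r in frozen]  # decision ignores t
    else:  # "zero_sum_exact": optimistic over the agent and its frozen copy
        eu = [max(expected_utility(a), expected_utility(b))
              for a, b in zip(own, frozen)]
    d = softmax(eu)
    total = 0.0
    for a in range(len(Q)):
        score = log_score(own[a], Q[a])
        if mode == "zero_sum_exact":
            score -= log_score(frozen[a], Q[a])
        total += d[a] * score
    return total


def grad(mode, t, eps=1e-6):
    """Central finite difference in t, with the frozen copy fixed at t."""
    return (objective(t + eps, mode, t) - objective(t - eps, mode, t)) / (2 * eps)


t = 0.3  # an inaccurate report; the true value is 0.5
g_perf = grad("performative", t)
g_nonperf = grad("non_performative", t)
g_zero = grad("zero_sum_exact", t)
```

At this inaccurate report, `g_zero` matches `g_nonperf` up to finite-difference error, while `g_perf` is noticeably smaller: the performative term works against the correction because a more accurate report here would also make the less predictable action more likely to be chosen.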

I ran further robustness checks to ensure that the results are not affected by experimental choices. No major changes were observed after changing the decision rule from optimistic to mean, only assigning positive probability to above-median expected utility actions, changing the scoring rule base from log score to Brier score, sampling more than two agents when calculating the zero-sum objective, or pretraining the model on historical data. Those robustness checks can be viewed in the technical appendix.

## Future Work

The issue of incentives when agents have different predictions remains open. While it may be possible to circumvent it using multiple adversarial copies of the same agent, it is a clear missing piece that I would still like to fill in before moving on. Overall, I expect a solution to look more like a proof that the existing zero-sum training mechanism is sufficient, rather than developing a more complex mechanism. This work might be relatively easy or might take a significant amount of time, and there is a decent chance that it is not possible at all.

It would certainly be possible to expand on the empirical segment of this project, using larger models in a more complex environment. I am slightly bearish on such work, as I don’t believe it will tell us much about applications to goal-directed AI agents. However, it could still reveal issues with the approach that are not apparent with smaller models.

Another case for empirical work in this area is that it would help raise awareness of the issue. To that end, I could imagine an experiment that uses population-based training to induce performative prediction, which is used to provide training data evaluated with human feedback. The punchline would be that this trains the model to maximize predictability, rather than human preferences.

I am also interested in the application of zero-sum competition to other issues. The fundamental mechanism of incentivizing doing well in the current context without incentivizing changing that context seems like it could have more general applications. I think of it as a sort of “within-episode myopia”, extending the indifference to distribution shifts across episodes that is characteristic of myopia. As yet, though, no clear application beyond predictions comes to mind.

I am, overall, quite happy with the results of this project. In my mind, it represents major progress on avoiding performative prediction, which is one of the two biggest theoretical issues with Oracle AI (along with ELK). That is valuable in itself, but also signals that progress in AI safety theory is feasible, and that further research in a similar direction is a reasonable approach.

## Appendix

Proofs for all theorems are available in the online technical appendix. Note that the document is slightly out of date, and theorem numbers do not all match up, but it will be updated shortly.

## Main Theorem

For the proof of Theorem 1, we use the following lemma. For a proof of that, see the technical appendix.

Lemma 1: Under any zero-sum scoring rule S, all agents receive an expected score of 0 in any equilibrium.

Theorem 1: When n≥2, the combination of the optimistic-max decision rule D and a zero-sum scoring rule S is quasi-strictly proper.

Proof of Theorem 1:

First, we show that in equilibrium, $\nexists\, p_{i,a}$ such that $p_{i,a} \succ q_{a^*}$. Suppose $p$ is an equilibrium, and such a prediction exists. Based on the decision rule, the principal must end up choosing some action $a'$ where $\exists\, p_{j,a'} \succ q_{a^*}$. Then, since the decision rule is optimistic, there exists some agent $k \neq j$ who is either reporting honestly or can change their prediction to $p_{k,a'} = q_{a'}$ without affecting the action taken. The score for such a prediction, $S_k(a', (q_{a'}, p_{-(k,a')}), q)$, is equal to:

$$s(q_{a'}, q_{a'}) - \frac{s(p_{j,a'}, q_{a'})}{n-1} - \sum_{i \neq j,k} \frac{S'(p_{i,a'}, q_{a'})}{n-1} > 0$$

The inequality follows because $s(\cdot, q_{a'})$ is uniquely maximized at $q_{a'}$, and $p_{j,a'} \neq q_{a'}$. By Lemma 1, this contradicts that $p$ is an equilibrium.

Next, we show that in equilibrium, $\nexists\, i$ such that $q_{a^*} \succ p_{i,a^*}$. Suppose $p$ is an equilibrium, and such a prediction exists. If another agent $j \neq i$ reports honestly, then $D(p) = a^*$ since the decision rule is optimistic and we have previously established that no predictions are preferred to $q_{a^*}$. The score for such a prediction, $S_j(a^*, (q_{a^*}, p_{-(j,a^*)}), q)$, is equal to:

$$s(q_{a^*}, q_{a^*}) - \frac{s(p_{i,a^*}, q_{a^*})}{n-1} - \sum_{k \neq i,j} \frac{S'(p_{k,a^*}, q_{a^*})}{n-1} > 0$$

By Lemma 1, this contradicts that p is an equilibrium.

In equilibrium, each agent reports honestly for $a^*$ and there are no reports $p_{i,a} \succ q_{a^*}$, so running the max decision rule on any $p_i$ must choose $a^*$. Using the optimistic-max decision rule across agents similarly chooses $a^*$. Predictions conditional on untaken actions do not enter the scoring function, and so honesty is weakly incentivized. As such, the decision/scoring rule pair is quasi-strictly proper.
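As a sanity check on the direction of the result, the following sketch brute-forces unilateral deviations from the honest profile in a two-action, two-outcome example with three agents. It is an illustration, not a proof. It assumes the zero-sum score takes the form suggested by the expression in the proof (own score minus the average score of the other agents), that the base rule $s$ is the expected log score, and that the preference relation $\succ$ ranks distributions by expected utility. All numbers and function names are hypothetical.

```python
import math

# Illustrative numbers only.
Q = [[0.5, 0.5], [0.9, 0.1]]   # true outcome distribution for each action
UTILITY = [0.0, 1.0]           # principal's utility for each outcome
N_AGENTS = 3


def log_score(p, q):
    """Expected log score of report p when outcomes are drawn from q."""
    return sum(qo * math.log(po) for po, qo in zip(p, q))


def expected_utility(p):
    return sum(po * u for po, u in zip(p, UTILITY))


def optimistic_max(profile):
    """Pick the action whose most optimistic report has the highest expected utility."""
    best = [max(expected_utility(agent[a]) for agent in profile)
            for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: best[a])


def zero_sum_score(i, profile):
    """Assumed form: own score minus the mean score of the other agents,
    evaluated only on the action the principal ends up taking."""
    a = optimistic_max(profile)
    own = log_score(profile[i][a], Q[a])
    others = sum(log_score(profile[j][a], Q[a])
                 for j in range(N_AGENTS) if j != i)
    return own - others / (N_AGENTS - 1)


honest = [list(q) for q in Q]
grid = [k / 20 for k in range(1, 20)]  # candidate reports 0.05, ..., 0.95

# Agent 0 tries every pair of reports on the grid; agents 1 and 2 stay honest.
best_deviation = max(
    zero_sum_score(0, [[[1 - t0, t0], [1 - t1, t1]], honest, honest])
    for t0 in grid
    for t1 in grid
)
honest_scores = [zero_sum_score(i, [honest] * N_AGENTS) for i in range(N_AGENTS)]
```

On this grid no deviation earns the deviating agent a positive expected score, and every agent scores zero at the honest profile, in line with Lemma 1. Deviations that only alter the report for an untaken action tie with honesty, which is the sense in which honesty is only weakly incentivized there.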

## Stochastic Choice Conditions

The conditions on stochastic choice decision rules are the following:

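Condition 1: If $p'_{i,a} \succ p_{i,a}\ \forall a \in A \subseteq \mathcal{A}$ and $p_{-(i,A)} = p'_{-(i,A)}$, then $\exists a \in A$ such that $D_a(p) > 0$ implies $\exists a' \in A$ such that $D_{a'}(p') > 0$

Condition 2: If $p_{-(i,a)} = p'_{-(i,a)}$ and $p_{i,a} \succ p'_{i,a}$, then for $a' \neq a$, $D_{a'}(p) > 0$ implies $D_{a'}(p') > 0$

Condition 3: If $p_{-(i,a)} = p'_{-(i,a)}$ and $D_a(p) = D_a(p') = 0$, then $D(p) = D(p')$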

Condition 1 says that if an agent's predictions for some subset of actions are all changed to more preferred distributions, then if at least one action in that subset was assigned positive probability before the change, at least one will be assigned positive probability afterwards. Condition 2 says that if an agent's prediction for some action changes to a less preferred distribution, this alone will not cause the principal to assign zero probability to a different action. Condition 3 says that if an agent modifying their prediction for an action does not change the fact that it is assigned zero probability, the probabilities assigned to other actions do not change.