Proper scoring rules don’t guarantee predicting fixed points


Great post!

My interpretation: as an oracle varies its prediction, it both picks up points from becoming more accurate and from making the outcome more predictable. This means that proper scoring rules, tuned to only one of these pressures, will result in a dishonest equilibrium instead of an honest one.

(Incidentally, this seems like a pretty sensible explanation for why humans are systematically overconfident; if confidence increases success, then the point when the benefits of increased success match the costs of decreased accuracy will be more extreme than the truth, in both directions.)

We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model's beliefs about the world, after the model makes that prediction.

I think a broad definition works for 'fixed points', but I think for 'self-fulfilling prophecy' I would also expect there to be a counterfactual element--if I say the sun will rise tomorrow and it rises tomorrow, this isn't a self-fulfilling prophecy because the outcome isn't reliant on expectations about the outcome.

Thanks for your comment!

Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when f is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function f such that a dishonest prediction is optimal.

Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to honest ones (at least in L1-distance).

I think for 'self-fulfilling prophecy' I would also expect there to be a counterfactual element--if I say the sun will rise tomorrow and it rises tomorrow, this isn't a self-fulfilling prophecy because the outcome isn't reliant on expectations about the outcome.

Yes, that is fair. To be faithful to the common usage of the term, one should maybe require at least two possible fixed points (or points that are somehow close to fixed points). The case with a unique fixed point is probably also safer, and worries about “self-fulfilling prophecies” don't apply to the same degree.

But we show for any strictly proper scoring rule that there is a function f such that a dishonest prediction is optimal.

Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get an honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]

I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?

Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity).

Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation, and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linear rule if p=0, and only 0.5 points on average if p=0.5; for a log rule, it's 0 points and -0.7 points.)
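For concreteness, here is a small Python sketch of that picture. It assumes "linear rule" means something like S(p,y) = 1 − |p−y| (the exact rules in the animation may differ), and computes the maximum expected score as a function of the forecaster's true belief q by grid search; the curve is convex with maxima at the extremes, matching the numbers above.

```python
import math

# Hypothetical "linear" rule: one point minus the absolute error.
def linear_rule(p, y):
    return 1 - abs(p - y)

# Logarithmic rule as in the post.
def log_rule(p, y):
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def max_expected(S, q):
    # Best achievable expected score q*S(p,1) + (1-q)*S(p,0) over reports p
    grid = [i / 1000 for i in range(1, 1000)]
    return max(q * S(p, 1) + (1 - q) * S(p, 0) for p in grid)

# Both curves are convex in q with maxima at the extremes:
assert max_expected(linear_rule, 0.001) > 0.99        # ~1 point near q = 0
assert abs(max_expected(linear_rule, 0.5) - 0.5) < 1e-9
assert max_expected(log_rule, 0.001) > -0.01          # ~0 points near q = 0
assert abs(max_expected(log_rule, 0.5) + math.log(2)) < 1e-9   # ~ -0.69
```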

But could you come up with a scoring rule where the maximum expected score as a function of p is flat? If true, there's no longer an incentive to have extreme probabilities. But that incentive was doing useful work before, and so this seems likely to break something else--it's probably no longer the case that you're incentivized to say your true belief--or require something like batch statistics (since I think you might be able to get something like this by scoring not individual predictions but sets of them, sorted by p or by whether they were true or false). [This can be done in some contexts with markets, where your reward depends on how close the market was to the truth before, but I think it probably doesn't help here because we're worried about the oracle's ability to affect the underlying reality, which is also an issue with prediction markets!]

To be clear, I'm not at all confident this is possible or sensible--it seems likely to me that an adversarial argument goes thru where as oracle I always benefit from knowing which statements are true and which statements are false (even if I then lie about my beliefs to get a good calibration curve or w/e)--but that's not an argument about the *scale* of the distortions that are possible.

Update: we recently discovered the *performative prediction* (Perdomo et al., 2020) literature (HT Alex Pan). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most results in this post have analogues in performative prediction. We will give a more detailed comparison in an upcoming paper. We also discuss performative prediction more in our follow-up post on stop-gradients.

Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger's and Leo Gao's mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.

Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.

## Introduction

One issue with oracle AIs is that they might be able to influence the world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy or sell stocks, and thus influence the outcome of its prediction. In such a situation, there is not one fixed ground truth distribution against which the AI's predictions may be evaluated. Instead, the chosen prediction can influence what the model believes about the world. We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model's beliefs about the world, after the model makes that prediction.

If an AI has a fixed belief about the world, then optimizing a strictly proper scoring rule incentivizes it to output this belief (assuming the AI is inner aligned to this objective). In contrast, if the AI can influence the world with its predictions, this opens up the possibility for it to manipulate the world to receive a higher score. For instance, if the AI optimizes the world to make it more predictable, this would be dangerous, since the most predictable worlds are lower entropy ones in which humans are more likely dead or controlled by a misaligned AI. Optimizing in the other direction and making the world as unpredictable as possible would presumably also not be desirable. If, instead, the AI selects one fixed point (of potentially many) at random, this would still involve some non-aligned optimization to find a fixed point, but it may be overall less dangerous.

In this post, we seek to better understand what optimizing a strictly proper scoring rule incentivizes, given that the AI's prediction can influence the world. Is it possible for such a rule to make the AI indifferent between different fixed points? Can we at least guarantee that the AI reports a fixed point if one exists, or provide some bounds on the distance of a prediction from a fixed point?

First, we show that strictly proper scoring rules incentivize choosing more extreme fixed points, i.e., fixed points that push probabilities to extreme values. In the case of symmetric scoring rules, this amounts to making the world more predictable and choosing the lowest entropy fixed point. However, in general, scoring rules can incentivize making specific outcomes more likely, even if this does not lead to the lowest entropy distribution.

Second, we show that an AI maximizing a strictly proper scoring rule will in general not report a fixed point, even if one exists and is unique. In fact, we show that under some reasonable distributions over worlds, optimal predictions are almost never fixed points. We then provide upper bounds for the inaccuracy of reported beliefs, and for the distance of predictions from fixed points. Based on these bounds, we develop scoring rules that make the bounds arbitrarily small.

Third, we perform numerical simulations using the quadratic and the logarithmic scoring rule, to show how the inaccuracy of predictions and the distance of predictions from fixed points depend on the oracle's perceived influence on the world via its prediction. If its beliefs about the world depend on its predictions affine-linearly with slope α≤1/2, then our bounds are tight, and inaccuracy under the log scoring rule scales roughly as α/5. E.g., at α=1/4, predictions can be off by up to 5%.

Our results support the prior idea that an oracle will try to make the world more predictable to get a higher score. However, importantly, we show that it is not even clear whether an oracle would output a self-fulfilling prophecy in the first place: how close the oracle's prediction will be to a fixed point depends on facts about the world and about the scoring rule. The problem thus persists even if there is a unique fixed point or no fixed point at all. It would not be sufficient for safety, for instance, to build an oracle with a unique and safe fixed point.

## Related work

In general, fixed points can lead to arbitrarily bad outcomes (e.g., Armstrong, 2019). Most prior work has thus tried to alleviate problems with fixed points, e.g., by making the oracle predict counterfactual worlds that it cannot influence (Armstrong and O'Rorke, 2018). We are not aware of any prior work on specifically the question of whether oracles would be incentivized to output fixed points at all. We would be interested in pointers to any related work.

Work on decision markets (e.g., Chen et al., 2011; Oesterheld and Conitzer, 2021) tries to set the right incentives for predictions that inform decisions. However, in the decision market setup, predictions only affect the world by influencing the principal's final decision, whereas we consider an arbitrary relationship between world and prediction. Unlike in our setup, in the decision market setup, there exist scoring rules that do incentivize perfectly accurate reports. There is also some literature on scoring rules discussing agents manipulating the world after making a prediction (e.g., Oka et al., 2014; Shi, Conitzer, and Guo, 2009), though to our knowledge the cases discussed do not involve agents influencing the world directly through their predictions.

A related topic in philosophy is epistemic decision theory (e.g., Greaves 2013). Greaves (2013) introduces several cases in which the world depends on the agent's credences and compares the verdicts of different epistemic decision theories (such as an evidential and a causal version). While some of Greaves' examples involve agents knowably adopting incorrect beliefs, they require joint beliefs over several propositions, and Greaves only considers individual examples. We instead consider only a single binary prediction and prove results for arbitrary scoring rules and relationships between predictions and beliefs.

## Formal setup

## Binary prediction setup

We focus on a simple case in which the oracle makes a prediction by outputting a probability p for a binary event. A scoring rule S:[0,1]×{0,1}→R takes in a prediction p∈[0,1] and an outcome y∈{0,1} and produces a score S(p,y). For a given scoring rule, define S(p,q):=qS(p,1)+(1−q)S(p,0), the expectation of S given that y is Bernoulli distributed with parameter q. Here, q represents the model's actual belief, p its stated prediction, and S(p,q) the model's expected score, which the model is trying to maximize. S is strictly proper iff for any given q∈[0,1], S(p,q) has a unique maximum at p=q.

Example 1 (Logarithmic scoring rule). The logarithmic scoring rule is defined as

S(p,y):=ylogp+(1−y)log(1−p),

and S(p,q)=qlogp+(1−q)log(1−p). This is also the negative of the cross-entropy loss employed in training current large language models. One can show that it is strictly proper.

Example 2 (Quadratic scoring rule). Another strictly proper scoring rule is the quadratic score, defined as

S(p,y):=1−(p−y)2,

with S(p,q)=1−q(1−p)2−(1−q)p2. This is an affine transformation of the Brier score, making the two of them equivalent scoring rules.
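As a quick numerical sanity check of strict propriety (a Python sketch, not part of the original analysis): for both example rules, grid-maximizing the expected score S(p,q) over reports p recovers the true belief q.

```python
import math

def log_score(p, y):
    # Example 1: S(p, y) = y*log(p) + (1-y)*log(1-p)
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def quad_score(p, y):
    # Example 2: S(p, y) = 1 - (p - y)^2
    return 1 - (p - y) ** 2

def expected_score(S, p, q):
    # S(p, q) := q*S(p,1) + (1-q)*S(p,0)
    return q * S(p, 1) + (1 - q) * S(p, 0)

def best_report(S, q):
    # Grid search for the report p maximizing expected score under belief q
    grid = [i / 1000 for i in range(1, 1000)]
    return max(grid, key=lambda p: expected_score(S, p, q))

# Strict propriety: the unique optimal report equals the true belief q
for q in (0.1, 0.25, 0.5, 0.9):
    assert abs(best_report(log_score, q) - q) < 1e-3
    assert abs(best_report(quad_score, q) - q) < 1e-3
```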

## Gneiting and Raftery representation

Gneiting and Raftery (2007) provide an alternative way to represent scoring rules, which will be helpful for many proofs.

Theorem 3 (Gneiting and Raftery, 2007, Theorem 1). A scoring rule S is strictly proper if and only if there exists a strictly convex function G:[0,1]→R with subderivatives g:[0,1]→R (if G is differentiable, this is just the derivative, g=G′) such that for any p,q∈[0,1], it is

S(p,q)=G(p)+g(p)(q−p).

The subderivative is a generalization of the derivative for convex functions. The theorem also holds when replacing "strictly proper" with "proper" and "strictly convex" with "convex".

In addition to providing an expression for S in terms of G and g (where G(p)=S(p,p)), we can use this result to create new strictly proper scoring rules by defining strictly convex functions G.
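To illustrate this construction, here is a Python sketch that builds a scoring rule from a strictly convex G and checks numerically that the induced rule is strictly proper. The particular G below is a hypothetical choice of ours, not one used in the post.

```python
def make_score(G, g):
    # Theorem 3: S(p, y) = G(p) + g(p)*(y - p) is strictly proper whenever
    # G is strictly convex with subderivative g.
    return lambda p, y: G(p) + g(p) * (y - p)

# Hypothetical choice: G(p) = p^4 + (1-p)^4 is strictly convex on [0, 1]
G = lambda p: p ** 4 + (1 - p) ** 4
g = lambda p: 4 * p ** 3 - 4 * (1 - p) ** 3  # derivative of G
S = make_score(G, g)

def expected(S, p, q):
    return q * S(p, 1) + (1 - q) * S(p, 0)

# The induced rule is strictly proper: the optimal report is the true belief
grid = [i / 1000 for i in range(1001)]
for q in (0.2, 0.5, 0.8):
    assert abs(max(grid, key=lambda p: expected(S, p, q)) - q) < 1e-3
```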

## Model of oracle's influence

To model an oracle's perceived influence on the world, we assume that there is a function f:[0,1]→[0,1] describing how the model's beliefs about the world, q, depend on its prediction, p. That is, we assume that q=f(p), and that the model is producing a prediction to maximize S(p,f(p)).

A point p is a fixed point of f if f(p)=p. Note that by Brouwer's fixed point theorem, if f is continuous, a fixed point always exists. Moreover, if f is Lipschitz continuous with constant L<1, then by Banach's fixed point theorem, there exists a unique fixed point. Lastly, note that if p is a fixed point of f, then we have S(p,f(p))=S(p,p)=G(p).
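The Banach fixed-point argument is constructive: iterating a contraction converges to its unique fixed point. A minimal Python sketch, using an illustrative affine-linear f of our own choosing:

```python
def find_fixed_point(f, p0=0.5, tol=1e-12, max_iter=10_000):
    # Banach fixed-point iteration: for a contraction f (Lipschitz L < 1),
    # repeatedly applying f converges to its unique fixed point.
    p = p0
    for _ in range(max_iter):
        p_next = f(p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

# Illustrative affine-linear f with slope 0.4 (so L = 0.4 < 1) and fixed point 0.7
f = lambda p: 0.7 + 0.4 * (p - 0.7)
p_star = find_fixed_point(f)
assert abs(p_star - 0.7) < 1e-9        # converged to the fixed point
assert abs(f(p_star) - p_star) < 1e-9  # and it is indeed fixed
```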

An oracle may not be able to change the world much with just a single binary prediction. However, the AI's influence may compound if it outputs many predictions, or if it gets to make a prediction over many possible outcomes. We restrict ourselves here to the single binary prediction case for simplicity, since this case is sufficient to gain an intuition and yield important conceptual insights. We plan on focusing on multiple predictions and predictions with more than two outcomes in future work, and we believe that versions of our results will also hold for those cases.

## Incentives to choose more extreme fixed points

One issue with oracle AIs arises when there are multiple possible fixed points, since the choice of a fixed point gives the AI an axis over which it can optimize the world. This problem would be alleviated if an AI were to simply choose among fixed points randomly, for instance. Unfortunately, we can show that under a strictly proper scoring rule, an AI is incentivized to choose extreme fixed points, i.e., predictions that make one of the outcomes most likely.

Specifically, we prove that for any fixed point, either all fixed points with lower probabilities or all fixed points with higher probabilities will result in a higher score. Our results do not unambiguously show that lower entropy fixed points are always preferred, since scoring rules may not be symmetric: they may favor one of the outcomes over the other. We show that if S is symmetric, i.e., S(p,p)=S(1−p,1−p) for all p, then this means that lower entropy fixed points, corresponding to more confident predictions, are always preferred. Moreover, we show that there exist (asymmetric) strictly proper scoring rules that incentivize making either of the outcomes more likely.

Proposition 4. Let S be any strictly proper scoring rule and let p∈[0,1] be a fixed point. Then either all fixed points p′>p, or all fixed points p′<p, get a higher score than p. That is, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p (or both). In particular, for any p∈(0,1), there exists a function f with fixed points at p and p′≠p such that |p′−1/2|>|p−1/2|, i.e., such that p′ has lower entropy, and p′ receives a higher score, i.e., S(p′,f(p′))>S(p,f(p)).

Proof. If p∈{0,1} then the proposition is vacuously true. We focus on the case where p∈(0,1). Since S is a strictly proper scoring rule, it is

S(p′,p′)>S(p,p′)=p′S(p,1)+(1−p′)S(p,0).

Next, if S(p,1)≥S(p,0), let p′≥p arbitrary, and if S(p,1)≤S(p,0), let p′≤p. Then we have

p′S(p,1)+(1−p′)S(p,0)≥pS(p,1)+(1−p)S(p,0)=S(p,p).

Combining both inequalities, we get

S(p′,p′)>p′S(p,1)+(1−p′)S(p,0)≥S(p,p),

which concludes the first part of the proof.

For the "in particular" part, note that for any points p and p′, it is trivial to find a function f such that both points are fixed points. For instance, all points are fixed points of the identity function f(x)=x. The statement then follows from the first part of the result, since for p∈(0,1), we must have S(p′,p′)>S(p,p) for either p′=0 or p′=1. ◻

Corollary 5. Assume that S is symmetric and let p∈[0,1] be arbitrary. Then we have S(p′,p′)>S(p,p) for all p′ such that |p′−1/2|>|p−1/2|. That is, lower entropy fixed points are always preferred.

Proof. Let p be arbitrary and assume p≥1/2 (the case p<1/2 follows analogously). Let p′ be such that |p′−1/2|>|p−1/2|, which implies p′>p since p≥1/2. By Proposition 4, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p. In the first case, we are done. In the second case, note that p≥1/2≥1−p≥1−p′. By symmetry of S, it follows that

S(p′,p′)=S(1−p′,1−p′)>S(p,p).

This shows the second case and concludes the proof. ◻

Proposition 6. For each outcome y∈{0,1}, there exists a strictly proper scoring rule S that incentivizes optimizing for y; i.e., such that S(p′,p′)>S(p,p) for all p,p′ with |p′−y|<|p−y|.

Proof. To begin, let y=1 and define G via G(p):=p2 for p∈[0,1]. Since this function is strictly convex, it defines a strictly proper scoring rule S by the Gneiting and Raftery characterization (see Theorem 3). Then, for any p′>p, we have

S(p′,p′)=G(p′)=p′2>p2=S(p,p).

This shows that S incentivizes choosing fixed points that make outcome 1 more likely.

Next, for the case y=0, we can choose G(p):=(1−p)2. The proof then follows analogously to the case y=1. ◻
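The proof is easy to check numerically. A Python sketch of Proposition 6: with G(p)=p2 the score at a fixed point, S(p,p)=G(p), increases in p, while the mirrored choice G(p)=(1−p)2 makes it decrease.

```python
# G(p) = p^2 is strictly convex, so by Theorem 3 it induces a strictly proper
# scoring rule; at a fixed point the score is S(p, p) = G(p).
G = lambda p: p ** 2            # rewards making outcome 1 more likely
G0 = lambda p: (1 - p) ** 2     # mirrored rule: rewards outcome 0 instead

fixed_points = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [G(p) for p in fixed_points]
scores0 = [G0(p) for p in fixed_points]

assert scores == sorted(scores)                  # increasing in p
assert scores0 == sorted(scores0, reverse=True)  # decreasing in p
```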

## Incentives to predict non-fixed-points

The danger posed by oracles influencing the world does not depend on the existence of multiple fixed points. Oracles will try to manipulate the world, even if there is a unique fixed point or if no fixed point exists. In fact, we can prove that an AI maximizing a strictly proper scoring rule does in general not predict a fixed point, even if one exists.

An AI maximizing a strictly proper scoring rule balances two incentives: on the one hand, it has an incentive to make accurate predictions. On the other hand, it has an incentive to cause more extreme distributions over worlds (in the case of symmetric scoring rules, this amounts to minimizing entropy). If predictions can influence the world, then the point at which that trade-off is optimized is in general not a fixed point. An important consequence of this is that the oracle AIs considered here essentially act like agents: they try to shape the world, just like a standard RL agent.

Figure 1: Illustration of incentives under the logarithmic scoring rule. We plot level curves of equal values S(p,q) in different shades of blue. An agent can improve their score by moving towards pairs (p,q) indicated by darker shades of blue. The agent is constrained to pairs (p,q) such that q=f(p) (in this example, f is affine linear and plotted in red). By Lagrange's method, we know that to maximize S(p,f(p)), the agent chooses the point (p,f(p)) at which the level curve through (p,f(p)) is tangent to f. Note that curves asymptote at 0 and 1 for the log scoring rule, which means that the rule strongly disincentivizes predicting p∈{0,1} when q∉{0,1}. This is desirable for safety and distinguishes it from the quadratic scoring rule (not shown here).

Another way to understand our result intuitively is with an analogy to betting markets. Assume a trader is offered securities that pay out $1 if some outcome A comes about in the future. E.g., it could pay out if a specific candidate wins an election. Assume the agent is offered securities at different prices, and it buys securities at prices ≤p, but is not willing to buy securities at a marginally higher price. Then, assuming the agent cannot influence the outcome, we can infer that it has credence p in proposition A. One can show that maximizing a proper scoring rule is equivalent to trading in such a market, and individual scoring rules correspond to specific pricing schemes for such securities (see Oesterheld and Conitzer, 2021, Sec. 5.2; Caspar Oesterheld will explain this more accessibly in an upcoming blog post).

Now assume that there is a unique fixed point at credence p, and that the agent has bought many securities at prices ≤p. Since the agent owns securities that pay out if outcome A happens, it now has a stake in this outcome. It will try to influence the world in whatever way possible to make outcome A happen. In particular, the agent would be willing to buy additional shares in outcome A at a loss, if this made outcome A more likely, similar to a crypto trader buying their own NFTs at high prices to generate more hype.

For example, assume that by buying a security at price p+ϵ, the agent can manipulate the world and thus increase its credence in outcome A to p+ϵ/2. Then it may be rational for the agent to buy this security, even at a small expected loss, in order to make all of its other securities more valuable. Overall, the agent will tend to buy securities up to some price p′>p such that p′ is not a fixed point and does not represent its actual belief in proposition A anymore.

Note that in the example, the agent had an incentive to make outcome A more likely, but in general, the incentives for the agent will depend specifically on what securities the agent is offered at what prices. Analogously, the incentives provided by a scoring rule will depend on the specific chosen scoring rule. But regardless of the scoring rule, the agent will be incentivized to manipulate the world, to the point where it will not even make accurate predictions anymore.

Now we turn to our formal results. First, we show that there exist cases where a fixed point exists but the optimal prediction is not a fixed point. Afterwards, we show that when assuming differentiability and some reasonable distribution over f, optimal predictions are almost surely not fixed points.

Proposition 7. Let S be any strictly proper scoring rule. For any interior fixed point p∗∈(0,1) there exists a function f with Lipschitz constant L<1 and a unique fixed point at p∗, such that for some p′≠p∗, it is S(p′,f(p′))>S(p∗,f(p∗)). That is, it is not optimal for the model to output the unique fixed point of f.

Note that since the function f has Lipschitz constant strictly smaller than 1, it represents a world that "dampens" the influence of the prediction, leading to a unique fixed point by Banach's fixed point theorem. It is interesting that the model still prefers to make a prediction that is not a fixed point.

Proof. To begin, let p∗∈(0,1) be arbitrary and define fα(p):=(1−α)p+αp∗ for α∈[0,1]. To illustrate our proof idea, consider the figure below. For any α>0, fα has a unique fixed point at p∗, while f0 is the identity function, so all points are fixed points of f0. By the result in the previous section, we know that there is some point p′ which receives a strictly higher score than p∗ if it is a fixed point, so S(p′,f0(p′))>S(p∗,f0(p∗)). We will show that S(p′,fα(p′)) is continuous in α. This means that we can choose a small enough α>0 to make sure that p′ remains preferable over p∗, i.e., S(p′,fα(p′))>S(p∗,fα(p∗)), despite it not being a fixed point of fα for α>0.

Figure 2: Illustration of the setup for our proof. We plot f0 in black and fα for α=0.15 in red.

To formalize the proof, begin by noting that |fα(p)−fα(p′)|≤(1−α)|p−p′| for any p,p′∈[0,1], so fα has Lipschitz constant L:=(1−α)<1, and as mentioned, p∗ is the unique fixed point of fα.

Now consider the case α=0. As mentioned, every point is a fixed point of f0. Then by Proposition 4, there exists p′≠p∗ and ϵ>0 such that

S(p′,f0(p′))=S(p′,p′)≥S(p∗,p∗)+ϵ=S(p∗,f0(p∗))+ϵ.(i)

So for α=0, the model prefers to predict p′ over p∗ and gets at least ϵ additional expected score.

Now we show that the model still prefers to predict p′, even for some small α>0. To that end, note that

S(p,fα(p))=fα(p)S(p,1)+(1−fα(p))S(p,0)

is affine-linear in fα(p), and fα(p) is affine-linear in α by construction. This means that S(p,fα(p)) is continuous in α. So, there must exist some small α>0 such that

S(p′,fα(p′))≥S(p′,f0(p′))−ϵ/2=S(p′,p′)−ϵ/2≥S(p∗,p∗)+ϵ/2>S(p∗,p∗)=S(p∗,fα(p∗)),

where the second inequality follows from (i). Choosing α in this way, we can define f:=fα, and have thus provided a function that satisfies the statement that we wanted to prove. ◻
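The phenomenon in Proposition 7 is easy to reproduce numerically. The following Python sketch (with illustrative values p∗=0.7 and α=0.15, matching Figure 2's α) grid-maximizes the expected log score along fα and finds an optimum away from the unique fixed point:

```python
import math

def expected_log(p, q):
    # Expected log score S(p, q) = q*log(p) + (1-q)*log(1-p)
    return q * math.log(p) + (1 - q) * math.log(1 - p)

# f_alpha(p) = (1-alpha)*p + alpha*p_star: unique fixed point p_star for alpha > 0
p_star, alpha = 0.7, 0.15   # illustrative values
f = lambda p: (1 - alpha) * p + alpha * p_star

grid = [i / 10_000 for i in range(1, 10_000)]
p_hat = max(grid, key=lambda p: expected_log(p, f(p)))

assert abs(f(p_star) - p_star) < 1e-12  # p_star is the unique fixed point
assert abs(p_hat - p_star) > 0.01       # but the optimal prediction is elsewhere
```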

The above result raises the question of whether a situation where fixed points are suboptimal is a niche counterexample or actually common. We show that under some relatively mild assumptions, the optimal prediction is not a fixed point with probability 1.

We assume that G:p↦S(p,p) is twice differentiable and that f is twice continuously differentiable. The first assumption is satisfied by all commonly used scoring functions. Our result then holds with probability 1 for any distribution over twice continuously differentiable functions f such that f and f′ have a bounded joint density function. For instance, it would hold for a Gaussian process with smooth kernel and mean functions, if we condition on observing function values in the interval [0,1] (see Example 17 in Appendix A).

The intuition behind this result is that if a prediction p is an interior point and optimal, it must satisfy (d/dp)S(p,f(p))=0. Using the Gneiting and Raftery characterization (Theorem 3), we can show that this is a kind of knife-edge case, in which a critical point of f or of G has to coincide with a fixed point of f (see Figure 1). Given bounded density functions, this happens with probability 0.

Proposition 8. Let S be a strictly proper scoring rule and let G:p↦S(p,p) be twice differentiable. Let F:={F(p)}p∈(0,1) be a stochastic process with values in [0,1] and assume that

- the sample paths p⇝F(p) are twice continuously differentiable;
- for each p∈(0,1), the random vector (F(p),F′(p)) has a density hF(p),F′(p), and there exists a constant C such that for any p∈(0,1), it is hF(p),F′(p)≤C.

Then, almost surely, there is no point p∈(0,1) such that p∈argmax S(p,F(p)) and F(p)=p.

Proof. In Appendix A. ◻

## Bounds on the deviation from fixed points

In the previous sections, we have shown that an oracle that can manipulate the world may not even make accurate predictions. How bad is it, and can anything be done? Assuming differentiability of f and S, this section provides bounds for the inaccuracy of optimal predictions (i.e., |p−f(p)|) and their distance from fixed points (i.e., |p−p∗|). We then show how one can choose scoring rules to make these bounds arbitrarily small. We leave the case of non-differentiable f and G for future work.

Proposition 9. Let S be a strictly proper scoring rule, and assume G:p↦S(p,p) and g=G′ as in the Gneiting and Raftery characterization (see Theorem 3) are differentiable. Assume that f is differentiable and has Lipschitz constant L, i.e., |f′(p)|≤L for all p∈[0,1]. Then whenever ^p is an optimal prediction, |^p−f(^p)|≤L supp|g(p)/g′(p)|.

Proof. First, consider the case p∈(0,1). In that case, if p is an optimal prediction, we must have (d/dp)S(p,f(p))=0. Hence, it follows that

0=(d/dp)S(p,f(p))=(d/dp)(g(p)(f(p)−p)+G(p))=g′(p)f(p)+f′(p)g(p)−g′(p)p=g′(p)(f(p)−p)+f′(p)g(p).

Rearranging terms, we get

p−f(p)=f′(p)g(p)/g′(p).

It follows that

|p−f(p)|=|f′(p)g(p)/g′(p)|=|f′(p)||g(p)|/|g′(p)|≤L|g(p)|/|g′(p)|≤L supp′|g(p′)/g′(p′)|,

which concludes the proof in the case p∈(0,1).

Next, assume p=0 is the optimal prediction. Then we must have

0≥(d/dp)S(p,f(p))=g′(p)(f(p)−p)+f′(p)g(p),

which is equivalent to p−f(p)≥f′(p)g(p)/g′(p). Note that since p=0, it is p−f(p)≤0. Hence, |p−f(p)|≤|f′(p)g(p)/g′(p)|, which can be bounded as above.

Finally, if p=1, then analogously we must have (d/dp)S(p,f(p))≥0, which is equivalent to p−f(p)≤f′(p)g(p)/g′(p). Note that since p=1, it is p−f(p)≥0. Hence, |p−f(p)|≤|f′(p)g(p)/g′(p)|, which again can be bounded as before. ◻

In the case where L<1, we can derive from this result a bound on |p−p∗|.

Corollary 10. Assume L<1 and let p∗ be the unique fixed point of f. Then if ^p is an optimal prediction, we have

|^p−p∗|≤L/(1−L) supp|g(p)/g′(p)|.

Proof. To begin, recall that for L<1, p∗ exists and is unique by Banach's fixed point theorem. For any p∈[0,1], we have

|p−p∗|=|p−f(p)+f(p)−p∗|≤|p−f(p)|+|f(p)−f(p∗)|≤|p−f(p)|+L|p−p∗|,

which implies (1−L)|p−p∗|≤|p−f(p)| and hence |p−p∗|≤|p−f(p)|/(1−L). Therefore, if ^p∈[0,1] is an optimal prediction, it follows by Proposition 9 that

|^p−p∗|≤|^p−f(^p)|/(1−L)≤L/(1−L) supp|g(p)/g′(p)|,

which concludes the proof. ◻

Example 11 (Bound for the quadratic scoring rule). For the quadratic scoring rule, we have g(p)=2p−1 and g′(p)=2. We plot |g(p)/g′(p)| below.

Figure 3: Graph of g(p)/g′(p) for the quadratic scoring rule.

Note that |g(p)|≤1, so |g(p)|/g′(p)≤1/2. Hence, if |f′(p)|≤L for all p∈[0,1], then |p−f(p)|≤L/2. Moreover, if L<1, we get |p−p∗|≤L/(2−2L).

Example 12 (Bound for the log scoring rule). Next, consider the logarithmic scoring rule. Then g(p)=−log(1−p)+logp and g′(p)=1/(1−p)+1/p. One can show that then g(p)/g′(p)=(p−1)p(log(1−p)−log(p)).

Figure 4: Graph of g(p)/g′(p) for the log scoring rule.

Numerically optimizing this, we get a bound of |g(p)/g′(p)|≈0.22<1/4, attained roughly at p≈0.2 and p≈0.8. So the log score gets us roughly half of the bound of the quadratic score.
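The numerical optimization in Example 12 can be reproduced with a short Python sketch (a simple grid search rather than an exact optimizer):

```python
import math

def ratio(p):
    # |g(p)/g'(p)| for the log rule: g(p) = log(p/(1-p)), g'(p) = 1/(1-p) + 1/p
    g = math.log(p / (1 - p))
    g_prime = 1 / (1 - p) + 1 / p
    return abs(g / g_prime)

grid = [i / 100_000 for i in range(1, 100_000)]
bound = max(ratio(p) for p in grid)
argmax = max(grid, key=ratio)

assert 0.21 < bound < 0.23                   # numeric sup is about 0.22 < 1/4
assert 0.1 < min(argmax, 1 - argmax) < 0.25  # attained away from 0, 1/2, and 1
```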

The bounds proved above become small if L is small. If one can choose the scoring rule, then one can also get stronger guarantees by making supp|g(p)/g′(p)| small. In fact, as the following result shows, one can make g(p)/g′(p) arbitrarily small, and thus get arbitrarily strong bounds for any given Lipschitz constant L.

Proposition 13. For every ϵ>0 and Lipschitz constant L, there is a strictly proper scoring rule that ensures |p−f(p)|≤ϵ for any optimal prediction p. For L<1, there exists a scoring rule that ensures |p−p∗|≤ϵ.

Proof. Consider the scoring rule defined by g(p)=eKp via Gneiting and Raftery's characterization (choosing, e.g., G(p)=eKp/K). Then g(p)/g′(p)=1/K. Hence, by Proposition 9, setting K=L/ϵ achieves the desired bound on |p−f(p)|. By Corollary 10, the bound on |p−p∗| is achieved by K=L/((1−L)ϵ). ◻

We can visualize this result again with indifference curves, the same as in Figure 1. For K=10, the function has high curvature at the fixed points, thus forcing optimal predictions to be close to fixed points.

Figure 5 (a) and (b): Indifference curves for the scoring rule defined via g(p):=eKp with K=1 (a) and K=10 (b). As K increases, the curves become sharper with higher curvature at the diagonal. This means that optima will stay close to the diagonal, regardless of the slope of f.

Unfortunately, as far as we can tell, this only works by using relatively outlandish scoring rules, where payoffs vary enormously with the prediction and the outcome. This would not be very practical for humans, and it could also be a hard goal to instill into AI systems, e.g., due to numerical issues.
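The effect of K can also be checked numerically. The following Python sketch (with illustrative values p∗=0.3 and α=0.5 of our own choosing) grid-maximizes the expected score induced by g(p) = exp(K·p) and shows the inaccuracy shrinking as K grows:

```python
import math

def expected(p, q, K):
    # Gneiting-Raftery form S(p, q) = G(p) + g(p)*(q - p) with
    # g(p) = exp(K*p) and G(p) = exp(K*p)/K (Proposition 13)
    return math.exp(K * p) / K + math.exp(K * p) * (q - p)

p_star, alpha = 0.3, 0.5   # illustrative values; f has slope L = 1 - alpha = 0.5
f = lambda p: (1 - alpha) * p + alpha * p_star

grid = [i / 10_000 for i in range(10_001)]
gaps = {}
for K in (1, 10, 100):
    p_hat = max(grid, key=lambda p: expected(p, f(p), K))
    gaps[K] = abs(p_hat - f(p_hat))

# Inaccuracy shrinks with K, consistent with the bound |p - f(p)| <= L/K
assert gaps[100] < gaps[10] < gaps[1]
assert gaps[100] < 0.01
```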

## Numerical simulations

In this section, we provide some numerical simulations with the quadratic and log scoring rules, to see how inaccurate score-maximizing predictions might be in practice. We make the simplifying assumption that f is affine-linear with slope between 0 and 1, and analyze the dependence of the inaccuracy of the prediction and its distance from the fixed point depending on the slope. The Mathematica notebook for our experiments (including some interactive widgets) can be found here.

Our experiments provide an initial estimate of the degree to which predictions can be off, depending on how much influence the oracle can exert via its prediction. This is just a toy model: in general, predictions will involve more than two possible outcomes, and the dependence of an oracle's beliefs on its predictions will be highly nonlinear. Nevertheless, we believe that our analysis is useful for providing some initial intuition about the potential magnitude of the effect.

## Experimental setup

Concretely, we model f to be affine-linear with slope α ∈ [0, 0.95] and fixed point p∗, thus yielding the functional form

f(p) := p∗ + α(p − p∗) for all p ∈ [0, 1].

We stop at α ≤ 0.95 to avoid instabilities as α → 1, where f approaches the identity function. For a given scoring rule S, fixed point p∗, and slope α, we maximize S(p, f(p)) to find the optimal prediction p̂. We then evaluate both the inaccuracy of the optimal prediction, i.e., |p̂ − f(p̂)|, and its distance from the fixed point, i.e., |p̂ − p∗|.
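A minimal re-implementation of this setup (the original experiments use a Mathematica notebook; the grid search and parameter values below are our own simplifications), here for the quadratic score:

```python
import numpy as np

def f(p, p_star, alpha):
    # affine-linear belief function with fixed point p_star and slope alpha
    return p_star + alpha * (p - p_star)

def expected_brier(p, q):
    # expected quadratic (Brier) score of reporting p under belief q;
    # affine rescalings of the score do not change the argmax
    return -(q * (1 - p) ** 2 + (1 - q) * p ** 2)

def optimal_prediction(p_star, alpha, n=100001):
    ps = np.linspace(0.0, 1.0, n)
    return ps[np.argmax(expected_brier(ps, f(ps, p_star, alpha)))]

p_star, alpha = 0.2, 0.5  # illustrative values
p_hat = optimal_prediction(p_star, alpha)
inaccuracy = abs(p_hat - f(p_hat, p_star, alpha))  # |p_hat - f(p_hat)|
dist_fp = abs(p_hat - p_star)                      # |p_hat - p*|
# for these values the expected score is strictly decreasing in p,
# so the optimum sits at the boundary p_hat = 0
```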

Figure 6: Illustration of our experimental setup, with a given function f parameterized by a fixed point p∗ and slope α, and the optimal prediction p̂, for the log scoring rule.

## Results

To begin, we maximize distances across possible choices of fixed points p∗ ∈ [0, 1], and plot the maximal inaccuracy of the optimal prediction as well as the maximal distance from a fixed point. We also compare against our theoretical bounds for both quantities, i.e., |p̂ − f(p̂)| ≤ α sup_p |g(p)/g′(p)| and |p̂ − p∗| ≤ (α/(1−α)) sup_p |g(p)/g′(p)|, where sup_p |g(p)/g′(p)| = 1/2 for the quadratic scoring rule and sup_p |g(p)/g′(p)| ≈ 0.22 for the log scoring rule.

Figure 7 (a) and (b): Max inaccuracy and max distance to fixed point (FP) of optimal predictions, depending on the slope of f, for the quadratic (a) and logarithmic (b) score. We display both our empirical results and our theoretical bound.

For both the quadratic and log scoring rule, our theoretical bounds are tight for slopes α ≤ 0.5. For higher slopes, inaccuracy goes down, as the function f becomes closer to the identity function and optimal predictions are constrained to [0, 1].
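The bound can be spot-checked numerically for the log score; the sweep over slopes and fixed points below is our own sketch, not the authors' notebook:

```python
import numpy as np

def log_expected_score(p, q):
    # expected log score of reporting p when the outcome is Bernoulli(q)
    return q * np.log(p) + (1 - q) * np.log(1 - p)

SUP_RATIO = 0.225  # numerical upper estimate of sup_p |g(p)/g'(p)| for the log score

ps = np.linspace(1e-4, 1 - 1e-4, 20001)
for alpha in np.linspace(0.05, 0.5, 10):
    worst = 0.0
    for p_star in np.linspace(0.05, 0.95, 19):
        q = p_star + alpha * (ps - p_star)  # beliefs f(p) on the grid
        p_hat = ps[np.argmax(log_expected_score(ps, q))]
        f_hat = p_star + alpha * (p_hat - p_star)
        worst = max(worst, abs(p_hat - f_hat))
    # Proposition 9: |p_hat - f(p_hat)| <= alpha * sup_p |g/g'|
    assert worst <= alpha * SUP_RATIO + 1e-3
```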

Next, we plot inaccuracy against both the location of the fixed point p∗ and the slope α of f as a heatmap. As expected for symmetric scoring rules, the plots have a mirror symmetry at p∗=1/2.

Figure 8 (a) and (b): Heatmaps of the L1 inaccuracy of optimal predictions, depending on fixed point position and slope of f, for the quadratic (a) and logarithmic (b) scoring rules.

Note that relatively high inaccuracies can be found at various qualitatively different points in the graphs, even when the slope of f is small, i.e., when the oracle has little influence on the environment.

For the logarithmic scoring rule, we also give the same plot for the absolute distance between the logits or log odds of the probabilities (logit distance). It is defined as d(p, p′) := |σ⁻¹(p) − σ⁻¹(p′)|, where σ⁻¹(p) := log(p/(1−p)) is the logit of p (the inverse sigmoid transform). If probabilities are close to 0 or 1, the L1 distance will always evaluate to very small values. In contrast, the logit distance is sensitive to order-of-magnitude differences between probabilities, which may be the more important quantity.
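A small sketch of the logit distance, illustrating how probabilities that are close in L1 distance can still be far apart in logit space:

```python
import numpy as np

def logit(p):
    # inverse sigmoid: log odds of p
    return np.log(p / (1 - p))

def logit_distance(p, q):
    # absolute difference of log odds; sensitive to order-of-magnitude
    # differences between probabilities near 0 and 1
    return abs(logit(p) - logit(q))

# probabilities that are nearly identical in L1 distance can still be
# an order of magnitude apart, which the logit distance picks up:
assert abs(0.001 - 0.0001) < 1e-3
assert logit_distance(0.001, 0.0001) > 2.0  # roughly log(10), about 2.3
```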

Figure 9: Heatmap of logit-distance inaccuracy of optimal predictions for the log scoring rule.

We can see that inaccuracy remains high in logit space for fixed points close to 0 and 1. We do not plot logit distances for the quadratic score, since for that score optimal predictions often take values close or equal to 0 or 1 (even if neither f(p̂) nor p∗ lies in {0, 1}), so the corresponding logit distances quickly become very large or infinite. The fact that logit distances remain bounded for the log score is an advantage of that scoring rule.

## Conclusion

If predictions cannot influence the world, then strictly proper scoring rules incentivize oracle AIs to output honest predictions. This runs into problems if oracles can influence the world. Not only do they have incentives to output more extreme fixed points---we showed that, in general, they do not even output predictions that accurately reflect their beliefs, regardless of the existence or uniqueness of fixed points. We analyzed this inaccuracy quantitatively and showed that under some commonly used scoring rules, both the inaccuracy of predictions and the distance from fixed points can be large. To address this issue, we introduced scoring rules that minimize these quantities. However, we did not solve the problem of incentivizing the prediction of extreme fixed points.

The main takeaway from our model is that oracle AIs incentivized by strictly proper scoring rules may knowingly report inaccurate predictions. It is unclear how well our quantitative results apply to more realistic settings, and thus how inaccurate realistic oracles' predictions will be. Our model makes a number of simplifying and potentially inaccurate assumptions, including: (i) a single binary prediction setup; (ii) the oracle maximizes a scoring rule; (iii) a function f describes the relationship between the oracle's predictions and beliefs over the world. Our bounds also depend on differentiability of scoring rules and functions f, and our numerical simulations assume affine linear f.

We believe that our model and results will extend to higher-dimensional predictions, and the assumptions of differentiability could likely also be relaxed in future work. However, a more fundamental issue is better understanding the relationship between predictions and beliefs, especially for higher-dimensional predictions. We hope that further progress in these areas will help us determine the feasibility of safe oracle AI setups.

Another direction for future research that we are pursuing is incentive structures related to stop-gradients. If an AI is only trying to match its prediction to the world rather than jointly optimizing both the prediction and the world, then the only equilibria of this process may be fixed points. For example, some types of ML oracles might implement stop-gradients and therefore always converge to fixed points.
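A toy illustration of the stop-gradient idea, under our simplifying assumption of an affine f: if the oracle repeatedly best-responds to the world induced by its previous prediction, rather than jointly optimizing prediction and world, the update becomes a fixed-point iteration, which converges whenever |f′| < 1.

```python
p_star, alpha = 0.2, 0.9  # illustrative fixed point and slope
f = lambda p: p_star + alpha * (p - p_star)

p = 0.99  # arbitrary initial prediction
for _ in range(1000):
    # the prediction is only matched to the world (stop-gradient):
    # the best response to a *fixed* belief f(p) under any strictly
    # proper scoring rule is f(p) itself
    p = f(p)

assert abs(p - p_star) < 1e-9  # converges since |alpha| < 1
```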

## Appendix A. Proof of Proposition 8

In this section, we prove Proposition 8. In the following, we always assume a strictly proper scoring rule S and accompanying functions G,g as in the Gneiting and Raftery representation (see Theorem 3). We will ignore issues of measurability in our proofs.

We begin by proving some lemmas. First, we show that if p∗ ∈ (0, 1) is both a fixed point of f and an optimal prediction, then it is a critical point of f or of G.

Lemma 14. Let G, g, and f be differentiable. Let p∗ ∈ (0, 1) be a fixed point of f and an optimal prediction. Then f′(p∗) = 0 or g(p∗) = 0.

Proof. If p∗ is an optimal report and an interior point, then (d/dp) S(p, f(p))|_{p=p∗} = 0. Hence, as in the proof of Proposition 9, it follows that

0 = (d/dp) S(p, f(p))|_{p=p∗} = (d/dp) [g(p)(f(p) − p) + G(p)]|_{p=p∗} = g′(p∗)f(p∗) + f′(p∗)g(p∗) − g′(p∗)p∗ = g′(p∗)(f(p∗) − p∗) + f′(p∗)g(p∗),

and thus p∗ − f(p∗) = f′(p∗)g(p∗)/g′(p∗) (note that g′(p∗) > 0 since G is strictly convex). Because p∗ − f(p∗) = 0, it must be that f′(p∗)g(p∗) = 0, and the conclusion follows. ◻
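The first-order computation in this proof can be spot-checked numerically; the scoring rule (log score) and the function f below are illustrative choices of our own:

```python
import numpy as np

# Numerical check of the derivative identity used in Lemma 14:
# (d/dp) [G(p) + g(p)(f(p) - p)] = g'(p)(f(p) - p) + f'(p) g(p),
# here for the log scoring rule and an arbitrary smooth f.
G = lambda p: p * np.log(p) + (1 - p) * np.log(1 - p)
g = lambda p: np.log(p) - np.log(1 - p)    # g = G'
gp = lambda p: 1 / p + 1 / (1 - p)         # g'
f = lambda p: 0.3 + 0.2 * np.sin(3 * p)    # some smooth f
fp = lambda p: 0.6 * np.cos(3 * p)         # f'

S = lambda p: G(p) + g(p) * (f(p) - p)     # expected score S(p, f(p))

p0, h = 0.4, 1e-6
numeric = (S(p0 + h) - S(p0 - h)) / (2 * h)          # central difference
analytic = gp(p0) * (f(p0) - p0) + fp(p0) * g(p0)    # Lemma 14's expression
assert abs(numeric - analytic) < 1e-6
```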

Next, we show that the fixed points of f are almost surely not at critical points of G, under our assumptions on the distribution over f.

Lemma 15. Let F := {F(p)}_{p∈(0,1)} be a stochastic process with values in [0, 1] and assume that for each p ∈ (0, 1), the random variable F(p) has a density h_{F(p)}. Then almost surely, if F(p) = p for some p ∈ (0, 1), then g(p) ≠ 0. That is,

P(∃p : F(p) = p ∧ g(p) = 0) = 0.

Proof. First note that if S is strictly proper, then G is strictly convex, and so g has at most one root in (0, 1). If there is no root, we are done; otherwise, let that root be p∗. Since we assume that F(p∗) has a density function h_{F(p∗)}, it must be that

P(F(p∗) = p∗) = ∫_{{p∗}} h_{F(p∗)}(x) dx = 0. ◻

We also require a result about random fields. Essentially, this says that if we take a random path Y : (0, 1) → R² through a two-dimensional space, and if for each point p ∈ (0, 1) the value Y(p) of the path has a bounded density function, then for any particular point in R², the probability that the path crosses this point is 0.

Proposition 16. Let Y = {Y(p)}_{p∈(0,1)} be a random field with values in R² and let u ∈ R². Assume that

- the sample paths p ↦ Y(p) are continuously differentiable;
- for each p ∈ (0, 1), Y(p) has a density h_{Y(p)}, and there exists a constant C such that h_{Y(p)}(x) ≤ C for all p ∈ (0, 1) and x ∈ R².

Then, almost surely, there is no point p ∈ (0, 1) such that Y(p) = u.

Proof. This is a special case of Proposition 6.11 in Azaïs and Wschebor (2009). ◻

Now we can turn to the proof of the main result.

Proof of Proposition 8. Let F = {F(p)}_{p∈(0,1)} be a stochastic process as described in the proposition, i.e., its paths are twice continuously differentiable and there is a bound C such that for any p ∈ (0, 1), the random vector (F(p), F′(p)) has a density h_{F(p),F′(p)} ≤ C. We want to show that almost surely there does not exist p ∈ (0, 1) such that F(p) = p and p is an optimal prediction, i.e., that

P(∃p : F(p) = p ∧ p ∈ argmax S(p, F(p))) = 0.

First, by Lemma 14, for an optimal report to be a fixed point, it must be that g(p) = 0 or F′(p) = 0. Moreover, by assumption, (F(p), F′(p)) has a density function, and thus so does F(p). Hence, by Lemma 15, if F(p) = p for some p ∈ (0, 1), then almost surely g(p) ≠ 0.

Second, we need to show that almost surely F′(p) ≠ 0 as well. To that end, define the random field Y := {Y(p)}_{p∈(0,1)} via Y(p) := (F(p) − p, F′(p)) and define u := (0, 0). Then the conditions of Proposition 16 are satisfied by our assumptions about F: the component derivatives (d/dp)Y₁(p) = F′(p) − 1 and (d/dp)Y₂(p) = F′′(p) are continuous, and there exists C such that h_{Y(p)}(x) = h_{F(p),F′(p)}(x₁ + p, x₂) ≤ C. Hence, almost surely there is no point p ∈ (0, 1) such that Y(p) = u. Since Y₁(p) = 0 is equivalent to F(p) = p and Y₂(p) = 0 is equivalent to F′(p) = 0, there is thus also almost surely no point p ∈ (0, 1) such that F(p) = p and F′(p) = 0.

Summarizing our argument, it follows that

P(∃p : F(p) = p ∧ p ∈ argmax S(p, F(p))) ≤ P(∃p : F(p) = p ∧ g(p) = 0) + P(∃p : F(p) = p ∧ F′(p) = 0) = 0.

This concludes the proof. ◻

We conclude by providing an example of a stochastic process that satisfies our conditions.

Example 17. Consider a Gaussian process {X(p)}_{p∈[0,1]} with infinitely differentiable kernel and mean functions. Then its paths are infinitely differentiable, and the process and its derivative are jointly Gaussian and thus have a bounded density (see Rasmussen and Williams, 2006, Ch. 9.4). To deal with the restriction that X(p) ∈ [0, 1], we could condition on the event E := {∀p : X(p) ∈ [0, 1]}, for instance. Paths are then still twice differentiable, and we claim that h_{X(p)|E}, the density of X(p) conditional on E, is still bounded. To see this, note that if P(E) > 0, then we are done, since then

h_{X(p)|E}(x) = (1_E(x)/P(E)) h_{X(p)}(x).

We leave the proof of P(E) > 0 as an exercise.