Caspar Oesterheld

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger's and Leo Gao's mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.

Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.

Introduction

One issue with oracle AIs is that they might be able to influence the world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy or sell stocks, and thus influence the outcome of its prediction. In such a situation, there is not one fixed ground truth distribution against which the AI's predictions may be evaluated. Instead, the chosen prediction can influence what the model believes about the world. We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model's beliefs about the world, after the model makes that prediction.

If an AI has a fixed belief about the world, then optimizing a strictly proper scoring rule incentivizes it to output this belief (assuming the AI is inner aligned to this objective). In contrast, if the AI can influence the world with its predictions, this opens up the possibility for it to manipulate the world to receive a higher score. For instance, if the AI optimizes the world to make it more predictable, this would be dangerous, since the most predictable worlds are lower-entropy ones in which humans are more likely dead or controlled by a misaligned AI. Optimizing in the other direction and making the world as unpredictable as possible would presumably also not be desirable.

If, instead, the AI selects one fixed point (of potentially many) at random, this would still involve some non-aligned optimization to find a fixed point, but...

Dec 16, 2022 · 80
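The claim in the introduction above can be illustrated with a small numerical sketch. It assumes a binary outcome, the quadratic (Brier) scoring rule as one example of a strictly proper scoring rule, and a made-up linear response function `belief_after_prediction` describing how the belief shifts with the prediction. The function names and the coefficients 0.2 and 0.7 are hypothetical and do not come from the post.

```python
def brier_expected_score(p, q):
    """Expected quadratic score (higher is better) of predicting p
    when the outcome is 1 with probability q."""
    return -(q * (1 - p) ** 2 + (1 - q) * p ** 2)


def belief_after_prediction(p):
    """Hypothetical response of the world to the prediction p.
    Assumption for illustration only: a linear map with fixed point p = 2/3."""
    return 0.2 + 0.7 * p


# Candidate predictions on a grid over [0, 1].
grid = [i / 1000 for i in range(1001)]

# Case 1: the belief is fixed at q = 0.3, independent of the prediction.
# A strictly proper scoring rule is then maximized by reporting q itself.
fixed_q = 0.3
best_fixed = max(grid, key=lambda p: brier_expected_score(p, fixed_q))

# Case 2: the belief depends on the prediction, so the predictor
# effectively maximizes the score of p against belief_after_prediction(p).
def performative_score(p):
    return brier_expected_score(p, belief_after_prediction(p))

best_performative = max(grid, key=performative_score)

# The unique fixed point of the hypothetical response function.
fixed_point = 0.2 / 0.3

print("optimal prediction, fixed belief:      ", best_fixed)
print("optimal prediction, responsive belief: ", best_performative)
print("belief after that prediction:          ", belief_after_prediction(best_performative))
print("score at optimum vs. at fixed point:   ",
      performative_score(best_performative), performative_score(fixed_point))
```

Under these assumptions the score-maximizing prediction in the second case sits at the extreme end of the grid, where the induced belief differs from the prediction, and it scores higher than the fixed point does. Other response functions can behave differently; this is only one instance of the phenomenon the title describes.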

Academic website: https://www.andrew.cmu.edu/user/coesterh/

Blog: https://casparoesterheld.com/

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

I’ve spent a lot of the last few years working on issues related to acausal cooperation. With LLMs being clearly dominant over recent years, I’ve now led a team to make a benchmark to figure out how good LLMs are at decision theory and whether and when they lean more...

Dec 16, 2024 · 50

Stop-gradients lead to fixed point predictions

Johannes Treutlein and Rubi Hudson worked on this post as part of SERI MATS, under the mentorship of Evan Hubinger. Rubi has also received mentorship from Leo Gao. We thank Erik Jenner for helpful discussions and Alexander Pan for bringing the performative prediction literature to our attention. Update 30 May...

Jan 28, 2023 · 37

Proper scoring rules don’t guarantee predicting fixed points

Dec 16, 2022 · 80

Extracting Money from Causal Decision Theorists

My paper with my Ph.D. advisor Vince Conitzer titled "Extracting Money from Causal Decision Theorists" has been formally published (Open Access) in The Philosophical Quarterly. Probably many of you have seen either earlier drafts of this paper or similar arguments that others have independently given on this forum (e.g., Stuart...

Jan 28, 2021 · 27

Naturalized induction – a challenge for evidential and causal decision theory

As some of you may know, I disagree with many of the criticisms leveled against evidential decision theory (EDT). Most notably, I believe that Smoking lesion-type problems don't refute EDT. I also don't think that EDT's non-updatelessness leaves a lot of room for disagreement, given that EDT recommends immediate self-modification...

Sep 22, 2017 · 15
