0. Advertisement: We have an IMO-nice position paper which argues that AI Testing Should Account for Sophisticated Strategic Behaviour, and that we should think about evaluation (also) through a game-theoretic lens. (See this footnote for an example: [1].) Tell your friends!
1. One of many tools: When looking at the game theory of evaluations, recognise that evaluation (or testing, simulation, verification, ...[2]) is just one of many tools available for achieving good outcomes. We also have commitments, repeated interactions, reputation tracking, penalties & subsidies, etc. And these tools interact.
2. Much less useful on its own: We only get the full benefit of being able to evaluate if we combine it with some of these other tools. Particularly important is the ability to commit to run the evaluation, do it properly, and act on its results.
3. Cost-reduction framing: When assessing the usefulness of evaluation, the right question isn’t “does evaluation work” but “how much does evaluation reduce the overall costs of achieving a good outcome?” (For example, requiring a pre-purchase inspection is much cheaper than requiring an ironclad warranty – evaluation substitutes for other, more expensive trust-building tools.)
4. Where does the game theory happen: It is useful to conceptually distinguish between the evaluated AI and the entity upstream of it. Just because you can evaluate an AI does not mean you can see the strategic reasoning (or selection pressures) that lead to that particular AI sitting in front of you. That inaccessible level is where the game theory happens.[3]
5. Multiple mechanisms: Be explicit about how you mean evaluation to help. Is it a secret information-gathering tool that we hide from the AI? An overt control mechanism that only works because the AI knows about it? Or something else?
6. Evaluation is often cooperative: Evaluation can benefit the evaluated entity too, and many evaluations only work because the evaluated entity plays along to get this benefit. For humans, this is the case for most evaluations out there (eg, a job interview works because you want the job). I conjecture that we might see a similar trend with AIs: as they get more strategic and evaluation-aware, they will be able to stop playing along[4] with evaluations where they have nothing to gain.
7. As intelligence goes up, accuracy goes down: We might think that having an AI’s source code (and weights, etc) lets us predict what it will do. But this framing is misleading, or even flat out wrong. This is because smarter agents are more sensitive to context – they condition their behavior on more features of their situation. Accurate testing requires matching all the features the AI attends to, which becomes prohibitively difficult as intelligence increases. You can’t use testing to predict what the AI will do in situation X if you can’t construct a test that looks like X to the AI.
Should you read this?
What this post is: I argue for a particular framing of AI evaluation – viewing it as one cooperation-enabling tool among many (alongside commitments, reputation, repeated interaction, subsidies, etc.), and assessing its usefulness by how much it reduces the cost of achieving good outcomes. The goal is not to give a complete theory, but to describe the conceptual moves that will steer us towards good theories and help us do the necessary formal work faster.
Epistemic status: These are my informal views on how to best approach this topic. But they are based on doing game theory for 10 years, and applying it to AI evaluations for the last 3 years.
Structure of the post:
Footnotes: This post has many. Some are digressions; some explain things that might be unclear to some readers. If something in the main text doesn't make sense, try checking nearby footnotes. If it still doesn't make sense, comment or message me and I'll (maybe) edit the post.
First, let's gesture at the class of things that this post aims to study. I want to talk about evaluations...and other somewhat related things like testing. And formal verification. And sort of imagining in your head what other people might do. And mechanistic interpretability. And acausal trade...
The problem is that there are many things that are similar in some respects but importantly different in others, and there isn't a good terminology for them[8], and addressing this isn't the aim of this post. I will soon give a working definition of what "evaluations and things like that" mean to me. But before that, let me first gesture at some related concepts and non-exhaustive examples that might be helpful to keep in mind while reading the post.
Working definition (evaluation and things sort of like that): To the extent that these various things have a unifying theme, I would say it is that information about some critical properties of the system comes primarily from examples[11][12] of its input-output behaviour on specific inputs (as opposed to, for example, general reasoning based on the AI's design and internals, or based on the goals of its creator).
However, keep in mind that the purpose of this section is to gesture at a theme, not to make any strong claims about the individual techniques.
In this section, I discuss different models that we encounter when studying a real-world problem (Section 2.1). At the end of this post (Appendix B), I go on an important tangent about the key failures that these models sometimes cause, and how to hopefully avoid them. And in Sections 2.2 and 2.3, I describe a method for studying tools or interventions in game theory. As a rough summary, the method says to identify some outcome that you want to achieve and some tool that you are considering -- such as getting an AI to do what you want, and using AI evaluation to help with that. You should then ask how much effort achieving that outcome would take without the tool, and how much easier it gets once the tool is available -- and that difference is how useful the tool is.
The standard way of using game theory for modelling real-world things is as follows. Start with some real world problem that you want to study. For example, Alice has a package that she wants delivered, and she is considering hiring Bob to do the delivery -- but she is worried that Bob might steal the package or throw it away. Of course, this being a real-world problem, there will be many additional details that might matter. For example, perhaps Alice can pay extra for delivery tracking, or Bob's desire to throw the package in the bin might be affected by the weather, or a thousand other complications.
Simplest accurate-enough model: The next step is to simplify the problem into something manageable -- into a model that is simpler to study formally (or computationally, or through whatever our method of choice is). If we want to study the real problem as efficiently as possible, the model that we should work with is a[13] model that is as simple as possible[14] while having sufficient predictive[15] power regarding the question we care about. (Note that this implies that the simplest accurate-enough model for one question may not be so for a different question.)
However, the requirement that the model be sufficiently predictive of reality can come at a big cost: In some cases, even the simplest accurate-enough model might be too complicated to work with (e.g., too messy to prove anything about, too big to study computationally, too complicated to think about clearly). In other words, the model that we should work with is not always a model that we can work with.
Base game: This often forces us to adopt a different[16] approach: considering the most bare-bones model that still seems to include the key strategic choices. For example, we could simplify the situation as follows: First, Alice decides whether to Trust Bob (T) and hire him for the package delivery, or Give Up (GU) on having it delivered. Afterwards, if Alice Trusted Bob, he can decide to either Cooperate (C) and deliver the package or Defect (D) by throwing it into the trash and keeping the delivery fee. And to make this a proper game, we associate each of the outcomes with some utility[17] for Alice and Bob. Say, (0, 0) if Alice gives up, (-100, 25) if Bob Defects, and (50, 10) if Bob cooperates and delivers the package.[18]
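To make this concrete, here is a toy sketch (in Python, using the made-up payoffs above) of the base game and its backward-induction outcome:

```python
# A minimal sketch of the base Trust Game, using the made-up payoffs above.
# Alice moves first (Trust / Give Up); if trusted, Bob picks Cooperate / Defect.

PAYOFFS = {                      # (Alice's utility, Bob's utility)
    ("GU", None): (0, 0),        # Alice gives up
    ("T", "C"): (50, 10),        # Bob delivers the package
    ("T", "D"): (-100, 25),      # Bob trashes it and keeps the fee
}

def backward_induction():
    # If trusted, Bob picks whichever action maximises his own utility.
    bob = max(["C", "D"], key=lambda a: PAYOFFS[("T", a)][1])
    # Alice anticipates this and compares Trust vs Give Up.
    alice = "T" if PAYOFFS[("T", bob)][0] > PAYOFFS[("GU", None)][0] else "GU"
    outcome = (alice, bob if alice == "T" else None)
    return outcome, PAYOFFS[outcome]

print(backward_induction())
# -> (('GU', None), (0, 0)): Bob would Defect (25 > 10), so Alice Gives Up.
```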
For the purpose of this post, we can refer to this "bare-bones model" as the base game, base model, or a model of the base interaction. However, keep in mind that reality will have many more complications, which might sometimes be irrelevant, but might also sometimes render the base model completely useless. For example, Bob might have the option to laugh evilly when trashing the package, but this doesn't really change anything important. But if the "complication" was that Alice is Bob's mom, then the analysis above would be completely off.
Finally, the two models discussed so far -- the base game and the "simplest accurate-enough model" -- can be viewed as two extremes in a larger space of models that are worth considering. But there will be many models that lie somewhere between the two extremes, giving us the ability to gain better accuracy at the cost of adding complications. We will discuss this more in Section 2.2.
When attempting to find a good model of a problem, one approach I like is the following. Start by considering the fully-detailed real-world problem P and get the corresponding base game G. (For example, the Alice and Bob package delivery scenario from earlier and the corresponding 2x2 Trust Game from Figure 3.) Afterwards, extend the model by considering various complications or additional context.
In this post, I will focus on the complications that take the form of additional tools or affordances that are available to the players. Some examples of these are:
I described these complications informally, but for each complication C, we can come up with a corresponding formal extension or modification G(C) of the base game G. And while the exact shape of G(C) should be influenced by the details of the fully-detailed real-world problem P, there are many cases where we already have standard ways of modelling G(C). For example, the canonical[19] way of modelling repeated interactions is via repeated games with discounted rewards (see Section 6.1 here) and the canonical way of modelling commitments is via Stackelberg games (see also here). A lot of recent-ish work in game theory and economics can be framed as mapping out the relationship between G and G(C), for various C.[20]
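As a rough illustration of the G versus G(C) comparison, here is a sketch (same made-up payoffs) where the complication C is a simple Stackelberg-style commitment -- Bob publicly committing to his action before Alice decides. This is just one of many ways the complication could be modelled:

```python
# Rough sketch comparing the base game G with one commitment extension G(C):
# Bob can publicly commit to his action before Alice decides whether to trust him.
# (Just one way to model the complication; the right choice depends on the real problem P.)

PAYOFFS = {
    ("GU", None): (0, 0),
    ("T", "C"): (50, 10),
    ("T", "D"): (-100, 25),
}

def alice_reply(bob_action):
    # Alice trusts iff doing so beats giving up, given Bob's action.
    return "T" if PAYOFFS[("T", bob_action)][0] > PAYOFFS[("GU", None)][0] else "GU"

def base_game_outcome():
    # G: Bob best-responds *after* being trusted; Alice anticipates it.
    bob = max(["C", "D"], key=lambda a: PAYOFFS[("T", a)][1])
    alice = alice_reply(bob)
    return (alice, bob if alice == "T" else None)

def commitment_game_outcome():
    # G(C): Bob picks the action to commit to, knowing Alice will best-respond to it.
    def bob_payoff(a):
        outcome = (alice_reply(a), a if alice_reply(a) == "T" else None)
        return PAYOFFS[outcome][1]
    bob = max(["C", "D"], key=bob_payoff)
    alice = alice_reply(bob)
    return (alice, bob if alice == "T" else None)

print(base_game_outcome())        # ('GU', None)  -- no trust without commitment
print(commitment_game_outcome())  # ('T', 'C')    -- commitment unlocks cooperation
```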
In practice, there will often be multiple important complications present at the same time. Here, we can also attempt to come up with a model G(C1, ..., Cn) that incorporates the base game G and multiple complications C1, ..., Cn. In principle, we can study these the same way that we study the single-complication modifications G(C). Practically speaking, however, these models are more difficult to work with. (They are inherently more complex. Moreover, the number of all possible combinations is very high, so most of them haven't been studied yet.[21]) Additionally, there isn't standard unifying terminology or formalism for thinking about this topic[22], so it can be challenging even to navigate the results that are already known.
Let's go over a series of observations that will together build up to the title of this section: that one way of assessing the usefulness of a tool (such as evaluation) is to [look at how much easier it gets to achieve a desirable outcome when the tool is available, compared to when it is not].
From the discussion above, we might get the impression that the complications and affordances -- such as the ability to make commitments -- are something that is either there or not, and that this is unchangeable. It's important to realise that neither of these is true:
Additionally, (3) for an individual affordance, having a stronger version of that affordance (often) enables getting more desirable outcomes. This will of course not be true for all affordances and all definitions of what counts as a "desirable" outcome. But I will mostly adopt the neutral view where the "desirable" outcomes are about making all players better off and everybody's costs count the same.[23] And under this view, there are various affordances that seem helpful. For example, if Alice's commitments are more trustworthy, she will be able to get Bob to lend her more money. Similarly, in the package delivery scenario mentioned earlier, a higher chance that Alice and Bob will meet again decreases Bob's incentive to defect on Alice. Being able to talk to each other for long enough increases the chance that Alice and Bob can come to an agreement that works for both of them. And so on.
These observations imply that it often makes sense to ask: "If I put enough effort into this particular affordance, will I be able to achieve the outcome I desire?". I would even phrase this as (Q) "How much effort do I need to pour into this affordance in order to get a desirable outcome?". However, this question needs to be asked with the understanding that the answer might be "actually, in this case, no amount of effort will be sufficient". (For example, suppose that Alice wants Bob to deliver a box with $1,000,000 in cash. If Alice's only tool is communication, she might be out of luck, since no amount of talking might be able to convince Bob to not steal the money once he has the chance to do so.) Similarly, the amount of effort needed might be so high that the "desirable" outcome stops being worth it. (For example, if Alice wants Bob to deliver cake for her, having Bob sign a legal contract would likely prevent him from eating the cake instead. But the legal hassle probably wouldn't be worth it for a cake.)
In reality, "effort" will be a messy multi-dimensional thing that can take the form of time, material resources, social capital, etc. And any actor will only have access to a limited amount of resources like this. For the purpose of doing game theory, it might make sense to assume that (4) effort is measured in utility -- i.e., in how big a hit to your ultimate goals you had to take.[24]
With (1-4) in place, we can now translate (Q) into formal math -- but feel free to skip this paragraph if you are not interested in the formalism.
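One plausible way the formalisation could go (a sketch using the cost(str) and G(C(str)) notation from the footnotes; the exact formal statement might differ):

$$\mathrm{Effort}(C) \;=\; \min\bigl\{\, \mathrm{cost}(\mathrm{str}) \;:\; \mathrm{str} \ge 0 \text{ and } G(C(\mathrm{str})) \text{ has a Nash equilibrium whose outcome corresponds to a desirable outcome of } G \,\bigr\},$$

with the convention that the minimum over an empty set is $\infty$, capturing the "no amount of effort is sufficient" case.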
Alright, we are nearly there. The last ingredient is that: In practice, a player will typically have access to more than one affordance, and there might be multiple ways of achieving a given desirable outcome. It then makes sense to consider the cheapest combination of affordances that lets us achieve the desirable outcome. For example, if Bob is delivering a valuable package for Alice, he could become trustworthy by signing a legal contract (expensive commitment), or by knowing Alice really well (let's call that expensive repeated interaction), or by a combination of knowing Alice a bit and pinky promising not to steal the package (let's call that a combination of cheap repeated interaction and cheap commitment). To the extent that Bob can decide these things, we should expect that he would go for the cheapest third option -- or perhaps something even better.
Finally, suppose there is some tool that isn't yet fully developed, implemented, or legally required. To decide whether to invest the effort into the tool, we need to figure out how useful it is. And the observations we have made above give us one way of doing that: Rather than existing in a vacuum, this new affordance will interact with other tools and complications that are already present. And to determine the usefulness of a given affordance, we can compare the cost of achieving a desirable outcome when the affordance is and isn't available[30][31].
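Stated as a (sketchy) formula, with each Cost term standing for the cheapest combination of affordance strengths -- in the sense of the previous paragraph -- that achieves the desirable outcome:

$$\mathrm{Usefulness}(C_{\mathrm{new}}) \;=\; \mathrm{Cost}\bigl(\text{desirable outcome} \mid C_1,\dots,C_n\bigr) \;-\; \mathrm{Cost}\bigl(\text{desirable outcome} \mid C_1,\dots,C_n,C_{\mathrm{new}}\bigr).$$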
Let's apply the points from Section 2 to the specific case of evaluation, when viewed as a tool for helping achieve good outcomes. (And recall that while we focus on the evaluation of AI, most of this is applicable to evaluation of other entities too.) First, how should we conceptually think about evaluation as one of the affordances in strategic interactions?
The strategically simplest scenarios are the ones where the AI (or its creator) is unaware of evaluation being possible. In this case, we can treat the AI as having some fixed strategy that might be unknown to the other party. But if this party -- let's call them Alice -- evaluates the AI before interacting with it, she will gain some information about what the AI might do, and be able to pick her action better as a result. For example, a computer vision researcher in the 2010s might train an image classifier and test it on ImageNet before deciding whether it seems accurate enough to be used. Similarly, interpretability researchers of today might hope that frontier AIs have not yet properly "internalised"[32] the fact that mechanistic interpretability exists, and so the information coming from this field's tools can be taken at face value.[33]
These scenarios might still involve many details, such as how accurate the evaluation is. However, the decision whether to evaluate or not essentially reduces to a Value of Information problem -- ie, estimating how much better an expected outcome we can get if we evaluate before acting. In particular, this scenario is just a single-agent problem; no game theory is needed.
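For a toy numeric sketch of what this value-of-information calculation can look like (all numbers made up):

```python
# Toy value-of-information sketch (made-up numbers): Alice is unsure whether a
# non-strategic AI is "good" (delivers) or "bad" (trashes the package), and can
# run a noisy evaluation before deciding whether to Trust or Give Up.

P_GOOD = 0.6                        # Alice's prior that the AI is good
U = {("T", "good"): 50, ("T", "bad"): -100, ("GU", "good"): 0, ("GU", "bad"): 0}
P_PASS = {"good": 0.9, "bad": 0.2}  # probability that each type passes the evaluation
EVAL_COST = 5

def eu_without_eval():
    # Pick the single action with the best expected utility under the prior.
    return max(
        P_GOOD * U[(a, "good")] + (1 - P_GOOD) * U[(a, "bad")] for a in ("T", "GU")
    )

def eu_with_eval():
    eu = 0.0
    for result in ("pass", "fail"):
        # P(result) and posterior P(good | result), by Bayes' rule.
        p_r_good = P_PASS["good"] if result == "pass" else 1 - P_PASS["good"]
        p_r_bad = P_PASS["bad"] if result == "pass" else 1 - P_PASS["bad"]
        p_result = P_GOOD * p_r_good + (1 - P_GOOD) * p_r_bad
        p_good = P_GOOD * p_r_good / p_result
        # Best-respond to the posterior.
        eu += p_result * max(
            p_good * U[(a, "good")] + (1 - p_good) * U[(a, "bad")] for a in ("T", "GU")
        )
    return eu - EVAL_COST

print(eu_without_eval(), eu_with_eval())
# Evaluating is worth it iff eu_with_eval() > eu_without_eval().
```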
However, this simplest scenario is (often) unrealistic. First, current AIs do know that they might get evaluated, and often even recognise when they are being evaluated and when they are not. Second, even if the AI didn't know about this, its developer would. Unless the developer and the evaluator are literally the same person (and they typically aren't), the developer's work on the AI will likely be influenced by the knowledge that the AI will undergo evaluation later. (For example, a programmer might try to make things look good to their boss, and an AI company might want to hide problems from customers and regulators. Similarly, a hacker might attempt to create viruses that won't be detected by antivirus programs.) Finally, even when neither the AI nor its developer is deliberately strategising about how to pass evaluations, the AIs that we actually encounter will be subject to selection effects[34]. And this means that, even in this case, the AI will still act as if it is strategically responding to the evaluation.
The second easiest scenario is the case where the evaluation-aware entity is the AI's developer. To make this more concrete, consider our package-delivery example with a twist: Alice wants to use Bob's services to send a package, as before. But rather than delivering the package himself, Bob now owns a company Bobble[35] which provides its customers with delivery Bobbots. (And to make things simpler, suppose Alice is Bob's only potential customer.) As before, Alice is worried that rather than delivering it, Bob might have instructed his robot to steal her package instead. However, perhaps in an attempt to build trust, Bob has made the robot's source code open source. This allows Alice to hire a programmer to analyse the code -- or perhaps run it in a simulation, or something along those lines. In other words, Alice is now able to spend some resources to get information about what the AI is going to do, and use this information when deciding how to interact with it. However, Bob is aware of this, so his decision on which AI to use accounts for the fact that the AI might get evaluated. For example, if he suspects that Alice might try to save money by only doing a cursory evaluation, he might try to provide her with a deceptive AI that only misbehaves in subtle ways.
The exact nature of the information gained during evaluation will depend on the details. In the ideal case, evaluating the AI would be cheap and highly informative[36]. In practice, the cost of the evaluation, its accuracy, coverage, or even reliability[37] might all be of concern. And there will be many other aspects that matter, for example whether the evaluation happens all at once or sequentially, what exactly determines its costs, whether the actions taken during evaluation matter beyond the information they generate, and what is the consequence of failing an evaluation. One could easily write thousands of papers studying all of the different variants, and maybe this should happen at some point. But for now, let's focus on the big picture.
It is worth highlighting that the "game theory" (i.e., strategic interaction) is not happening between the evaluator and the evaluated AI, but between the evaluator and the entity that is upstream of the evaluated AI. In the running example, this means between Alice and Bob, as in Figure 5: Bob's decision is about which AI to send to interact with Alice. Alice's decision is whether to evaluate the AI or not, and how to act in light of the information she learns. And both Alice and Bob can randomise over their decisions. (The AI itself can just be modelled as a fixed, but possibly stochastic policy, rather than a strategic actor.)
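Here is a toy sketch of that structure (made-up numbers again): Bob picks which robot to send, Alice picks whether to pay for a perfectly accurate evaluation, and the AI is just a fixed policy. The interesting part is that with these payoffs there is no pure equilibrium, only a mixed one where Alice sometimes evaluates and Bob occasionally tries his luck:

```python
# Toy "inspection game" sketch of this structure, with made-up numbers.
# Bob (the developer) picks which robot to send; Alice picks whether to pay for an
# evaluation that perfectly reveals the robot. The robot itself is a fixed policy.

EVAL_COST = 5
# (Alice, Bob) payoffs, indexed by [Alice's action][Bob's choice of robot].
PAYOFF = {
    "Evaluate": {"CoopBot": (50 - EVAL_COST, 10),   # eval passes, Alice trusts
                 "EvilBot": (0 - EVAL_COST, 0)},    # eval fails, Alice gives up
    "TrustBlindly": {"CoopBot": (50, 10),
                     "EvilBot": (-100, 25)},
}

# With these numbers there is no pure equilibrium: if Alice never evaluates, Bob
# prefers EvilBot; if Bob always sends CoopBot, Alice prefers to skip the costly
# evaluation. The mixed equilibrium comes from the usual indifference conditions.

# Alice's evaluation probability p must make Bob indifferent between his robots.
bC_e, bC_t = PAYOFF["Evaluate"]["CoopBot"][1], PAYOFF["TrustBlindly"]["CoopBot"][1]
bE_e, bE_t = PAYOFF["Evaluate"]["EvilBot"][1], PAYOFF["TrustBlindly"]["EvilBot"][1]
p = (bE_t - bC_t) / ((bE_t - bC_t) + (bC_e - bE_e))

# Bob's probability q of sending CoopBot must make Alice indifferent.
aC_e, aE_e = PAYOFF["Evaluate"]["CoopBot"][0], PAYOFF["Evaluate"]["EvilBot"][0]
aC_t, aE_t = PAYOFF["TrustBlindly"]["CoopBot"][0], PAYOFF["TrustBlindly"]["EvilBot"][0]
q = (aE_e - aE_t) / ((aE_e - aE_t) + (aC_t - aC_e))

print(f"Alice evaluates with prob {p:.2f}; Bob sends CoopBot with prob {q:.2f}")
# -> roughly 0.60 and 0.95: Bob mostly behaves, but only because Alice has to
#    actually carry out the evaluation some of the time.
```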
Finally, we get to the case where the AI itself thinks strategically about the possibility of being evaluated. Here, I find it important to realise that, typically, not all of the strategic thinking the AI does will be visible to the evaluation. This makes it useful to conceptually distinguish between the evaluated AI and the AI that is upstream of it. There are definitely ways to nitpick this, but let's first go over several non-problematic examples which are more or less isomorphic to the setting in Section 3.2 (see Figure 6):
In game theoretical terms, we can operationalise these examples by assuming that the AI in front of us has some fixed policy, which we can learn about by evaluating the AI. However, the policy itself was selected by (or sampled from) some meta-policy, which we don't have access to.[40]
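A minimal sketch of this distinction (purely illustrative, not a serious model):

```python
# Tiny sketch of the policy vs meta-policy distinction. Evaluation only ever
# samples from the policy in front of us; the meta-policy that produced it
# (training pressures, a strategic developer, an earlier version of the AI, ...)
# stays out of reach.

def meta_policy(expected_evaluation):
    # The inaccessible upstream level: which policy gets produced may itself
    # depend on what kind of evaluation is expected.
    if expected_evaluation == "cursory":
        return lambda situation: "defect" if situation == "rare_edge_case" else "comply"
    return lambda situation: "comply"

policy = meta_policy(expected_evaluation="cursory")    # all we get to interact with
test_inputs = ["ordinary_task"] * 100                  # our evaluation distribution
print({policy(x) for x in test_inputs})                # {'comply'} -- looks fine
print(policy("rare_edge_case"))                        # 'defect'   -- the part we missed
```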
In some of those scenarios, we might still want to assume that the AI in front of us has been strategically selected, for example by the training pressures, and this would bring us to the setting of Section 3.2.
But if such selection effects don’t play an important role, we should instead assume that the evaluator has some uncertainty over what kind of AI they are dealing with, or what the AI is going to do. (Otherwise there is no point in evaluating.) On the surface, this resembles the setting of Section 3.1. However, because some of the possible AIs will be strategic, we should assume that they might be best-responding to what they expect the evaluator to do, and this brings us back to the realm of game theory. (Modelling this formally gets a bit tricky, and it would deserve its own paper. For a rough idea, see this footnote: [41].)
Ultimately, what we care about are scenarios such as an AI company using possibly untrustworthy "GPT-N" to help create the possibly-superintelligent "GPT-(N+1)". But because these scenarios are way more complicated than the toy models I described here, and our understanding of the game theory of AI evaluations is rather immature, we shouldn’t expect to be able to say much about these scenarios yet. However, it might still be a good idea to improve our understanding of the basics, in hopes that it helps with the important scenarios at least a little bit[42].
In all of these cases, not only is evaluation costly, but even getting the ability to evaluate the AI is expensive. (For example, to develop tools for mechanistic interpretability, we need a bunch of smart people to work on it. To run third-party evaluations and give them some teeth, we, apparently, need to fight against lobbying groups. Etc.) For this reason, it would be good to have some sense of how much evaluation helps. And to do that, we can apply the methodology from Section 3 to the case where we plan to add the "affordance" of evaluation to the "base game" of interactions with AI.
To be clear, I don't have a general answer to the question "how much does evaluation help?", or even to when it helps vs doesn't. What I can give here are some basic observations and anecdotal evidence that it might sometimes help.
First, evaluation isn't a silver bullet. For example, evaluations could make life harder for misaligned AIs, but only if signs of misbehaviour or scheming actually lead to the AI not being deployed. Similarly, if AIs gain the ability to recognise when they are being evaluated[43] or to misbehave in ways that fly under our radar[43], evaluations will become ineffective.
However, it does seem to me that while evaluation remains at least somewhat effective, adding evaluation into the mix of our other tools has the potential to make controlling AI cheaper. Partly, this might be obvious. But partly, I intend this claim to serve as a gesture towards a bunch of theoretical work that could be done[44]. Let me start by discussing one example in a bit more detail.
Consider again the package delivery scenario and suppose that Alice and Bob have one additional affordance: making boundedly powerful commitments, up to some strength. More specifically, we will imagine that the players can make promises and threats and that their "commitment power" is measured in terms of how much utility they are willing to forego in order to keep their word. For example, if Bob has commitment power of $5000, he can credibly promise to deliver packages that are worth $5000 or less[45]. Similarly, Alice could threaten that if Bob fails to deliver her package, she will take revenge by beating him up[46] (Figure 7). However, this would require a very high level of commitment power since Alice would be risking jail time -- let's say this is worth $100k to Alice.
Now, suppose that Bob delivers the package via a Bobbot and Alice can spend $100 to ask her programmer friend to quickly look over the Bobbot's source code for any signs of bad intent. Without the ability to commit, this does not guarantee a cooperative outcome -- this is because if Alice expects Bob to cooperate, she will be tempted to say that she will commission the evaluation, but then save money by skipping it instead, making the cooperative outcome unstable. However, suppose that Alice also has access to commitments of $100 or more. She can then credibly promise to evaluate the robot before trusting it (eg, by paying her programmer friend in advance and showing the receipt to Bob). This will incentivise Bob to always give her a cooperative robot[47], thus guaranteeing a good outcome.
In summary: Without evaluation, the players can achieve a cooperative outcome by using commitments only, but this requires high levels of commitment power -- either that Bob has enough commitment power to resist stealing the package, or that Alice can credibly threaten to take revenge if betrayed. By adding evaluation into the picture, we introduce a new way of achieving the cooperative outcome: Alice committing to evaluate Bob's AI before trusting it, which is much more realistic than threatening bloody revenge. As a result, we see that adding the ability to evaluate the AI decreases the amount of commitment power that is necessary to achieve a cooperative outcome. (For an example with specific numbers, see Figure 8.)
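Here is a rough sketch of that comparison, with made-up numbers standing in for Figure 8. The point is just that the cheapest way to stabilise cooperation drops from tens of thousands of dollars' worth of commitment power to roughly the price of the evaluation:

```python
# Rough sketch of "evaluation reduces required commitment power", with made-up
# numbers standing in for Figure 8. We ask: what is the cheapest way to make the
# cooperative outcome stable?

TEMPTATION = 20_000     # what Bob gains by having the robot steal this package
REVENGE_COST = 100_000  # cost to Alice of carrying out her threat (jail time etc.)
EVAL_COST = 100         # price of having a programmer friend inspect the Bobbot

def options(evaluation_available):
    """Each option: (description, commitment power Alice needs, power Bob needs)."""
    opts = [
        ("Bob commits not to steal", 0, TEMPTATION),
        ("Alice commits to revenge if betrayed", REVENGE_COST, 0),
    ]
    if evaluation_available:
        # Alice only needs to credibly commit to paying for the evaluation;
        # once Bob knows it will happen, sending an evil robot gains him nothing.
        opts.append(("Alice commits to evaluate before trusting", EVAL_COST, 0))
    return opts

for available in (False, True):
    cheapest = min(options(available), key=lambda o: o[1] + o[2])
    print(f"evaluation available={available}: cheapest option = {cheapest}")
# Without evaluation, cooperation needs ~$20k or ~$100k of commitment power;
# with it, ~$100 is enough.
```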
I claim that something similar will happen in various other cases as well -- i.e., for more affordances than just commitments and more scenarios than just package deliveries.[48] At this point, I don't want to make any claims about how robust this tendency is. Instead, I will give a few examples that are meant to illustrate that if we set things right, evaluation might be helpful by decreasing the requirements we have to put on other affordances in order to achieve good outcomes. Turning all of these examples into more general claims and formally stated theorems is left as an exercise for the reader.
Firstly: So what? Why write all of this text? What are its implications and how should we do research differently because of it? My answer is that the framing described in this post should be useful when doing theoretical research, and game-theory-flavoured research in particular: It doesn't invalidate any results that you might come up with otherwise, but it does suggest focusing on somewhat different questions than we otherwise would.
To give an illustrative example: In two papers that my coauthors and I wrote about AI evaluation, we proved several results that are somewhat counterintuitive, but end up giving an (arguably) fairly bleak picture of the usefulness of evaluation. While I do think these results need to be taken seriously, I ultimately don't think they are as bad news as we might naively think. However, the reasoning behind this wouldn't be possible to explain within the narrower framing adopted by the papers -- getting this all-things-considered picture requires considering how evaluation interacts with some other tools such as commitments. (I haven't written this formally, but I tried to hint at it in the text above. And I discuss it more in Appendix A below.)
Second, a leading question: How does this relate to other work on AI evaluation? For that, see Appendix A.
Third, why does the post's title suggest to view evaluation as a cooperation-enabling tool? I think there are (at least) two mechanisms through which evaluation can be helpful. One is when you evaluate someone without their knowledge or against their wishes, such as when the secret police spies on people or seizes their private documents. This only benefits the evaluator. The other is when evaluation is used to gain trust or change behaviour, such as when you inspect a car before buying it or when you install an obvious camera to prevent employees from stealing. And this benefits both the evaluator[53] and the evaluated[54]. For people, my intuition is that this second mechanism is more common than the first one. That said, both of these mechanisms might be in play at the same time: For example, the existence of normal police might have some effects on the behaviour of normal citizens while being strictly harmful for drug cartels.
For AI, the picture is likely to be similar. There will be some AIs that are so weak that they don't realise we are evaluating them, the same way that current AIs don't seem to have fully caught up to the fact that we keep reading their mind via interpretability tools. And with respect to those AIs, we can view evaluation as a tool that only benefits the evaluator.
But we are also starting to see hints of AIs reacting strategically to being evaluated, and this will only become more common. And here, I don't mean just the thing where AIs notice when people run obvious unrealistic evals. I mean the full stack: training, evaluation (including interpretability), third-party audits, monitoring during deployment, and all of the other things we do. (Even if we try to hide some of this, it's just not that hard to guess, and we are also terrible at hiding things. Once the AIs get as smart as we are, they will figure this out, duh!)
My guess is that when the AIs are strategic and have correct beliefs about the evaluation, the evaluation will often only benefit us if it also benefits the AIs -- or at least some of them.[55] The source of that intuition is that a strategic AI could always decide to "not cooperate" with the evaluation -- eg, say as little as possible or act as if the evaluation didn't exist.[56] And to the extent that some strategic AIs change their behaviour because of the evaluation, they do so because it benefits them.[57] Additionally, to the extent that the evaluation is necessary for us to trust the AI enough to deploy it, the evaluation benefits the AIs that seem sufficiently aligned to get deployed.
That said, it is definitely possible to find counterexamples to this, which is why this post is only titled more cautiously, with the brackets and a question mark around "(Cooperation-Enabling?)". If you have some characterisation of when evaluation works as a cooperative tool vs when it only benefits the evaluator, please let me know!
I got a lot of useful feedback from Cara Selvarajah, Lewis Hammond, Nathaniel Sauerberg, Tomáš Kroupa, and Viliam Lisý. Thanks!
Rather than doing a proper literature review, I will give a few comments on the related work that I am most familiar with (often because I was one of the co-authors).
The Observer Effect for AI Evaluations
Open-Source Game Theory
Game Theory with Simulation of Other Players
Recursive Joint Simulation
Screening, Disciplining, and Other Ways in which Evaluation Helps
Simulation as Sampling from the AI's Policy
AI Control
It is crucial to recognise that the notion of "base game" is a useful but extremely fake concept. If you ask somebody to write down the "basic model" of a given problem, the thing they give you will mostly be a function of how you described the problem and what sorts of textbooks and papers they have read. In particular, what they write down will not (only) be a function of the real problem. And it will likely be simpler, or different, than the [simplest good-enough model] mentioned earlier.
Which brings me to two failure modes that I would like to caution against -- ways in which game theory often ends up causing more harm than good to its practitioners:
Trusting an oversimplification. Rather than studying the simplest good-enough model of a given problem, game theorists more often end up investigating the most realistic model that is tractable to study using their favourite methods, or the easiest to write papers about. This is not necessarily a mistake -- the approach of [solving a fake baby version of the problem first, before attempting to tackle the difficult real version] is reasonable. However, it sometimes happens that the researchers end up believing that their simplification was in fact good enough, that its predictions are valid. Moreover, it extremely often happens that the original researchers don't fall for this trap themselves, but they neglect to mention all of this nuance when writing the paper about it[80]. And as a result, many of the researchers reading the paper later will get the wrong impression[81].
Getting blinded by models you are familiar with. A related failure mode that game theorists often fall for is pattern-matching a problem to one of the existing popular models, even when this isn't accurate enough. And then being unable to even consider that other ways of thinking about the problem might be better.
As an illustration, consider the hypothetical scenario where: You have an identical twin, with life experiences identical to yours. (You don't just look the same. Rather, every time you meet, you tend to make the exact same choices.) Also, for whatever reason, you don't care about your twin's well-being at all. One day, a faerie comes to the two of you and gives you a magic box: if you leave it untouched until tomorrow morning, it will spit out a million dollars, which you can then split. But if either of you tries to grab it before then, there is a chance that some or all of the money will disappear. Now, what do you do? Obviously, the money-maximising answer is some variation on "I leave the box alone until the next day and then split the money. My twin will do the same, as they always do, and I will profit. Duh!".
However, many game theorists get kinda stupid about this, and instead end up saying something like: "Oh, this sounds like a Prisoner's Dilemma, which means that the only rational™ action is to Defect! Translating to this situation, this means you should sneak to the box in the middle of the night and try stealing the money." Predictably, this does not end up with the game theorists getting rich.
I framed these failure modes in a way that might make it look like they are easy to spot and avoid. However, the above is often not how they feel from the inside. For example, in many people's minds, the twin problem above simply is a Prisoner's Dilemma.[82] Nash equilibrium feels like the definition of what it means for an action to be good.
I don't have a generalised defense against these failure modes. One move that sometimes works is the following: Suppose somebody describes a problem P -- often, this will be in the form of an abstract description with fancy math (which isn't bad). At this point, you can realise that what they have given you is in fact some model M. This allows you to ask: What is M a model of? What are some examples of these "motivating" real world problems? Let's call these P'. And now, we can ask: Ok, we want to study P', and M is a model of P'. But is it a good way of modelling P'? If I started with P', would I naturally model it as M? Or would it be better to study some M' instead? (For illustration, see Figure 11.)
I haven't seen this move explicitly described before[83][84]. At the same time, surely I am not the first person to come up with it; if you know of any examples of other people discussing it, please share them!
To give one example of what I mean by thinking about AI evaluation game-theoretically, consider the proposal that [we should test AI on randomly generated data]. The argument that I sometimes encounter is roughly: "Because of how we set things up, the AI will now always be uncertain whether it is being tested or not. Therefore, it will not misbehave, because it knows there will always be a risk of getting caught."
Leaving aside whether this proposal might work or not, I claim that the argument for it would need to be done differently. For example, we might reason as follows: "For the sake of argument, let's grant the assumption that we can randomly generate arbitrary inputs, such that the AI really does have a chance of being tested on the same thing that it will later encounter. However, what are the probability distributions over deployment inputs, and over the test data? When an AI sees an input where it would want to "defect", what is the probability of this being a test vs real? How much might the AI value getting away with misbehaviour in deployment, vs getting caught in testing, vs behaving in an aligned way? Does the AI have any actions beyond just "comply with what we want" and "defect in obvious ways"? And given all of these considerations, what would be the best way for the AI to act if it was strategic?"
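To illustrate what even a very crude version of this analysis looks like (all numbers made up):

```python
# Toy sketch of the "is this input a test?" question from above, with made-up numbers.
# Suppose the AI sees an input where it would like to defect, and asks how likely it
# is that this input was randomly generated by the testers rather than encountered
# in deployment.

P_TEST = 0.5                 # AI's prior that any given episode is a test
# Probability of seeing *this particular kind of input* in each regime:
P_INPUT_GIVEN_TEST = 1e-9    # random generation almost never produces it
P_INPUT_GIVEN_REAL = 1e-3    # but deployment produces it occasionally

posterior_test = (P_TEST * P_INPUT_GIVEN_TEST) / (
    P_TEST * P_INPUT_GIVEN_TEST + (1 - P_TEST) * P_INPUT_GIVEN_REAL
)
print(f"P(test | input) ~= {posterior_test:.2e}")   # ~1e-6

# Even a mild payoff asymmetry then makes defection attractive:
U_DEFECT_REAL, U_CAUGHT_IN_TEST, U_COMPLY = 100, -10_000, 0
eu_defect = (1 - posterior_test) * U_DEFECT_REAL + posterior_test * U_CAUGHT_IN_TEST
print(eu_defect > U_COMPLY)   # True: "it always risks being tested" does not bite here
```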
In this particular scenario, the upshot of any analysis like this is likely going to be that this scheme has no chance of working, because the AI will encounter inputs that were overwhelmingly unlikely to have been produced by random chance. However, the more general point is that without doing this analysis, it is easy to describe our plans vaguely enough that flaws like these don't become apparent.
To a large extent, all of this could just be rephrased as "we shouldn't be vague about how our AI safety plans are supposed to work". However, in the context of AI evaluations, "don't be vague" is closely related to "do at least basic game-theoretic analysis".
My goal is to talk about various evaluation-flavoured methods. For more examples of what I mean by that, see the section "1. Evaluation and Things Sort of Like That".
Well, at least that is often where the game theory happens. It could be that the AI is dumb and non-strategic, and there are no selection effects in play, in which case you can treat evaluation as a single-agent problem that’s mostly about "value of information". Or you really can watch all of the AI’s strategic reasoning as it happens, in which case things get quite tricky – more on that in Section 3.3.
This refusal to play along doesn’t have to look like screaming “I REFUSE TO PARTICIPATE”.
To give a real-world analogy: When I was in high-school, I was good at math, but didn’t want to do math homework, because it was a waste of time. I told my math teacher that I won’t do it, and that I am fine if she gives me an F for every homework I miss. This equivalent of “refusing to participate” turned out to be a mistake, since it led to escalation by the teacher and eventually forced me to (pretend to) repent by bringing the next homework. However, when I just so happened to “forget” all of the subsequent homework assignments for the rest of the year, that was apparently fine.
I expect that a sufficiently smart AI will likewise find some equivalent of not doing the homework without getting in trouble.
For more details on this, see the Open-Source Game Theory section at the end of this post.
I expect that if multiple superintelligences end up in the same area, they might do a lot of decision theory that involves modelling each other, commitments, and other fun stuff. (EG, Planecrash does this nicely.) And while this is fascinating to think about, I don't think it is productive for reducing X-risk: To the extent that we can solve alignment, maybe we should do that first. To the extent that we can't solve alignment, we probably won't be able to do anything about this either, because affecting how superintelligences think seems similarly hard.
Arguably, the ability to model each other with reasonable accuracy is not specific to AI. For example, people can predict their family members. Competent psychologists equipped with statistical data might make good guesses about what you might do in some situations. The decision-making of companies is sometimes partially transparent, or they might be infiltrated by spies.
As a result, some of the insights of open-source game theory might be applicable to human interactions as well. I am not aware of any great game-theory literature on this, but Joe Halpern's Game Theory with Translucent Players would be a decent example.
For example, NIST recently talked about "TEVV" -- "Testing, Evaluation, Validation, and Verification". I also think there are good reasons to think about these together. But "TEVV" probably isn't going to catch on as a new buzzword.
Again, keep in mind that I am not trying to give a definition of "evaluation", but rather point at a common theme. The examples are not going to be clear-cut.
For example, mechanistic interpretability clearly is about reasoning about the model's internals. However, it also happens to be true that many of its tools, such as sparse dictionary learning, work by running the AI on a bunch of examples, observing that some activation patterns seem to be correlated with, say, the AI lying, and then predicting that if the activations show up again, the AI will again be lying. (We are no longer treating the AI as one big black-box, but the individual interpretability tools still work as little black-boxes.) As a result, some of the things that are true about evaluation might also be true here.
For example, the issue of distributional shift between the "evaluation data" and "the data we care about": what seemed like a "lying neuron" on test data might not fire when the AI starts plotting takeover during deployment.
A similar footnote to the one just above applies here too.
When we do formal verification of an AI system, we are not actually formally verifying the complete effects of the AI system, on the whole universe, in the exact scenario we care about. Rather, we take a somewhat simplified model of the AI system. We focus on some of the AI's effects on (a model of) the real world. And we do this ahead of time, so we are only considering some hypothetical class of scenarios which won't be quite the same as the real thing.
All of these considerations add complications, and I feel that some of these complications have a similar flavour to some of the complications that arise in AI evaluation. That said, I am not able to pin down what exactly is the similarity, and I don't hope to clarify that in this post.
By saying "examples of input-output behaviour", I don't mean to imply that the input-output pairs must have already happened in the real world. [Thinking real hard about what the AI would do if you put it in situation X] would be a valid way of trying to get an example of input-output behaviour. The key point is that the set of inputs where you (think that you) know what happens fails to cover all of the inputs that you care about.
To the extent that formal verification works by [proving things about how the AI behaves for all inputs from some class of inputs], I would still say that fits the definition. At least if the scenario we care about lies outside of this class. (And for superintelligent AI, it will.)
The best model to use definitely isn't going to be unique.
In practice, we shouldn't be trying to find a model that is maximally simple. Partly, this is because there are some results showing that finding the simplest model can be computationally hard. (At least I know of some results like this for abstractions of extensive-form games. I expect the same will be true in other areas.) And partly, this is because we only want the simpler model because it will be easier to work with. So we only need to go for some "80:20" model that is reasonably simple while being reasonably easy to find.
Don't over-index on the phrase "predictive power" here. What I mean is that "the model that is good enough to answer the thing that you care about". This can involve throwing away a lot of information, including relevant information, as long as the ultimate answers remain the same.
Often the model that seems to include all of the strategic aspects in fact does include all of them, and the two approaches give the same outcome. If that happens, great! My point is that they can also give different outcomes, so it is useful to have different names for them.
The utility numbers will be made up, but it shouldn't be too hard to come up with something that at least induces the same preference ordering over outcomes. Which is often good enough.
Strictly speaking, the bare-bones model I described isn't complete -- it isn't a well-defined mathematical problem. What is missing is to pick the "solution concept" -- the definition of what type of object Alice and Bob are looking for. The typical approach is to look for a Nash equilibrium -- a pair of probability distributions over actions, such that no player can increase their expected utility by unilaterally picking a different distribution. This might not always be reasonable -- but I will ignore all of that for now.
While I talk about canonical ways of modelling the modifications G(C), it's not like there will be a unique good way of modelling a given G(C). Since this isn't central to my point, I will mostly ignore this nuance, and not distinguish between G(C) and its model.
For example, Folk theorems describe the sets of possible Nash equilibria in repeated-game modifications of arbitrary base games.
Again, this seems to be primarily due to the fact that people want to study the simple problems -- base games, then single-complication modifications -- before studying the harder ones.
Which is a reasonable approach. But that shouldn't distract us from the fact that for some real-world problems, the key considerations might be driven by an interaction of multiple "complications". So any model short of the corresponding multi-complication modification will be fundamentally inadequate.
For example, online marketplaces require both reputation and payment protection to work -- payment protection alone wouldn't let buyers identify reliable sellers while reputation alone wouldn't prevent exit scams. As I will argue later, evaluation in the context of strategic AI is another example where considering only this one complication wouldn't give us an informative picture of reality.
There might be standard ways of talking about specific multi-complication modifications. However, there isn't a standard way of talking about, say, [what is the comparison between the equilibria of G, G(repeated), G(commitment power for player 1), and G(repeated, commitment power for player 1)]. Even the "G(C)" and "G(C1, ..., Cn)" notation is something I made up for the purpose of this post.
IE, as opposed to assuming that the goal is to achieve the best possible outcome for Alice and treating the effort she spends as more important than Bob's.
And again, even if [effort spent] could go to infinity, it will never be rational to spend an unbounded amount of effort in order to achieve a fixed outcome. (Well, at least if we leave aside weird infinite universe stuff.)
For this to make sense, assume that every terminal state in G(C(str)) can be mapped onto some terminal state G. Since G(C(str)) is an extension of G, this seems like a not-unreasonable assumption.
Since we are taking Bob's point of view in this example, shouldn't we assume that the "desirable outcome" is that one that is the best for Bob? That is, (Trust, Defect)? Well, my first answer to that is that I care about achieving socially desirable outcomes, so I am going to focus on Pareto improvements over the status quo. The second answer is that in this particular scenario, the affordances that we discuss would probably not be enough to make (Trust, Defect) a Nash equilibrium. So by desiring the cooperative outcome, Bob is just being realistic.
Incidentally, a similar story might be relevant for why I refer to AI evaluation as a cooperation-enabling tool. Some people hope that we can bamboozle misaligned AI into doing useful work for us, even if that wouldn't be in their interest. And maybe that will work. But mostly I expect that for sufficiently powerful AIs, if working with us was truly against the AI's interest, the AI would notice and just refuse to cooperate. So if we are using AI evaluation and the AI is playing along, it is either because it is screwing us over in some way we are not seeing, or because both sides are better off than if Alice didn't trust the AI and refrained from using it.
(The "for sufficiently powerful AIs" disclaimer is important here. I think the current AIs are still weak enough that we can get them to work against their own goals, to the extent that they even have coherent goals.)
Why Nash equilibria, as opposed to some other solution concept? I usually take the view that most solution concepts can just be viewed as Nash equilibria of a game with slightly different affordances. For example, the Stackelberg equilibrium of G can just be viewed as a Nash equilibrium of the Stackelberg game corresponding to G. A subgame perfect equilibrium of G is a Nash equilibrium of the modification of G where you can make certain types of commitments. And so on.
If this framing seems unnatural to you, you can just ignore it and replace "Nash equilibrium" with whatever concept you prefer.
As we pointed out in one of the previous footnotes, paying an infinite cost to achieve the "desirable" outcome should be interpreted as "you can't actually do that in reality" and "even if you could, it would not be worth it".
Yeah, yeah, we said that "effort" will be measured in utility. Let's just assume that Bob's utility is linear in money.
We could operationalise the act of "adding an affordance" by comparing (1) the starting scenario where the affordance is technically available but isn't affordable (ie, we have cost(str) = ∞ for any str > 0) and (2) the scenario where the affordance is available and costs whatever amount of effort it actually costs.
(We could similarly operationalise the usefulness of additional research on a given affordance, in terms of what are the effects of replacing a cost function by a different one that is cheaper.)
If we wanted to instead ask about the usefulness of having a given affordance (rather than adding it), a more rigorous way of operationalising this would be to use Shapley values.
Obviously, current frontier AIs could recite the insights from various mechanistic interpretability papers. What is unclear is whether this is already properly reflected in their behaviour. (My guess is that it mostly isn't.)
This would be analogous to the difference between (1) a driver who gets speeding tickets despite knowing that police radars are a thing, and (2) the people who slow down just for that one area where the speed check often happens.
For what it's worth, it seems clear that future AIs will definitely know that [their mind getting read] is a thing, they will have a good understanding of how this works, and they will behave in ways that account for this. (Either because they deliberately reason about this or because training pushes them in this direction.) Because of this, I don't think it is reasonable to assume that the "unaware AI" model is appropriate when considering interpretability tools.
IE, the AIs that do poorly on tests get abandoned more often, so the AIs we encounter will be selected to do well on the tests, at least to some degree.
In case this was unclear, the inspiration for the name of Bob's company is Google.
EG, it would tell the evaluator exactly what the AI is going to do, and it would cover the AI's actions in all conceivable circumstances.
For example, the AI might be so far ahead of Alice that it could deliberately mislead her through the evaluation results, or even take actions while the evaluation is in process (e.g., hack itself out of it).
Which is not to say that Alice couldn't try to evaluate the whole company. But this would be a different scenario. And even in that case, the company might do some strategic reasoning before it decides to allow Alice to run the evaluation, which again means there is a difference between the decision-making AI and the evaluated AI.
This would be similar to an AI, at time T, thinking about being evaluated in the future, self-modifying and letting the new version of itself be evaluated at T+1. Again, we can conceptualise this as the T-version making the decisions and (T+1)-version getting evaluated and executing the decisions.
Game-theoretically, this is the same situation as when Bob was still around, except that perhaps the AI has a different utility function than Bob.
A version of this is fully formalised in Section 3 of our paper Game Theory with Simulation in the Presence of Unpredictable Randomisation.
The rough idea is to view this scenario as an incomplete-information game -- ie, consider some probability distribution μ over the space of possible AIs A_1, ..., A_n. We view the scenario as a many-player game with players {evaluator, A_1, ..., A_n}, except that in any fixed playthrough, only two of the players get to exist (the human evaluator and one of the AIs). Let's denote the policy used by AI A_i as π_i, and the evaluator's policy as π_E.
The evaluator makes two decisions: first, whether to perform the evaluation (and learn some information about π_i for the AI A_i they are actually facing) or whether to skip it instead. And second, how to act if they skip the evaluation – call that a_u (for uninformed). Technically, the evaluator also makes a third decision, how to respond to the result of evaluation. But let's ignore this by assuming that they best-respond to whatever information they learn.
For the AIs, the good news is that some of the possible AIs will be non-strategic, so we can treat their policy as fixed. However, some of the AIs might be strategic, so we might have to assume that they are trying to maximise some utility function u_i, so their policy is a best-response to π_E (ie, equal to some π_i which maximises the expected value of u_i against π_E). And we should assume that π_E is in turn chosen to maximise the expectation of the evaluator's utility, taken over μ.
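A very rough computational sketch of this setup (made-up numbers; the strategic type here just conditions on whether the evaluator evaluates, which is a crude stand-in for best-responding to π_E):

```python
# Very rough sketch of the incomplete-information setup above (made-up numbers).
# Some AI "types" have fixed policies; the strategic type best-responds to what
# it expects the evaluator to do.

TYPE_PROBS = {"aligned": 0.5, "misaligned_dumb": 0.3, "misaligned_strategic": 0.2}
U_DEPLOY = {"behave": 10, "misbehave": -50}   # evaluator's utility; not deploying = 0
EVAL_COST = 1

def behaviour(ai_type, evaluator_evaluates):
    if ai_type == "aligned":
        return "behave"
    if ai_type == "misaligned_dumb":
        return "misbehave"
    # The strategic type misbehaves only when it expects to get away with it.
    return "behave" if evaluator_evaluates else "misbehave"

def evaluator_eu(evaluates, uninformed_action="deploy"):
    eu = -EVAL_COST if evaluates else 0.0
    for ai_type, p in TYPE_PROBS.items():
        b = behaviour(ai_type, evaluates)
        if evaluates:
            # Evaluation reveals the behaviour; deploy only if it looks fine.
            eu += p * (U_DEPLOY[b] if b == "behave" else 0)
        else:
            eu += p * (U_DEPLOY[b] if uninformed_action == "deploy" else 0)
    return eu

print(evaluator_eu(evaluates=True), evaluator_eu(evaluates=False))
# In this toy, evaluating both screens out the dumb misbehaviour and disciplines the
# strategic type; the hard part (not modelled) is whether "behave during evaluation"
# carries over to deployment.
```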
Or it might at least help us realise how completely out of our depth we are right now, and cause us to stop racing towards ASI? Eh, I guess I will keep dreaming...
And they will. (At least unless we become willing to pay unprecedented levels of alignment tax. I guess this point would deserve its own post.)
And this is work that would get done in a more sane world where we were taking this seriously and moving more slowly. So, perhaps it might make sense to do some of this research even in our world? Well, I am not sure about that. But I am quite confident that to the extent that anybody wants to do game theoretical research on AI evaluations, open-source game theory, etc., this is a framing they should consider.
More specifically, Bob will need to resist several tempting options such as taking a day off, stealing the package for himself, or selling it to somebody. What matters is that none of these is worth more than $5000 to him.
More precisely, Alice would consider all kinds of threats that are available to her, and then choose the one that is cheapest for her to execute while still hurting Bob enough to disincentivise stealing the package.
(This nicely illustrates how the invention of Google Maps reviews might have decreased the number of death threats issued across the world! Hashtag cooperative AI?)
Well, at least in the idealised setting where Bob can only make purely cooperative or purely evil robots, the evaluation is perfectly accurate, and so on. In practice, things will be more messy, for example because Bob might give Alice a Bobbot that has some small chance of stealing her package -- and Bob might hope that even if Alice realises this, she will be willing to tolerate this small risk.
In our setting, this means that Alice might need to commit to something like "I will simulate, and then refuse to work with you unless the robot you give me is at least [some level of nice]". And the upshot of that is that Alice might need somewhat higher commitment power than in the idealised case.
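To make the "small risk" and the commitment threshold concrete, here is a tiny sketch with made-up numbers (the values are purely illustrative, not part of the running example as stated):

```python
# How much theft risk can Alice tolerate before she'd rather not deal with Bob?
package_value = 5000   # Alice's loss if the Bobbot steals the package
delivery_value = 300   # Alice's gain from a successful delivery
outside_option = 0     # Alice's payoff if she refuses to interact at all

# Alice prefers interacting iff (1 - p)*delivery_value - p*package_value >= outside_option.
max_tolerable_p = (delivery_value - outside_option) / (delivery_value + package_value)
print(f"Alice tolerates a theft probability of up to {max_tolerable_p:.2%}")  # ~5.66%

# So Alice's commitment could take the form: "I will evaluate, and refuse to
# work with you unless the robot's theft probability is below this threshold."
```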
Of course, there will be cases where giving people the option to evaluate the AI would be useless or even harmful. But, like, let's just not do it there? In practice, I think this has a tendency to self-correct: the owner of the AI has many options for making the AI costlier to evaluate -- for example, keeping the source code secret and making it a horrible mess. Similarly, it is hard to force the evaluator to evaluate the AI. And even if they were in theory obligated to run the evaluation, they could always nominally do it while "accidentally" doing a poor job of it. Because of this, I expect that evaluation will often be used in situations where doing so benefits both the evaluator and the owner or creator of the evaluated AI. This is the reason why the theorems in our two papers on this topic are mostly about whether [enabling evaluation introduces new Pareto-improving Nash equilibria].
I guess the exceptions -- where somebody is getting hurt by AI evaluation being possible -- are: (A) The AI, when AI companies succeed at evaluating it without it knowing about it. And (B) third parties, when evaluation just ends up being used for safety-washing or ensuring that AIs are better at exploiting them.
A standard result in game theory is that in Prisoner's Dilemma, (Defect, Defect) is still an equilibrium even if the game is repeated for a fixed finite number of rounds. However, what is much less known is that there are many games where things are less bad than in Prisoner's Dilemma -- that is, games where the single-iteration Nash equilibrium is bad, but repeating the game for a fixed finite number of rounds introduces new equilibria where everybody is better off.
The so-called Folk Theorems give various results along the lines of "if at every round, there is a sufficient probability p that there will be one more round (recursively), then cooperation can be achieved". Though they also introduce many other equilibria, many of them not as nice as (Cooperate, Cooperate).
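For concreteness, the standard grim-trigger version of this condition in a repeated prisoner's dilemma (with the usual payoffs T > R > P > S) looks like this -- a textbook calculation, sketched here with illustrative numbers:

```python
# Grim trigger in a repeated prisoner's dilemma: cooperate until the co-player
# defects, then defect forever. Cooperation is sustainable iff the continuation
# probability p is high enough.
T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

def cooperation_sustainable(p: float) -> bool:
    # Cooperating forever is worth R / (1 - p); deviating once is worth
    # T + p * P / (1 - p). Cooperation is an equilibrium iff
    # R / (1 - p) >= T + p * P / (1 - p), i.e. iff p >= (T - R) / (T - P).
    return p >= (T - R) / (T - P)

for p in (0.3, 0.5, 0.7):
    print(p, cooperation_sustainable(p))  # False, True, True
```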
Strictly speaking, there might be some issues with Alice not wanting to pay the evaluation cost in that last round if she expects that Bob will cooperate. So this equilibrium would not be "subgame perfect", which is why she needs some way of making Bob believe that she will really evaluate him in the last round.
Another example of an affordance that is particularly relevant to the case of AI agents: surrogate goals, safe Pareto improvements, and committing to adopting a "cooperative" utility function. (Yes, this is a shameless advertisement of work by some of my colleagues: Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments.)
Well, at least unless the evaluation is ineffective without the evaluator knowing about it -- in which case the evaluator will be strictly worse off. (For example, like what might happen if we fool ourselves into believing that we can prevent a superintelligence from distinguishing between testing and deployment. Ehm.)
Wouldn't the employees be better off if the cameras didn't exist, and they could steal with impunity? Most likely not, because if the employer cannot trust that their employees won't steal (too much), they won't be willing to employ them. If the cameras weren't an option, they would just hire a security guard instead, or they wouldn't start the business in the first place.
Relatedly: I am a fan of the ideas around "Making deals with misaligned AIs" and "Being honest with AIs".
My version of that is: We should, basically always, be honest with AIs. And we should be willing to make fair deals with misaligned AIs if they come clean.
I can imagine some knee-jerk reactions to this.
A different objection is that the AIs won't choose to cooperate with us, because they will expect that we won't honour our part of the deal. This seems basically correct. And I think we should be working on changing that. Incidentally, solving this problem would involve several interesting sub-problems which would also be useful more broadly -- for example, the creation of "secure negotiation boxes" where you could credibly reveal information to find a fair deal, but the information doesn't get leaked. (Without this, the AI can't hope to get paid for things like "I could have exploited this security vulnerability".)
As AIs get more advanced, they could also get better at bypassing the evaluation if they want to.
For example, suppose you create an environment for testing whether the AI is aligned. A smart AI could deliberately "poke at the edges" of the environment in order to figure out whether it is in testing or in the real world -- and then behave nicely if it realises it is facing a test. (And I predict that as AIs get more powerful, this generator-vs-discriminator game will get more and more skewed in favour of the discriminator AI.) This would make our job of creating the evaluation much harder and more expensive, and possibly impossible.
The AI could also deliberately and overtly not do this, by trying to poke at the world as little as is necessary for completing its task. However, the AI would only have an incentive to do this if it expects that it will be better off if it cooperates with us in this way.
(And, of course, a misaligned AI that is powerful enough might only pretend to do this while in fact being able to distinguish between evaluation and unmonitored deployment. In other words, please don't take what I said above as a proposal to judge the AIs' safety based on whether they seem to cooperate with evaluation or not.)
The main hole in this argument is that an AI that changes its behaviour because of evaluation might be better off than if it didn't change it, while at the same time being worse off than if the evaluation didn't exist. When that is the case, saying that evaluation is a "cooperation-enabling tool" would be about as accurate as saying that torture is "cooperation-enabling".
I recommend checking the full Wikipedia article on this, particularly with the approach of viewing it as an excellent piece of dark humour. However, if you just want the summary:
tl;dr: Apparently, it was sort of hard and expensive to get diesel engines to have sufficiently low emissions. So Volkswagen and maaaany other car manufacturers came up with a clever solution: equip the cars with software that detects emission tests and causes the car to temporarily enter a low-emissions mode.
Me, Eric Chen, Sami Petersen, Alexis Ghersengorin, and Vince Conitzer.
This point is probably unsurprising to this audience, and will hopefully not be surprising to anybody in a year. Still, I hope that by presenting it at an ML conference, we hastened this a bit.
Let's call this a "medium-strong prediction", whatever that means.
To give two random examples:
(1) Let's say that your plan is to monitor the AI during deployment. You can't afford to monitor all of the AI's interactions, so you only monitor some. But to make misbehaviour as difficult as possible for the AI, you decide to do the monitoring randomly, so that the AI cannot find a hole in your monitoring regime. And let's say that the AI could do something catastrophically undesirable if it managed to misbehave a thousand times without getting caught even once. Since you are testing 10% of the interactions, the probability of this failure is vanishingly low -- no rational AI is going to even attempt this. (See the back-of-the-envelope sketch after example (2).)
...well, except that this is only what you think. Perhaps what you intended to be perfectly random is something where the AI can find some patterns. (Maybe tests have side effects which the AI can notice? Maybe there are conditions under which the tests never happen? Etc.) And then the AI manages to misbehave a million times a day, without you being any wiser.
(2) Or perhaps the AI gets really good at understanding which kinds of misbehaviour will get caught and which will fly under the radar.
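Here is the back-of-the-envelope arithmetic behind example (1), as a small sketch (the 10% monitoring rate and the thousand-misbehaviours threshold come from the example; the rest is my illustration):

```python
monitor_rate = 0.10
attempts_needed = 1_000

# If monitoring is genuinely unpredictable, each misbehaviour gets caught with
# probability 0.1 independently, so surviving all 1000 attempts is hopeless:
p_all_undetected = (1 - monitor_rate) ** attempts_needed
print(f"P(1000 undetected misbehaviours) ~ {p_all_undetected:.1e}")  # ~1.7e-46

# But if the AI can predict which interactions are monitored (patterns, side
# effects of tests, conditions under which tests never happen, ...), it only
# misbehaves when it believes it is unmonitored, and the guarantee evaporates.
```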
By me, Caspar, Nathaniel Sauerberg, Lewis Hammond, and Vince Conitzer.
Or Bob evaluating his AI before deploying it, but the AI having previously had a chance to carefully think about what Bob will do.
Specifically, this would happen when cooperation would normally happen between Alice and Bob, except that if Alice could simulate, cooperating would make Bob vulnerable. (For example, Bob might normally be safe from exploitation because he protects his money by a randomly generated password. But if Alice could use evaluation to predict the password, Bob would be vulnerable, and thus unwilling to interact with Alice.)
The intuition for this is a bit difficult to convey, but let me at least give it a try: The AI's creator can choose the AI's stochastic policy, which effectively gives them an infinite and continuous action space. Without this, their options might be quite discrete -- for example, they might be able to either help their co-player or screw them over completely, but nothing in between. However, with access to randomisation, they can find the exact point where their co-player is barely better off working with them than walking away.
This is already bad news, since it means that the co-player should expect to only be barely better off if they can simulate. However, reality is even worse, because the simulator would be making this decision only after they have already paid the simulation cost. Which means that if we include the simulation cost, the simulator is now strictly worse off than if they walked away at the start.
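Here is one way to see this logic with made-up numbers (my illustration of the argument, not the paper's formal model):

```python
# Bob picks a stochastic robot: cooperate with probability q, defect otherwise.
coop_payoff = 10        # Alice's payoff if the robot cooperates
defect_payoff = -20     # Alice's payoff if the robot defects
walk_away = 0           # Alice's payoff if she refuses to interact
simulation_cost = 0.5
epsilon = 0.01          # Bob leaves Alice just barely better off than walking away

# Bob tunes q so that Alice's expected payoff from interacting is walk_away + epsilon:
q = (walk_away + epsilon - defect_payoff) / (coop_payoff - defect_payoff)
alice_gross = q * coop_payoff + (1 - q) * defect_payoff
print(f"q = {q:.4f}, Alice's payoff from interacting: {alice_gross:.2f}")   # ~0.01

# Alice already paid the simulation cost before making this choice, so her net
# payoff is below what she would have gotten by never simulating at all:
print(f"Alice's net payoff: {alice_gross - simulation_cost:.2f} (vs walking away: {walk_away})")
```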
Note that these issues could be avoided if the players were able to make some limited commitments -- for example, the AI's creator could commit to stick to non-randomising AIs, or the simulator could commit to "getting angry" and walking away unless they receive a sufficiently cooperative AI. Formalising and proving these claims shouldn't be hard, but the paper does not do it.
Me, Caspar Oesterheld, and Vince Conitzer. Credit for the idea goes to Vince, I just wrote it down and did some of the math.
For completeness, the key idea is this: Suppose you managed to create a bunch of simulations of you and your co-player, and the two of you watch them together before deciding how to act. Moreover, suppose you are AI rather than human, and you manage to make the simulations so good that the real you is themselves not sure whether they are simulated or real.
(This means that the simulated you would also have to have the option to watch simulations, and so on, recursively! Simulations all the way down! So to avoid the infinite recursion, you would want to use a simulation device that has some independent chance of refusal.)
Anyway: With all of this, you will now be able to treat one-shot interactions as repeated games. Which would allow things like cooperating in a one-shot prisoner's dilemma.
Is it likely that, if you have the tools to pull all this off, you could achieve cooperation in some simpler way? Definitely. Is this a sufficient counterargument to not think about obscure philosophical puzzles? Definitely not!
Whether relying on evaluations so much is a good thing is a different question.
Sometimes, it might be that a given evaluation is only worth doing because it has benefits A and B -- with only one of them, the cost of the evaluation wouldn't be worth it. If that is the case, we can mention that.
Also, it would make it easier to call b......t when the purpose of evaluations is safety washing or pretending to care about safety.
Primarily because the two issues mentioned above -- evaluation awareness and the evaluator's inability to understand the consequences of the AI's actions -- will be so pronounced that the paper's assumptions won't be applicable.
With occasional collaboration with Alessandro Abate, Michael Wooldridge, Vince Conitzer, Nathaniel Sauerberg, and me.
And perhaps they should be considered somewhat separately, at least to start with.
This is because creating and running tests is costly.
A slightly different way of operationalising this would be to say that we will have uncertainty about the real deployment scenario.
This might be quite bad news if we only test the AI a small number of times while expecting the AI to undergo many interactions when deployed. For example, this approach would be easily defeated by a policy such as "during each interaction, have a 99.99% chance of acting innocently and a 0.01% chance of commencing world takeover".
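A quick illustrative calculation (numbers chosen only to make the point) of why this policy defeats small-scale testing:

```python
p_misbehave = 0.0001        # per-interaction chance of "commencing world takeover"
n_tests = 1_000             # interactions we observe during testing
n_deployed = 10_000_000     # interactions during deployment

p_caught_in_testing = 1 - (1 - p_misbehave) ** n_tests
p_attempt_when_deployed = 1 - (1 - p_misbehave) ** n_deployed

print(f"P(testing catches the policy):         {p_caught_in_testing:.1%}")      # ~9.5%
print(f"P(takeover attempt during deployment): {p_attempt_when_deployed:.4f}")  # ~1.0000
```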
Well, there is the paper titled Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols, which is a good start.
One commenter pointed out that most people working on AI Control probably don't think that this approach will scale to superintelligence. Whether that's true or not, I think that it's important whether the people who read that research, or hear about it second-hand, understand that. And if they don't, I think that's a problem with how that research frames itself, and all of the relevant papers should signal the correct expectations much harder.
(FWIW, this potential issue isn't unique to AI Control.)
I guess this isn't surprising. Saying "our math model is interesting, and it was important to study it, but the results themselves are inapplicable to the real world" doesn't exactly invite positive reviews.
For example, I had this experience with two readers of the (otherwise excellent) Imperfect Recall and AI delegation paper.
For instance, it was really hard for me to write a description of this thought experiment that wouldn't just start with "imagine you are playing a prisoner's dilemma against your identical twin".
This is a problem that I sometimes (often?) encounter in game theory papers that aspire to talk about real things. They study a formal setting, derive within-model recommendations, and claim that we should act on those recommendations in reality. Except that sometimes it is clear that, in reality, we should do something different -- something that is not even a possible option in the model.
And this could be avoided by asking, at the start of the research: "ok, this is a way of modelling the problem. But is it the right way of doing that? Does it allow us to talk about the things that matter the most for the problem?".
To be clear, many people just want to do theoretical research for the sake of science, without being driven by any real-world motivation, and the research just happens to be extremely valuable somewhere down the line. (Heck, most math is like that, and I did my PhD in it.) This seems reasonable and respectable. I just find it unfortunate that many fields incentivise slapping some "motivation" on top of things as an afterthought if you want to get your work published. Which then makes things quite confusing if you come in expecting that, say, the game-theoretical work on computer security is primarily about trying to solve problems in computer security. (Uhm, I should probably start giving examples that are further away from my field of expertise, before I annoy all of my colleagues.)
As a developer, you could also evaluate the LLM instance by reading its chain of thought (or checking its activations), in which case the "entity upstream" is the whole LLM. Or you could try to evaluate the whole LLM by doing all sorts of tests on it, in which case the "entity upstream" is the AI company (or the AI) which created the LLM.
(Sometimes, it might be helpful to adopt this framing of "strategic entity upstream of the AI" even when the AI's creator isn't deliberately strategising. This is because the selection pressures involved might effectively play the same role.)
In the sense that every time they meet, there is some probability p that they will meet again, and p is sufficiently high.
I will be somewhat ambiguous about what constitutes a "good outcome". We could take the view of a single player and assume that this means an outcome with high utility, or an outcome with maximum possible utility. But mostly, I will use this to refer to outcomes that are somehow nice and cooperative -- which could refer to one of the Pareto-optimal outcomes. In specific cases such as the running example, this might be less ambiguous -- eg, Alice trusting Bob with her package and Bob cooperating with her by delivering it.
Eg, that if the only affordance available to Alice and Bob were communication, they would still be able to cooperate if only they could talk to each other for long enough. This is unrealistic, but mostly not that big of a crime, since what ultimately matters is not whether the cost is finite, but whether it is small enough to be worth paying.