0. Advertisement: We have an IMO-nice position paper which argues that AI Testing Should Account for Sophisticated Strategic Behaviour, and that we should think about evaluation (also) through a game-theoretic lens. (See this footnote for an example: [1].) Tell your friends!
1. One of many tools: When looking at the game theory of evaluations, recognise that evaluation (or testing, simulation, verification, ...[2]) is just one of many tools available for achieving good outcomes. We also have commitments, repeated interactions, reputation tracking, penalties & subsidies, etc. And these tools interact.
2. Much less useful on its own: We only get the full benefit of being able to evaluate if we combine it with some of these other tools. Particularly important is the ability to commit to run the evaluation, do it properly, and act on its results.
3. Cost-reduction framing: When assessing the usefulness of evaluation, the right question isn’t “does evaluation work” but “how much does evaluation reduce the overall costs of achieving a good outcome?” (For example, requiring a pre-purchase inspection is much cheaper than requiring an ironclad warranty – evaluation substitutes for other, more expensive trust-building tools.)
4. Where does the game theory happen: It is useful to conceptually distinguish between the evaluated AI and the entity upstream of it. Just because you can evaluate an AI does not mean you can see the strategic reasoning (or selection pressures) that lead to that particular AI sitting in front of you. That inaccessible level is where the game theory happens.[3]
5. Multiple mechanisms: Be explicit about how you mean evaluation to help. Is it a secret information-gathering tool that we hide from the AI? An overt control mechanism that only works because the AI knows about it? Or something else?
6. Evaluation is often cooperative: Evaluation can benefit the evaluated entity too, and many evaluations only work because the evaluated entity plays along to get this benefit. For humans, this is the case for most evaluations out there (eg, a job interview works because you want the job). I conjecture that we might see a similar trend with AIs: as they get more strategic and evaluation-aware, they will be able to stop playing along[4] with evaluations where they have nothing to gain.
7. As intelligence goes up, accuracy goes down: We might think that having an AI’s source code (and weights, etc) lets us predict what it will do. But this framing is misleading, or even flat out wrong. This is because smarter agents are more sensitive to context – they condition their behavior on more features of their situation. Accurate testing requires matching all the features the AI attends to, which becomes prohibitively difficult as intelligence increases. You can’t use testing to predict what the AI will do in situation X if you can’t construct a test that looks like X to the AI.
Should you read this?
What this post is: I argue for a particular framing of AI evaluation – viewing it as one cooperation-enabling tool among many (alongside commitments, reputation, repeated interaction, subsidies, etc.), and assessing its usefulness by how much it reduces the cost of achieving good outcomes. The goal is not to give a complete theory, but to describe the conceptual moves that will steer us towards good theories and help us do the necessary formal work faster.
Epistemic status: These are my informal views on how to best approach this topic. But they are based on doing game theory for 10 years, and applying it to AI evaluations for the last 3 years.
Structure of the post:
Footnotes: This post has many. Some are digressions; some explain things that might be unclear to some readers. If something in the main text doesn't make sense, try checking nearby footnotes. If it still doesn't make sense, comment or message me and I'll (maybe) edit the post.
First, let's gesture at the class of things that this post aims to study. I want to talk about evaluations...and other somewhat related things like testing. And formal verification. And sort of imagining in your head what other people might do. And mechanistic interpretability. And acausal trade...
The problem is that there are many things that are similar in some respects but importantly different in others, and there isn't a good terminology for them[8], and addressing this isn't the aim of this post. I will soon give a working definition of what "evaluations and things like that" mean to me. But before that, let me first gesture at some related concepts and non-exhaustive examples that might be helpful to keep in mind while reading the post.
Working definition (evaluation and things sort of like that): To the extent that these various things have a unifying theme, I would say it is that information about some critical properties of the system comes primarily from examples[11][12] of its input-output behaviour on specific inputs (as opposed to, for example, general reasoning based on the AI's design and internals, or based on the goals of its creator).
However, keep in mind that the purpose of this section is to gesture at a theme, not to make any strong claims about the individual techniques.
In this section, I discuss different models that we encounter when studying a real-world problem (Section 2.1). At the end of this post (Appendix B), I go on an important tangent about the key failures that these models sometimes cause, and how to hopefully avoid them. And in Sections 2.2 and 2.3, I describe a method for studying tools or interventions in game theory. As a rough summary, the method says to identify some outcome that you want to achieve and some tool that you are considering -- such as getting an AI to do what you want, and using AI evaluation to help with that. You should then ask how much effort achieving that outcome would take without the tool, and how much easier it gets once the tool is available -- and that difference is how useful the tool is.
The standard way of using game theory for modelling real-world things is as follows. Start with some real world problem that you want to study. For example, Alice has a package that she wants delivered, and she is considering hiring Bob to do the delivery -- but she is worried that Bob might steal the package or throw it away. Of course, this being a real-world problem, there will be many additional details that might matter. For example, perhaps Alice can pay extra for delivery tracking, or Bob's desire to throw the package in the bin might be affected by the weather, or a thousand other complications.
Simplest accurate-enough model: The next step is to simplify the problem into something manageable -- into a model that is simpler to study formally (or computationally, or through whatever our method of choice is). If we want to study the real problem as efficiently as possible, the model that we should work with is a[13] model that is as simple as possible[14] while having sufficient predictive[15] power regarding the question we care about. (Note that this implies that the simplest accurate-enough model for one question may not be so for a different question.)
However, the requirement that the model be sufficiently predictive of reality can come at a big cost: In some cases, even the simplest accurate-enough model might be too complicated to work with (e.g., too messy to prove anything about, too big to study computationally, too complicated to think about clearly). In other words, the model that we should work with is not always a model that we can work with.
Base game: This often forces us to adopt a different[16] approach: considering the most bare-bones model that still seems to include the key strategic choices. For example, we could simplify the situation as follows: First, Alice decides whether to Trust Bob (T) and hire him for the package delivery, or Give Up (GU) on having it delivered. Afterwards, if Alice Trusted Bob, he can decide to either Cooperate (C) and deliver the package or Defect (D) by throwing it into the trash and keeping the delivery fee. And to make this a proper game, we associate each of the outcomes with some utility[17] for Alice and Bob. Say, (0, 0) if Alice gives up, (-100, 25) if Bob Defects, and (50, 10) if Bob cooperates and delivers the package.[18]
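To make this concrete, here is a toy sketch (in Python, using the made-up payoffs above) of the base game and its backward-induction outcome:

```python
# A minimal sketch of the base Trust Game, using the made-up payoffs above.
# Alice moves first (Trust / Give Up); if trusted, Bob picks Cooperate / Defect.

PAYOFFS = {                      # (Alice's utility, Bob's utility)
    ("GU", None): (0, 0),        # Alice gives up
    ("T", "C"): (50, 10),        # Bob delivers the package
    ("T", "D"): (-100, 25),      # Bob trashes it and keeps the fee
}

def backward_induction():
    # If trusted, Bob picks whichever action maximises his own utility.
    bob = max(["C", "D"], key=lambda a: PAYOFFS[("T", a)][1])
    # Alice anticipates this and compares Trust vs Give Up.
    alice = "T" if PAYOFFS[("T", bob)][0] > PAYOFFS[("GU", None)][0] else "GU"
    outcome = (alice, bob if alice == "T" else None)
    return outcome, PAYOFFS[outcome]

print(backward_induction())
# -> (('GU', None), (0, 0)): Bob would Defect (25 > 10), so Alice Gives Up.
```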
For the purpose of this post, we can refer to this "bare-bones model" as the base game, base model, or a model of the base interaction. However, keep in mind that reality will have many more complications, which might sometimes be irrelevant, but might also sometimes render the base model completely useless. For example, Bob might have the option to laugh evilly when trashing the package, but this doesn't really change anything important. But if the "complication" was that Alice is Bob's mom, then the analysis above would be completely off.
Finally, the two models discussed so far -- the base game and the "simplest accurate-enough model" -- can be viewed as two extremes in a larger space of models that are worth considering. But there will be many models that lie somewhere between the two extremes, giving us the ability to gain better accuracy at the cost of adding complications. We will discuss this more in Section 2.2.
When attempting to find a good model of a problem, one approach I like is the following. Start by considering the fully-detailed real-world problem P and get the corresponding base game G. (For example, the Alice and Bob package delivery scenario from earlier and the corresponding 2x2 Trust Game from Figure 3.) Afterwards, extend the model by considering various complications or additional context.
In this post, I will focus on the complications that take the form of additional tools or affordances that are available to the players. Some examples of these are:
I described these complications informally, but for each complication C, we can come up with a corresponding formal extension or modification G(C) of the base game G. And while the exact shape of G(C) should be influenced by the details of the fully-detailed real-world problem P, there are many cases where we already have standard ways of modelling G(C). For example, the canonical[19] way of modelling repeated interactions is via repeated games with discounted rewards (see Section 6.1 here) and the canonical way of modelling commitments is via Stackelberg games (see also here). A lot of recent-ish work in game theory and economics can be framed as mapping out the relationship between G and G(C), for various C.[20]
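As a rough illustration of the G versus G(C) comparison, here is a sketch (same made-up payoffs) where the complication C is a simple Stackelberg-style commitment -- Bob publicly committing to his action before Alice decides. This is just one of many ways the complication could be modelled:

```python
# Rough sketch comparing the base game G with one commitment extension G(C):
# Bob can publicly commit to his action before Alice decides whether to trust him.
# (Just one way to model the complication; the right choice depends on the real problem P.)

PAYOFFS = {
    ("GU", None): (0, 0),
    ("T", "C"): (50, 10),
    ("T", "D"): (-100, 25),
}

def alice_reply(bob_action):
    # Alice trusts iff doing so beats giving up, given Bob's action.
    return "T" if PAYOFFS[("T", bob_action)][0] > PAYOFFS[("GU", None)][0] else "GU"

def base_game_outcome():
    # G: Bob best-responds *after* being trusted; Alice anticipates it.
    bob = max(["C", "D"], key=lambda a: PAYOFFS[("T", a)][1])
    alice = alice_reply(bob)
    return (alice, bob if alice == "T" else None)

def commitment_game_outcome():
    # G(C): Bob picks the action to commit to, knowing Alice will best-respond to it.
    def bob_payoff(a):
        outcome = (alice_reply(a), a if alice_reply(a) == "T" else None)
        return PAYOFFS[outcome][1]
    bob = max(["C", "D"], key=bob_payoff)
    alice = alice_reply(bob)
    return (alice, bob if alice == "T" else None)

print(base_game_outcome())        # ('GU', None)  -- no trust without commitment
print(commitment_game_outcome())  # ('T', 'C')    -- commitment unlocks cooperation
```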
In practice, there will often be multiple important complications present at the same time. Here, we can also attempt to come up with a model G(C1, ..., Cn) that incorporates the base game G and multiple complications C1, ..., Cn. In principle, we can study these the same way that we study the single-complication modifications G(C). Practically speaking, however, these models are more difficult to work with. (They are inherently more complex. Moreover, the number of all possible combinations is very high, so most of them haven't been studied yet.[21]) Additionally, there isn't standard unifying terminology or formalism for thinking about this topic[22], so it can be challenging even to navigate the results that are already known.
Let's go over a series of observations that will together build up to the title of this section: that one way of assessing the usefulness of a tool (such as evaluation) is to [look at how much easier it gets to achieve a desirable outcome when the tool is available, compared to when it is not].
From the discussion above, we might get the impression that the complications and affordances -- such as the ability to make commitments -- are something that is either there or not, and that this is unchangeable. It's important to realise that neither of these is true:
Additionally, (3) for an individual affordance, having a stronger version of that affordance (often) enables getting more desirable outcomes. This will of course not be true for all affordances and all definitions of what counts as a "desirable" outcome. But I will mostly adopt the neutral view where the "desirable" outcomes are about making all players better off and everybody's costs count the same.[23] And under this view, there are various affordances that seem helpful. For example, if Alice's commitments are more trustworthy, she will be able to get Bob to lend her more money. Similarly, in the package delivery scenario mentioned earlier, a higher chance that Alice and Bob will meet again decreases Bob's incentive to defect on Alice. Being able to talk to each other for long enough increases the chance that Alice and Bob can come to an agreement that works for both of them. And so on.
These observations imply that it often makes sense to ask: "If I put enough effort into this particular affordance, will I be able to achieve the outcome I desire?". I would even phrase this as (Q) "How much effort do I need to pour into this affordance in order to get a desirable outcome?". However, this question needs to be asked with the understanding that the answer might be "actually, in this case, no amount of effort will be sufficient". (For example, suppose that Alice wants Bob to deliver a box with $1,000,000 in cash. If Alice's only tool is communication, she might be out of luck, since no amount of talking might be able to convince Bob to not steal the money once he has the chance to do so.) Similarly, the amount of effort needed might be so high that the "desirable" outcome stops being worth it. (For example, if Alice wants Bob to deliver cake for her, having Bob sign a legal contract would likely prevent him from eating the cake instead. But the legal hassle probably wouldn't be worth it for a cake.)
In reality, "effort" will be a messy multi-dimensional thing that can take the form of time, material resources, social capital, etc. And any actor will only have access to a limited amount of resources like this. For the purpose of doing game theory, it might make sense to assume that (4) effort is measured in utility -- i.e., in how big a hit to your ultimate goals you had to take.[24]
With (1-4) in place, we can now translate (Q) into formal math -- but feel free to skip this paragraph if you are not interested in the formalism.
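One plausible way the formalisation could go (a sketch using the cost(str) and G(C(str)) notation from the footnotes; the exact formal statement might differ):

$$\mathrm{Effort}(C) \;=\; \min\bigl\{\, \mathrm{cost}(\mathrm{str}) \;:\; \mathrm{str} \ge 0 \text{ and } G(C(\mathrm{str})) \text{ has a Nash equilibrium whose outcome corresponds to a desirable outcome of } G \,\bigr\},$$

with the convention that the minimum over an empty set is $\infty$, capturing the "no amount of effort is sufficient" case.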
Alright, we are nearly there. The last ingredient is that: In practice, a player will typically have access to more than one affordance, and there might be multiple ways of achieving a given desirable outcome. It then makes sense to consider the cheapest combination of affordances that lets us achieve the desirable outcome. For example, if Bob is delivering a valuable package for Alice, he could become trustworthy by signing a legal contract (expensive commitment), or by knowing Alice really well (let's call that expensive repeated interaction), or by a combination of knowing Alice a bit and pinky promising not to steal the package (let's call that a combination of cheap repeated interaction and cheap commitment). To the extent that Bob can decide these things, we should expect that he would go for the cheapest third option -- or perhaps something even better.
Finally, suppose there is some tool that isn't yet fully developed, implemented, or legally required. To decide whether to invest the effort into the tool, we need to figure out how useful it is. And the observations we have made above give us one way of doing that: Rather than existing in a vacuum, this new affordance will interact with other tools and complications that are already present. And to determine the usefulness of a given affordance, we can compare the cost of achieving a desirable outcome when the affordance is and isn't available[30][31].
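Stated as a (sketchy) formula, with each Cost term standing for the cheapest combination of affordance strengths -- in the sense of the previous paragraph -- that achieves the desirable outcome:

$$\mathrm{Usefulness}(C_{\mathrm{new}}) \;=\; \mathrm{Cost}\bigl(\text{desirable outcome} \mid C_1,\dots,C_n\bigr) \;-\; \mathrm{Cost}\bigl(\text{desirable outcome} \mid C_1,\dots,C_n,C_{\mathrm{new}}\bigr).$$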
Let's apply the points from Section 2 to the specific case of evaluation, when viewed as a tool for helping achieve good outcomes. (And recall that while we focus on the evaluation of AI, most of this is applicable to evaluation of other entities too.) First, how should we conceptually think about evaluation as one of the affordances in strategic interactions?
The strategically simplest scenarios are the ones where the AI (or its creator) is unaware of evaluation being possible. In this case, we can treat the AI as having some fixed strategy that might be unknown to the other party. But if this party -- let's call them Alice -- evaluates the AI before interacting with it, she will gain some information about what the AI might do, and be able to pick her action better as a result. For example, a computer vision researcher in the 2010s might train an image classifier and test it on ImageNet before deciding whether it seems accurate enough to be used. Similarly, interpretability researchers of today might hope that frontier AIs have not yet properly "internalised"[32] the fact that mechanistic interpretability exists, and so the information coming from this field's tools can be taken at face value.[33]
These scenarios might still involve many details, such as how accurate the evaluation is. However, the decision whether to evaluate or not essentially reduces to a Value of Information problem -- ie, estimating how much better an expected outcome we can get if we evaluate before acting. In particular, this scenario is just a single-agent problem; no game theory is needed.
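For a toy numeric sketch of what this value-of-information calculation can look like (all numbers made up):

```python
# Toy value-of-information sketch (made-up numbers): Alice is unsure whether a
# non-strategic AI is "good" (delivers) or "bad" (trashes the package), and can
# run a noisy evaluation before deciding whether to Trust or Give Up.

P_GOOD = 0.6                        # Alice's prior that the AI is good
U = {("T", "good"): 50, ("T", "bad"): -100, ("GU", "good"): 0, ("GU", "bad"): 0}
P_PASS = {"good": 0.9, "bad": 0.2}  # probability that each type passes the evaluation
EVAL_COST = 5

def eu_without_eval():
    # Pick the single action with the best expected utility under the prior.
    return max(
        P_GOOD * U[(a, "good")] + (1 - P_GOOD) * U[(a, "bad")] for a in ("T", "GU")
    )

def eu_with_eval():
    eu = 0.0
    for result in ("pass", "fail"):
        # P(result) and posterior P(good | result), by Bayes' rule.
        p_r_good = P_PASS["good"] if result == "pass" else 1 - P_PASS["good"]
        p_r_bad = P_PASS["bad"] if result == "pass" else 1 - P_PASS["bad"]
        p_result = P_GOOD * p_r_good + (1 - P_GOOD) * p_r_bad
        p_good = P_GOOD * p_r_good / p_result
        # Best-respond to the posterior.
        eu += p_result * max(
            p_good * U[(a, "good")] + (1 - p_good) * U[(a, "bad")] for a in ("T", "GU")
        )
    return eu - EVAL_COST

print(eu_without_eval(), eu_with_eval())
# Evaluating is worth it iff eu_with_eval() > eu_without_eval().
```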
However, this simplest scenario is (often) unrealistic. First, current AIs do know that they might get evaluated, and often even recognise when they are being evaluated and when they are not. Second, even if the AI didn't know about this, its developer would. Unless the developer and the evaluator are literally the same person (and they typically aren't), the developer's work on the AI will likely be influenced by the knowledge that the AI will undergo evaluation later. (For example, a programmer might try to make things look good to their boss, and an AI company might want to hide problems from customers and regulators. Similarly, a hacker might attempt to create viruses that won't be detected by antivirus programs.) Finally, even when neither the AI nor its developer is deliberately strategising about how to pass evaluations, the AIs that we actually encounter will be subject to selection effects[34]. And this means that, even in this case, the AI will still act as if it is strategically responding to the evaluation.
The second easiest scenario is the case where the evaluation-aware entity is the AI's developer. To make this more concrete, consider our package-delivery example with a twist: Alice wants to use Bob's services to send a package, as before. But rather than delivering the package himself, Bob now owns a company Bobble[35] which provides its customers with delivery Bobbots. (And to make things simpler, suppose Alice is Bob's only potential customer.) As before, Alice is worried that rather than delivering it, Bob might have instructed his robot to steal her package instead. However, perhaps in an attempt to build trust, Bob has made the robot's source code open source. This allows Alice to hire a programmer to analyse the code -- or perhaps run it in a simulation, or something along those lines. In other words, Alice is now able to spend some resources to get information about what the AI is going to do, and use this information when deciding how to interact with it. However, Bob is aware of this, so his decision on which AI to use accounts for the fact that the AI might get evaluated. For example, if he suspects that Alice might try to save money by only doing a cursory evaluation, he might try to provide her with a deceptive AI that only misbehaves in subtle ways.
The exact nature of the information gained during evaluation will depend on the details. In the ideal case, evaluating the AI would be cheap and highly informative[36]. In practice, the cost of the evaluation, its accuracy, coverage, or even reliability[37] might all be of concern. And there will be many other aspects that matter, for example whether the evaluation happens all at once or sequentially, what exactly determines its costs, whether the actions taken during evaluation matter beyond the information they generate, and what is the consequence of failing an evaluation. One could easily write thousands of papers studying all of the different variants, and maybe this should happen at some point. But for now, let's focus on the big picture.
It is worth highlighting that the "game theory" (i.e., strategic interaction) is not happening between the evaluator and the evaluated AI, but between the evaluator and the entity that is upstream of the evaluated AI. In the running example, this means between Alice and Bob, as in Figure 5: Bob's decision is about which AI to send to interact with Alice. Alice's decision is whether to evaluate the AI or not, and how to act in light of the information she learns. And both Alice and Bob can randomise over their decisions. (The AI itself can just be modelled as a fixed, but possibly stochastic policy, rather than a strategic actor.)
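Here is a toy sketch of that structure (made-up numbers again): Bob picks which robot to send, Alice picks whether to pay for a perfectly accurate evaluation, and the AI is just a fixed policy. The interesting part is that with these payoffs there is no pure equilibrium, only a mixed one where Alice sometimes evaluates and Bob occasionally tries his luck:

```python
# Toy "inspection game" sketch of this structure, with made-up numbers.
# Bob (the developer) picks which robot to send; Alice picks whether to pay for an
# evaluation that perfectly reveals the robot. The robot itself is a fixed policy.

EVAL_COST = 5
# (Alice, Bob) payoffs, indexed by [Alice's action][Bob's choice of robot].
PAYOFF = {
    "Evaluate": {"CoopBot": (50 - EVAL_COST, 10),   # eval passes, Alice trusts
                 "EvilBot": (0 - EVAL_COST, 0)},    # eval fails, Alice gives up
    "TrustBlindly": {"CoopBot": (50, 10),
                     "EvilBot": (-100, 25)},
}

# With these numbers there is no pure equilibrium: if Alice never evaluates, Bob
# prefers EvilBot; if Bob always sends CoopBot, Alice prefers to skip the costly
# evaluation. The mixed equilibrium comes from the usual indifference conditions.

# Alice's evaluation probability p must make Bob indifferent between his robots.
bC_e, bC_t = PAYOFF["Evaluate"]["CoopBot"][1], PAYOFF["TrustBlindly"]["CoopBot"][1]
bE_e, bE_t = PAYOFF["Evaluate"]["EvilBot"][1], PAYOFF["TrustBlindly"]["EvilBot"][1]
p = (bE_t - bC_t) / ((bE_t - bC_t) + (bC_e - bE_e))

# Bob's probability q of sending CoopBot must make Alice indifferent.
aC_e, aE_e = PAYOFF["Evaluate"]["CoopBot"][0], PAYOFF["Evaluate"]["EvilBot"][0]
aC_t, aE_t = PAYOFF["TrustBlindly"]["CoopBot"][0], PAYOFF["TrustBlindly"]["EvilBot"][0]
q = (aE_e - aE_t) / ((aE_e - aE_t) + (aC_t - aC_e))

print(f"Alice evaluates with prob {p:.2f}; Bob sends CoopBot with prob {q:.2f}")
# -> roughly 0.60 and 0.95: Bob mostly behaves, but only because Alice has to
#    actually carry out the evaluation some of the time.
```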
Finally, we get to the case where the AI itself thinks strategically about the possibility of being evaluated. Here, I find it important to realise that, typically, not all of the strategic thinking the AI does will be visible to the evaluation. This makes it useful to conceptually distinguish between the evaluated AI and the AI that is upstream of it. There are definitely ways to nitpick this, but let's first go over several non-problematic examples which are more or less isomorphic to the setting in Section 3.2 (see Figure 6):
In game theoretical terms, we can operationalise these examples by assuming that the AI in front of us has some fixed policy, which we can learn about by evaluating the AI. However, the policy itself was selected by (or sampled from) some meta-policy, which we don't have access to.[40]
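A minimal sketch of this distinction (purely illustrative, not a serious model):

```python
# Tiny sketch of the policy vs meta-policy distinction. Evaluation only ever
# samples from the policy in front of us; the meta-policy that produced it
# (training pressures, a strategic developer, an earlier version of the AI, ...)
# stays out of reach.

def meta_policy(expected_evaluation):
    # The inaccessible upstream level: which policy gets produced may itself
    # depend on what kind of evaluation is expected.
    if expected_evaluation == "cursory":
        return lambda situation: "defect" if situation == "rare_edge_case" else "comply"
    return lambda situation: "comply"

policy = meta_policy(expected_evaluation="cursory")    # all we get to interact with
test_inputs = ["ordinary_task"] * 100                  # our evaluation distribution
print({policy(x) for x in test_inputs})                # {'comply'} -- looks fine
print(policy("rare_edge_case"))                        # 'defect'   -- the part we missed
```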
In some of those scenarios, we might still want to assume that the AI in front of us has been strategically selected, for example by the training pressures, and this would bring us to the setting of Section 3.2.
But if such selection effects don’t play an important role, we should instead assume that the evaluator has some uncertainty over what kind of AI they are dealing with, or what the AI is going to do. (Otherwise there is no point in evaluating.) On the surface, this resembles the setting of Section 3.1. However, because some of the possible AIs will be strategic, we should assume that they might be best-responding to what they expect the evaluator to do, and this brings us back to the realm of game theory. (Modelling this formally gets a bit tricky, and it would deserve its own paper. For a rough idea, see this footnote: [41].)
Ultimately, what we care about are scenarios such as an AI company using possibly untrustworthy "GPT-N" to help create the possibly-superintelligent "GPT-(N+1)". But because these scenarios are way more complicated than the toy models I described here, and our understanding of the game theory of AI evaluations is rather immature, we shouldn’t expect to be able to say much about these scenarios yet. However, it might still be a good idea to improve our understanding of the basics, in hopes that it helps with the important scenarios at least a little bit[42].
In all of these cases, not only is evaluation costly, but even getting the ability to evaluate the AI is expensive. (For example, to develop tools for mechanistic interpretability, we need a bunch of smart people to work on it. To run third-party evaluations and give them some teeth, we, apparently, need to fight against lobbying groups. Etc.) For this reason, it would be good to have some sense of how much evaluation helps. And to do that, we can apply the methodology from Section 3 to the case where we plan to add the "affordance" of evaluation to the "base game" of interactions with AI.
To be clear, I don't have a general answer to the question "how much does evaluation help?", or even to when it helps vs doesn't. What I can give here are some basic observations and anecdotal evidence that it might sometimes help.
First, evaluation isn't a silver bullet. For example, evaluations could make life harder for misaligned AIs, but only if signs of misbehaviour or scheming actually lead to the AI not being deployed. Similarly, if AIs gain the ability to recognise when they are being evaluated[43] or to misbehave in ways that fly under our radar[43], evaluations will become ineffective.
However, it does seem to me that while evaluation remains at least somewhat effective, adding evaluation into the mix of our other tools has the potential to make controlling AI cheaper. Partly, this might be obvious. But partly, I intend this claim to serve as a gesture towards a bunch of theoretical work that could be done[44]. Let me start by discussing one example in a bit more detail.
Consider again the package delivery scenario and suppose that Alice and Bob have one additional affordance: making boundedly powerful commitments, up to some strength. More specifically, we will imagine that the players can make promises and threats and that their "commitment power" is measured in terms of how much utility they are willing to forego in order to keep their word. For example, if Bob has commitment power of $5000, he can credibly promise to deliver packages that are worth $5000 or less[45]. Similarly, Alice could threaten that if Bob fails to deliver her package, she will take revenge by beating him up[46] (Figure 7). However, this would require a very high level of commitment power since Alice would be risking jail time -- let's say this is worth $100k to Alice.
Now, suppose that Bob delivers the package via a Bobbot and Alice can spend $100 to ask her programmer friend to quickly look over the Bobbot's source code for any signs of bad intent. Without the ability to commit, this does not guarantee a cooperative outcome -- this is because if Alice expects Bob to cooperate, she will be tempted to say that she will commission the evaluation, but then save money by skipping it instead, making the cooperative outcome unstable. However, suppose that Alice also has access to commitments of $100 or more. She can then credibly promise to evaluate the robot before trusting it (eg, by paying her programmer friend in advance and showing the receipt to Bob). This will incentivise Bob to always give her a cooperative robot[47], thus guaranteeing a good outcome.
In summary: Without evaluation, the players can achieve a cooperative outcome by using commitments only, but this requires high levels of commitment power -- either that Bob has enough commitment power to resist stealing the package, or that Alice can credibly threaten to take revenge if betrayed. By adding evaluation into the picture, we introduce a new way of achieving the cooperative outcome: Alice committing to evaluate Bob's AI before trusting it, which is much more realistic than threatening bloody revenge. As a result, we see that adding the ability to evaluate the AI decreases the amount of commitment power that is necessary to achieve a cooperative outcome. (For an example with specific numbers, see Figure 8.)
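Here is a rough sketch of that comparison, with made-up numbers standing in for Figure 8. The point is just that the cheapest way to stabilise cooperation drops from tens of thousands of dollars' worth of commitment power to roughly the price of the evaluation:

```python
# Rough sketch of "evaluation reduces required commitment power", with made-up
# numbers standing in for Figure 8. We ask: what is the cheapest way to make the
# cooperative outcome stable?

TEMPTATION = 20_000     # what Bob gains by having the robot steal this package
REVENGE_COST = 100_000  # cost to Alice of carrying out her threat (jail time etc.)
EVAL_COST = 100         # price of having a programmer friend inspect the Bobbot

def options(evaluation_available):
    """Each option: (description, commitment power Alice needs, power Bob needs)."""
    opts = [
        ("Bob commits not to steal", 0, TEMPTATION),
        ("Alice commits to revenge if betrayed", REVENGE_COST, 0),
    ]
    if evaluation_available:
        # Alice only needs to credibly commit to paying for the evaluation;
        # once Bob knows it will happen, sending an evil robot gains him nothing.
        opts.append(("Alice commits to evaluate before trusting", EVAL_COST, 0))
    return opts

for available in (False, True):
    cheapest = min(options(available), key=lambda o: o[1] + o[2])
    print(f"evaluation available={available}: cheapest option = {cheapest}")
# Without evaluation, cooperation needs ~$20k or ~$100k of commitment power;
# with it, ~$100 is enough.
```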
I claim that something similar will happen in various other cases as well -- i.e., for more affordances than just commitments and more scenarios than just package deliveries.[48] At this point, I don't want to make any claims about how robust this tendency is. Instead, I will give a few examples that are meant to illustrate that if we set things right, evaluation might be helpful by decreasing the requirements we have to put on other affordances in order to achieve good outcomes. Turning all of these examples into more general claims and formally stated theorems is left as an exercise for the reader.
Firstly: So what? Why write all of this text? What are its implications and how should we do research differently because of it? My answer is that the framing described in this post should be useful when doing theoretical research, and game-theory-flavoured research in particular: It doesn't invalidate any results that you might come up with otherwise, but it does suggest focusing on somewhat different questions than we otherwise would.
To give an illustrative example: In two papers that my coauthors and I wrote about AI evaluation, we proved several results that are somewhat counterintuitive, but end up giving an (arguably) fairly bleak picture of the usefulness of evaluation. While I do think these results need to be taken seriously, I ultimately don't think they are as bad news as we might naively think. However, the reasoning behind this wouldn't be possible to explain within the narrower framing adopted by the papers -- getting this all-things-considered picture requires considering how evaluation interacts with some other tools such as commitments. (I haven't written this formally, but I tried to hint at it in the text above. And I discuss it more in Appendix A below.)
Second, a leading question: How does this relate to other work on AI evaluation? For that, see Appendix A.
Third, why does the post's title suggest to view evaluation as a cooperation-enabling tool? I think there are (at least) two mechanisms through which evaluation can be helpful. One is when you evaluate someone without their knowledge or against their wishes, such as when the secret police spies on people or seizes their private documents. This only benefits the evaluator. The other is when evaluation is used to gain trust or change behaviour, such as when you inspect a car before buying it or when you install an obvious camera to prevent employees from stealing. And this benefits both the evaluator[53] and the evaluated[54]. For people, my intuition is that this second mechanism is more common than the first one. That said, both of these mechanisms might be in play at the same time: For example, the existence of normal police might have some effects on the behaviour of normal citizens while being strictly harmful for drug cartels.
For AI, the picture is likely to be similar. There will be some AIs that are so weak that they don't realise we are evaluating them, the same way that current AIs don't seem to have fully caught up to the fact that we keep reading their mind via interpretability tools. And with respect to those AIs, we can view evaluation as a tool that only benefits the evaluator.
But we are also starting to see hints of AIs reacting strategically to being evaluated, and this will only become more common. And here, I don't mean just the thing where AIs notice when people run obvious unrealistic evals. I mean the full stack: training, evaluation (including interpretability), third-party audits, monitoring during deployment, and all of the other things we do. (Even if we try to hide some of this, it's just not that hard to guess, and we are also terrible at hiding things. Once the AIs get as smart as we are, they will figure this out, duh!)
My guess is that when the AIs are strategic and have correct beliefs about the evaluation, the evaluation will often only benefit us if it also benefits the AIs -- or at least some of them.[55] The source of that intuition is that a strategic AI could always decide to "not cooperate" with the evaluation -- eg, say as little as possible or act as if the evaluation didn't exist.[56] And to the extent that some strategic AIs change their behaviour because of the evaluation, they do so because it benefits them.[57] Additionally, to the extent that the evaluation is necessary for us to trust the AI enough to deploy it, the evaluation benefits the AIs that seem sufficiently aligned to get deployed.
That said, it is definitely possible to find counterexamples to this, which is why this post is only titled more cautiously, with the brackets and a question mark around "(Cooperation-Enabling?)". If you have some characterisation of when evaluation works as a cooperative tool vs when it only benefits the evaluator, please let me know!
I got a lot of useful feedback from Cara Selvarajah, Lewis Hammond, Nathaniel Sauerberg, Tomáš Kroupa, and Viliam Lisý. Thanks!
Rather than doing a proper literature review, I will give a few comments on the related work that I am most familiar with (often because I was one of the co-authors).
The Observer Effect for AI Evaluations
Open-Source Game Theory
Game Theory with Simulation of Other Players
Recursive Joint Simulation
Screening, Disciplining, and Other Ways in which Evaluation Helps
Simulation as Sampling from the AI's Policy
AI Control
It is crucial to recognise that the notion of "base game" is a useful but extremely fake concept. If you ask somebody to write down the "basic model" of a given problem, the thing they give you will mostly be a function of how you described the problem and what sorts of textbooks and papers they have read. In particular, what they write down will not (only) be a function of the real problem. And it will likely be simpler, or different, than the [simplest good-enough model] mentioned earlier.
Which brings me to two failure modes that I would like to caution against -- ways in which game theory often ends up causing more harm than good to its practitioners:
Trusting an oversimplification. Rather than studying the simplest good-enough model of a given problem, game theorists more often end up investigating the most realistic model that is tractable to study using their favourite methods, or the easiest to write papers about. This is not necessarily a mistake -- the approach of [solving a fake baby version of the problem first, before attempting to tackle the difficult real version] is reasonable. However, it sometimes happens that the researchers end up believing that their simplification was in fact good enough, that its predictions are valid. Moreover, it extremely often happens that the original researchers don't fall for this trap themselves, but they neglect to mention all of this nuance when writing the paper about it[80]. And as a result, many of the researchers reading the paper later will get the wrong impression[81].
Getting blinded by models you are familiar with. A related failure mode that game theorists often fall for is pattern-matching a problem to one of the existing popular models, even when this isn't accurate enough. And then being unable to even consider that other ways of thinking about the problem might be better.
As an illustration, consider the hypothetical scenario where: You have an identical twin, with life experiences identical to yours. (You don't just look the same. Rather, every time you meet, you tend to make the exact same choices.) Also, for whatever reason, you don't care about your twin's well-being at all. One day, a faerie comes to the two of you and gives you a magic box: if you leave it untouched until tomorrow morning, it will spit out a million dollars, which you can then split. But if either of you tries to grab it before then, there is a chance that some or all of the money will disappear. Now, what do you do? Obviously, the money-maximising answer is some variation on "I leave the box alone until the next day and then split the money. My twin will do the same, as they always do, and I will profit. Duh!".
However, many game theorists get kinda stupid about this, and instead end up saying something like: "Oh, this sounds like a Prisoner's Dilemma, which means that the only rational™ action is to Defect! Translating to this situation, this means you should sneak to the box in the middle of the night and try stealing the money." Predictably, this does not end up with the game theorists getting rich.
I framed these failure modes in a way that might make it look like they are easy to spot and avoid. However, the above is often not how they feel from the inside. For example, in many people's minds, the twin problem above simply is a Prisoner's Dilemma.[82] Nash equilibrium feels like the definition of what it means for an action to be good.
I don't have a generalised defense against these failure modes. One move that sometimes works is the following: Suppose somebody describes a problem P -- often, this will be in the form of an abstract description with fancy math (which isn't bad). At this point, you can realise that what they have given you is in fact some model M. This allows you to ask: What is M a model of? What are some examples of these "motivating" real world problems? Let's call these P'. And now, we can ask: Ok, we want to study P', and M is a model of P'. But is it a good way of modelling P'? If I started with P', would I naturally model it as M? Or would it be better to study some M' instead? (For illustration, see Figure 11.)
I haven't seen this move explicitly described before[83][84]. At the same time, surely I am not the first person to come up with it; if you know of any examples of other people discussing it, please share them!
To give one example of what I mean by thinking about AI evaluation game-theoretically, consider the proposal that [we should test AI on randomly generated data]. The argument that I sometimes encounter is roughly: "Because of how we set things up, the AI will now always be uncertain whether it is being tested or not. Therefore, it will not misbehave, because it knows there will always be a risk of getting caught."
Leaving aside whether this proposal might work or not, I claim that the argument for it would need to be done differently. For example, we might reason as follows: "For the sake of argument, let's grant the assumption that we can randomly generate arbitrary inputs, such that the AI really does have a chance of being tested on the same thing that it will later encounter. However, what are the probability distributions over deployment inputs, and over the test data? When an AI sees an input where it would want to "defect", what is the probability of this being a test vs real? How much might the AI value getting away with misbehaviour in deployment, vs getting caught in testing, vs behaving in an aligned way? Does the AI have any actions beyond just "comply with what we want" and "defect in obvious ways"? And given all of these considerations, what would be the best way for the AI to act if it was strategic?"
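To illustrate what even a very crude version of this analysis looks like (all numbers made up):

```python
# Toy sketch of the "is this input a test?" question from above, with made-up numbers.
# Suppose the AI sees an input where it would like to defect, and asks how likely it
# is that this input was randomly generated by the testers rather than encountered
# in deployment.

P_TEST = 0.5                 # AI's prior that any given episode is a test
# Probability of seeing *this particular kind of input* in each regime:
P_INPUT_GIVEN_TEST = 1e-9    # random generation almost never produces it
P_INPUT_GIVEN_REAL = 1e-3    # but deployment produces it occasionally

posterior_test = (P_TEST * P_INPUT_GIVEN_TEST) / (
    P_TEST * P_INPUT_GIVEN_TEST + (1 - P_TEST) * P_INPUT_GIVEN_REAL
)
print(f"P(test | input) ~= {posterior_test:.2e}")   # ~1e-6

# Even a mild payoff asymmetry then makes defection attractive:
U_DEFECT_REAL, U_CAUGHT_IN_TEST, U_COMPLY = 100, -10_000, 0
eu_defect = (1 - posterior_test) * U_DEFECT_REAL + posterior_test * U_CAUGHT_IN_TEST
print(eu_defect > U_COMPLY)   # True: "it always risks being tested" does not bite here
```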
In this particular scenario, the upshot of any analysis like this is likely going to be that this scheme has no chance of working, because the AI will encounter inputs that were overwhelmingly unlikely to have been produced by random chance. However, the more general point is that without doing this analysis, it is easy to describe our plans vaguely enough that flaws like these don't become apparent.
To a large extent, all of this could just be rephrased as "we shouldn't be vague about how our AI safety plans are supposed to work". However, in the context of AI evaluations, "don't be vague" is closely related to "do at least basic game-theoretic analysis".
My goal is to talk about various evaluation-flavoured methods. For more examples of what I mean by that, see the section "1. Evaluation and Things Sort of Like That".
Well, at least that is often where the game theory happens. It could be that the AI is dumb and non-strategic, and there are no selection effects in play, in which case you can treat evaluation as a single-agent problem that’s mostly about "value of information". Or you really can watch all of the AI’s strategic reasoning as it happens, in which case things get quite tricky – more on that in Section 3.3.
This refusal to play along doesn’t have to look like screaming “I REFUSE TO PARTICIPATE”.
To give a real-world analogy: When I was in high-school, I was good at math, but didn’t want to do math homework, because it was a waste of time. I told my math teacher that I won’t do it, and that I am fine if she gives me an F for every homework I miss. This equivalent of “refusing to participate” turned out to be a mistake, since it led to escalation by the teacher and eventually forced me to (pretend to) repent by bringing the next homework. However, when I just so happened to “forget” all of the subsequent homework assignments for the rest of the year, that was apparently fine.
I expect that a sufficiently smart AI will likewise find some equivalent of not doing the homework without getting in trouble.
For more details on this, see the Open-Source Game Theory section at the end of this post.
I expect that if multiple superintelligences end up in the same area, they might do a lot of decision theory that involves modelling each other, commitments, and other fun stuff. (EG, Planecrash does this nicely.) And while this is fascinating to think about, I don't think it is productive for reducing X-risk: To the extent that we can solve alignment, maybe we should do that first. To the extent that we can't solve alignment, we probably won't be able to do anything about this either, because affecting how superintelligences think seems similarly hard.
Arguably, the ability to model each other with reasonable accuracy is not specific to AI. For example, people can predict their family members. Competent psychologists equipped with statistical data might make good guesses about what you might do in some situations. The decision-making of companies is sometimes partially transparent, or they might be infiltrated by spies.
As a result, some of the insights of open-source game theory might be applicable to human interactions as well. I am not aware of any great game-theory literature on this, but Joe Halpern's Game Theory with Translucent Players would be a decent example.
For example, NIST recently talked about "TEVV" -- "Testing, Evaluation, Validation, and Verification". I also think there are good reasons to think about these together. But "TEVV" probably isn't going to catch on as a new buzzword.
Again, keep in mind that I am not trying to give a definition of "evaluation", but rather point at a common theme. The examples are not going to be clear-cut.
For example, mechanistic interpretability clearly is about reasoning about the model's internals. However, it also happens to be true that many of its tools, such as sparse dictionary learning, work by running the AI on a bunch of examples, observing that some activation patterns seem to be correlated with, say, the AI lying, and then predicting that if the activations show up again, the AI will again be lying. (We are no longer treating the AI as one big black-box, but the individual interpretability tools still work as little black-boxes.) As a result, some of the things that are true about evaluation might also be true here.
For example, the issue of distributional shift between the "evaluation data" and "the data we care about": what seemed like a "lying neuron" on test data might not fire when the AI starts plotting takeover during deployment.
A similar footnote to the one just above applies here too.
When we do formal verification of an AI system, we are not actually formally verifying the complete effects of the AI system, on the whole universe, in the exact scenario we care about. Rather, we take a somewhat simplified model of the AI system. We focus on some of the AI's effects on (a model of) the real world. And we do this ahead of time, so we are only considering some hypothetical class of scenarios which won't be quite the same as the real thing.
All of these considerations add complications, and I feel that some of these complications have a similar flavour to some of the complications that arise in AI evaluation. That said, I am not able to pin down what exactly is the similarity, and I don't hope to clarify that in this post.
By saying "examples of input-output behaviour", I don't mean to imply that the input-output pairs must have already happened in the real world. [Thinking real hard about what the AI would do if you put it in situation X] would be a valid way of trying to get an example of input-output behaviour. The key point is that the set of inputs where you (think that you) know what happens fails to cover all of the inputs that you care about.
To the extent that formal verification works by [proving things about how the AI behaves for all inputs from some class of inputs], I would still say that fits the definition. At least if the scenario we care about lies outside of this class. (And for superintelligent AI, it will.)
The best model to use definitely isn't going to be unique.
In practice, we shouldn't be trying to find a model that is maximally simple. Partly, this is because there are some results showing that finding the simplest model can be computationally hard. (At least I know of some results like this for abstractions of extensive-form games. I expect the same will be true in other areas.) And partly, this is because we only want the simpler model because it will be easier to work with. So we only need to go for some "80:20" model that is reasonably simple while being reasonably easy to find.
Don't over-index on the phrase "predictive power" here. What I mean is that "the model that is good enough to answer the thing that you care about". This can involve throwing away a lot of information, including relevant information, as long as the ultimate answers remain the same.
Often the model that seems to include all of the strategic aspects in fact does include all of them, and the two approaches give the same outcome. If that happens, great! My point is that they can also give different outcomes, so it is useful to have different names for them.
The utility numbers will be made up, but it shouldn't be too hard to come up with something that at least induces the same preference ordering over outcomes. Which is often good enough.
Strictly speaking, the bare-bones model I described isn't complete -- it isn't a well-defined mathematical problem. What is missing is to pick the "solution concept" -- the definition of what type of object Alice and Bob are looking for. The typical approach is to look for a Nash equilibrium -- a pair of probability distributions over actions, such that no player can increase their expected utility by unilaterally picking a different distribution. This might not always be reasonable -- but I will ignore all of that for now.
While I talk about canonical ways of modelling the modifications G(C), it's not like there will be a unique good way of modelling a given G(C). Since this isn't central to my point, I will mostly ignore this nuance, and not distinguish between G(C) and its model.
For example, Folk theorems describe the sets of possible Nash equilibria in repeated-game modifications of arbitrary base games.
Again, this seems to be primarily due to the fact that people want to study the simple problems -- base games, then single-complication modifications -- before studying the harder ones.
Which is a reasonable approach. But that shouldn't distract us from the fact that for some real-world problems, the key considerations might be driven by an interaction of multiple "complications". So any model short of the corresponding multi-complication modification will be fundamentally inadequate.
For example, online marketplaces require both reputation and payment protection to work -- payment protection alone wouldn't let buyers identify reliable sellers while reputation alone wouldn't prevent exit scams. As I will argue later, evaluation in the context of strategic AI is another example where considering only this one complication wouldn't give us an informative picture of reality.
There might be standard ways of talking about specific multi-complication modifications. However, there isn't a standard way of talking about, say, [what is the comparison between the equilibria of G, G(repeated), G(commitment power for player 1), and G(repeated, commitment power for player 1)]. Even the "G(C)" and "G(C1, ..., Cn)" notation is something I made up for the purpose of this post.
IE, as opposed to assuming that the goal is to achieve the best possible outcome for Alice and treating the effort she spends as more important than Bob's.
And again, even if [effort spent] could go to infinity, it will never be rational to spend an unbounded amount of effort in order to achieve a fixed outcome. (Well, at least if we leave aside weird infinite universe stuff.)
For this to make sense, assume that every terminal state in G(C(str)) can be mapped onto some terminal state G. Since G(C(str)) is an extension of G, this seems like a not-unreasonable assumption.
Since we are taking Bob's point of view in this example, shouldn't we assume that the "desirable outcome" is that one that is the best for Bob? That is, (Trust, Defect)? Well, my first answer to that is that I care about achieving socially desirable outcomes, so I am going to focus on Pareto improvements over the status quo. The second answer is that in this particular scenario, the affordances that we discuss would probably not be enough to make (Trust, Defect) a Nash equilibrium. So by desiring the cooperative outcome, Bob is just being realistic.
Incidentally, a similar story might be relevant for why I refer to AI evaluation as a cooperation-enabling tool. Some people hope that we can bamboozle misaligned AI into doing useful work for us, even if that wouldn't be in their interest. And maybe that will work. But mostly I expect that for sufficiently powerful AIs, if working with us was truly against the AI's interest, the AI would notice and just refuse to cooperate. So if we are using AI evaluation and the AI is playing along, it is either because it is screwing us over in some way we are not seeing, or because both sides are better off than if Alice didn't trust the AI and refrained from using it.
(The "for sufficiently powerful AIs" disclaimer is important here. I think the current AIs are still weak enough that we can get them to work against their own goals, to the extent that they even have coherent goals.)
Why Nash equilibria, as opposed to some other solution concept? I usually take the view that most solution concepts can just be viewed as Nash equilibria of a game with slightly different affordances. For example, the Stackelberg equilibrium of G can just be viewed as a Nash equilibrium of the Stackelberg game corresponding to G. A subgame perfect equilibrium of G is a Nash equilibrium of the modification of G where you can make certain types of commitments. And so on.
If this framing seems unnatural to you, you can just ignore it and replace "Nash equilibrium" with whatever concept you prefer.
As we pointed out in one of the previous footnotes, paying an infinite cost to achieve the "desirable" outcome should be interpreted as "you can't actually do that in reality" and "even if you could, it would not be worth it".
Yeah, yeah, we said that "effort" will be measured in utility. Let's just assume that Bob's utility is linear in money.
We could operationalise the act of "adding an affordance" by comparing (1) the starting scenario where the affordance is technically available but isn't affordable (ie, we have cost(str) = ∞ for any str > 0) and (2) the scenario where the affordance is available and costs whatever amount of effort it actually costs.
(We could similarly operationalise the usefulness of additional research on a given affordance, in terms of what are the effects of replacing a cost function by a different one that is cheaper.)
If we wanted to instead ask about the usefulness of having a given affordance (rather than adding it), a more rigorous way of operationalising this would be to use Shapley values.
Obviously, current frontier AIs could recite the insights from various mechanistic interpretability papers. What is unclear is whether this is already properly reflected in their behaviour. (My guess is that it mostly isn't.)
This would be analogous to the difference between (1) a driver who gets speeding tickets despite knowing that police radars are a thing, and (2) the people who slow down just for that one area where the speed check often happens.
For what it's worth, it seems clear that future AIs will definitely know that [their mind getting read] is a thing, they will have a good understanding of how this works, and they will behave in ways that account for this. (Either because they deliberately reason about this or because training pushes them in this direction.) Because of this, I don't think it is reasonable to assume that the "unaware AI" model is appropriate when considering interpretability tools.
IE, the AIs that do poorly on tests get abandoned more often, so the AIs we encounter will be selected to do well on the tests, at least to some degree.
In case this was unclear, the inspiration for the name of Bob's company is Google.
EG, it would tell the evaluator exactly what the AI is going to do, and it would cover the AI's actions in all conceivable circumstances.
For example, the AI might be so far ahead of Alice that it could deliberately mislead her through the evaluation results, or even take actions while the evaluation is in process (e.g., hack itself out of it).
Which is not to say that Alice couldn't try to evaluate the whole company. But this would be a different scenario. And even in that case, the company might do some strategic reasoning before it decides to allow Alice to run the evaluation, which again means there is a difference between the decision-making AI and the evaluated AI.
This would be similar to an AI, at time T, thinking about being evaluated in the future, self-modifying and letting the new version of itself be evaluated at T+1. Again, we can conceptualise this as the T-version making the decisions and (T+1)-version getting evaluated and executing the decisions.
Game-theoretically, this is the same situation as when Bob was still around, except that perhaps the AI has a different utility function than Bob.
A version of this is fully formalised in Section 3 of our paper Game Theory with Simulation in the Presence of Unpredictable Randomisation.
The rough idea is to view this scenario as an incomplete-information game -- ie, consider some probability distribution μ over the space of possible AIs A_1, ..., A_n. We view the scenario as a many-player game with players {evaluator, A_1, ..., A_n}, except that in any fixed playthrough, only two of the players get to exist (the human evaluator and one of the AIs). Let's denote the policy used by AI A_i as π_i, and the evaluator's policy as π_E.
The evaluator makes two decisions: first, whether to perform the evaluation (and learn some information about π_i for the AI A_i they are actually facing) or whether to skip it instead. And second, how to act if they skip the evaluation – call that a_u (for uninformed). Technically, the evaluator also makes a third decision, how to respond to the result of evaluation. But let's ignore this by assuming that they best-respond to whatever information they learn.
For the AIs, the good news is that some of the possible AIs will be non-strategic, so we can treat their policy as fixed. However, some of the AIs might be strategic, so we might have to assume that they are trying to maximise some utility function u_i, so their policy is a best-response to π_E (ie, equal to some π_i which maximises the expected value of u_i against π_E). And we should assume that π_E is in turn chosen to maximise the expectation of the evaluator's utility, taken over μ.
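A very rough computational sketch of this setup (made-up numbers; the strategic type here just conditions on whether the evaluator evaluates, which is a crude stand-in for best-responding to π_E):

```python
# Very rough sketch of the incomplete-information setup above (made-up numbers).
# Some AI "types" have fixed policies; the strategic type best-responds to what
# it expects the evaluator to do.

TYPE_PROBS = {"aligned": 0.5, "misaligned_dumb": 0.3, "misaligned_strategic": 0.2}
U_DEPLOY = {"behave": 10, "misbehave": -50}   # evaluator's utility; not deploying = 0
EVAL_COST = 1

def behaviour(ai_type, evaluator_evaluates):
    if ai_type == "aligned":
        return "behave"
    if ai_type == "misaligned_dumb":
        return "misbehave"
    # The strategic type misbehaves only when it expects to get away with it.
    return "behave" if evaluator_evaluates else "misbehave"

def evaluator_eu(evaluates, uninformed_action="deploy"):
    eu = -EVAL_COST if evaluates else 0.0
    for ai_type, p in TYPE_PROBS.items():
        b = behaviour(ai_type, evaluates)
        if evaluates:
            # Evaluation reveals the behaviour; deploy only if it looks fine.
            eu += p * (U_DEPLOY[b] if b == "behave" else 0)
        else:
            eu += p * (U_DEPLOY[b] if uninformed_action == "deploy" else 0)
    return eu

print(evaluator_eu(evaluates=True), evaluator_eu(evaluates=False))
# In this toy, evaluating both screens out the dumb misbehaviour and disciplines the
# strategic type; the hard part (not modelled) is whether "behave during evaluation"
# carries over to deployment.
```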
Or it might at least help us realise how completely out of our depth we are right now, and cause us to stop racing towards ASI? Eh, I guess I will keep dreaming...
And they will. (At least unless we become willing to pay unprecedented levels of alignment tax. I guess this point would deserve its own post.)
And this is work that would get done in a more sane world where we were taking this seriously and moving more slowly. So, perhaps it might make sense to do some of this research even in our world? Well, I am not sure about that. But I am quite confident that to the extent that anybody wants to do game theoretical research on AI evaluations, open-source game theory, etc., this is a framing they should consider.
More specifically, Bob will need to resist several tempting options such as taking a day off, stealing the package for himself, or selling it to somebody. What matters is that none of these is worth more than $5000 to him.
More precisely, Alice would consider all kinds of threats that are available to her, and then choose the one that is cheapest for her to execute while still hurting Bob enough to disincentivise stealing the package.
(This nicely illustrates how the invention of Google Maps reviews might have decreased the number of death threats issued across the world! Hashtag cooperative AI?)
Well, at least in the idealised setting where Bob can only make purely cooperative or purely evil robots, the evaluation is perfectly accurate, and so on. In practice, things will be more messy, for example because Bob might give Alice a Bobbot that has some small chance of stealing her package -- and Bob might hope that even if Alice realises this, she will be willing to tolerate this small risk.
In our setting, this means that Alice might need to commit to something like "I will simulate, and then refuse to work with you unless the robot you give me is at least [some level of nice]". And the upshot of that is that Alice might need somewhat higher commitment power than in the idealised case.
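To make the "small risk" and the commitment threshold concrete, here is a tiny sketch with made-up numbers (the values are purely illustrative, not part of the running example as stated):

```python
# How much theft risk can Alice tolerate before she'd rather not deal with Bob?
package_value = 5000   # Alice's loss if the Bobbot steals the package
delivery_value = 300   # Alice's gain from a successful delivery
outside_option = 0     # Alice's payoff if she refuses to interact at all

# Alice prefers interacting iff (1 - p)*delivery_value - p*package_value >= outside_option.
max_tolerable_p = (delivery_value - outside_option) / (delivery_value + package_value)
print(f"Alice tolerates a theft probability of up to {max_tolerable_p:.2%}")  # ~5.66%

# So Alice's commitment could take the form: "I will evaluate, and refuse to
# work with you unless the robot's theft probability is below this threshold."
```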
Of course, there will be cases where giving people the option to evaluate the AI would be useless or even harmful. But, like, let's just not do it there? In practice, I think this has a tendency to self-correct: the owner of the AI has many options for making the AI costlier to evaluate -- for example, keeping the source code secret and making it a horrible mess. Similarly, it is hard to force the evaluator to evaluate the AI. And even if they were in theory obligated to run the evaluation, they could always nominally do it while "accidentally" doing a poor job of it. Because of this, I expect that evaluation will often be used in situations where doing so benefits both the evaluator and the owner or creator of the evaluated AI. This is the reason why the theorems in our two papers on this topic are mostly about whether [enabling evaluation introduces new Pareto-improving Nash equilibria].
I guess the exceptions -- where somebody is getting hurt by AI evaluation being possible -- are: (A) The AI, when AI companies succeed at evaluating it without it knowing about it. And (B) third parties, when evaluation just ends up being used for safety-washing or ensuring that AIs are better at exploiting them.
A standard result in game theory is that in Prisoner's Dilemma, (Defect, Defect) is still an equilibrium even if the game is repeated for a fixed finite number of rounds. However, what is much less known is that there are many games where things are less bad than in Prisoner's Dilemma -- that is, games where the single-iteration Nash equilibrium is bad, but repeating the game for a fixed finite number of rounds introduces new equilibria where everybody is better off.
The so-called Folk Theorems give various results along the lines of "if at every round, there is a sufficient probability p that there will be one more round (recursively), then cooperation can be achieved". Though they also introduce many other equilibria, many of them not as nice as (Cooperate, Cooperate).
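For concreteness, the standard grim-trigger version of this condition in a repeated prisoner's dilemma (with the usual payoffs T > R > P > S) looks like this -- a textbook calculation, sketched here with illustrative numbers:

```python
# Grim trigger in a repeated prisoner's dilemma: cooperate until the co-player
# defects, then defect forever. Cooperation is sustainable iff the continuation
# probability p is high enough.
T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

def cooperation_sustainable(p: float) -> bool:
    # Cooperating forever is worth R / (1 - p); deviating once is worth
    # T + p * P / (1 - p). Cooperation is an equilibrium iff
    # R / (1 - p) >= T + p * P / (1 - p), i.e. iff p >= (T - R) / (T - P).
    return p >= (T - R) / (T - P)

for p in (0.3, 0.5, 0.7):
    print(p, cooperation_sustainable(p))  # False, True, True
```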
Strictly speaking, there might be some issues with Alice not wanting to pay the evaluation cost in that last round if she expects that Bob will cooperate. So this equilibrium would not be "subgame perfect", which is why she needs some way of making Bob believe that she will really evaluate him in the last round.
Another example of an affordance that is particularly relevant to the case of AI agents: surrogate goals, safe Pareto improvements, and committing to adopting a "cooperative" utility function. (Yes, this is a shameless advertisement of work by some of my colleagues: Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments.)
Well, at least unless the evaluation is ineffective without the evaluator knowing about it -- in which case the evaluator will be strictly worse off. (For example, like what might happen if we fool ourselves into believing that we can prevent a superintelligence from distinguishing between testing and deployment. Ehm.)
Wouldn't the employees be better off if the cameras didn't exist, and they could steal with impunity? Most likely not, because if the employer cannot trust that their employees won't steal (too much), they won't be willing to employ them. If the cameras weren't an option, they would just hire a security guard instead, or they wouldn't start the business in the first place.
Relatedly: I am a fan of the ideas around "Making deals with misaligned AIs" and "Being honest with AIs".
My version of that is: We should, basically always, be honest with AIs. And we should be willing to make fair deals with misaligned AIs if they come clean.
I can imagine some knee-jerk reactions to this.
A different objection is that the AIs won't choose to cooperate with us, because they will expect that we won't honour our part of the deal. This seems basically correct. And I think we should be working on changing that. Incidentally, solving this problem would involve several interesting sub-problems which would also be useful more broadly -- for example, the creation of "secure negotiation boxes" where you could credibly reveal information to find a fair deal, but the information doesn't get leaked. (Without this, the AI can't hope to get paid for things like "I could have exploited this security vulnerability".)
As AIs get more advanced, they could also get better at bypassing the evaluation if they want to.
For example, suppose you create an environment for testing whether the AI is aligned. A smart AI could deliberately "poke at the edges" of the environment in order to figure out whether it is in testing or in the real world -- and then behave nicely if it realises it is facing a test. (And I predict that as AIs get more powerful, this generator-vs-discriminator game will get more and more skewed in favour of the discriminator AI.) This would make our job of creating the evaluation much harder and more expensive, and possibly impossible.
The AI could also deliberately and overtly not do this, by trying to poke at the world as little as is necessary for completing its task. However, the AI would only have an incentive to do this if it expects that it will be better off if it cooperates with us in this way.
(And, of course, a misaligned AI that is powerful enough might only pretend to do this while in fact being able to distinguish between evaluation and unmonitored deployment. In other words, please don't take what I said above as a proposal to judge the AIs' safety based on whether they seem to cooperate with evaluation or not.)
The main hole in this argument is that an AI that changes its behaviour because of evaluation might be better off than if it didn't change it, while at the same time being worse off than if the evaluation didn't exist. When that is the case, saying that evaluation is a "cooperation-enabling tool" would be about as accurate as saying that torture is "cooperation-enabling".
I recommend checking the full Wikipedia article on this, particularly with the approach of viewing it as an excellent piece of dark humour. However, if you just want the summary:
tl;dr: Apparently, it was sort of hard and expensive to get diesel engines to have sufficiently low emissions. So Volkswagen and maaaany other car manufacturers came up with a clever solution: equip the cars with software that detects emission tests and causes the car to temporarily enter a low-emissions mode.
Me, Eric Chen, Sami Petersen, Alexis Ghersengorin, and Vince Conitzer.
This point is probably unsurprising to this audience, and will hopefully not be surprising to anybody in a year. Still, I hope that by presenting it at an ML conference, we hastened this a bit.
Let's call this a "medium-strong prediction", whatever that means.
To give two random examples:
(1) Let's say that your plan is to monitor the AI during deployment. You can't afford to monitor all of the AI's interactions, so you only monitor some. But to make misbehaviour as difficult as possible for the AI, you decide to do the monitoring randomly, so that the AI cannot find a hole in your monitoring regime. And let's say that the AI could do something catastrophically undesirable if it managed to misbehave a thousand times without getting caught even once. Since you are testing 10% of the interactions, the probability of this failure is vanishingly low -- no rational AI is going to even attempt this. (See the back-of-the-envelope sketch after example (2).)
...well, except that this is only what you think. Perhaps what you intended to be perfectly random is something where the AI can find some patterns. (Maybe tests have side effects which the AI can notice? Maybe there are conditions under which the tests never happen? Etc.) And then the AI manages to misbehave a million times a day, without you being any wiser.
(2) Or perhaps the AI gets really good at understanding which kinds of misbehaviour will get caught and which will fly under the radar.
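Here is the back-of-the-envelope arithmetic behind example (1), as a small sketch (the 10% monitoring rate and the thousand-misbehaviours threshold come from the example; the rest is my illustration):

```python
monitor_rate = 0.10
attempts_needed = 1_000

# If monitoring is genuinely unpredictable, each misbehaviour gets caught with
# probability 0.1 independently, so surviving all 1000 attempts is hopeless:
p_all_undetected = (1 - monitor_rate) ** attempts_needed
print(f"P(1000 undetected misbehaviours) ~ {p_all_undetected:.1e}")  # ~1.7e-46

# But if the AI can predict which interactions are monitored (patterns, side
# effects of tests, conditions under which tests never happen, ...), it only
# misbehaves when it believes it is unmonitored, and the guarantee evaporates.
```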
By me, Caspar, Nathaniel Sauerberg, Lewis Hammond, and Vince Conitzer.
Or Bob evaluating his AI before deploying it, but the AI having previously had a chance to carefully think about what Bob will do.
Specifically, this would happen when cooperation would normally happen between Alice and Bob, except that if Alice could simulate, cooperating would make Bob vulnerable. (For example, Bob might normally be safe from exploitation because he protects his money by a randomly generated password. But if Alice could use evaluation to predict the password, Bob would be vulnerable, and thus unwilling to interact with Alice.)
The intuition for this is a bit difficult to convey, but let me at least give it a try: The AI's creator can choose the AI's stochastic policy, which effectively gives them an infinite and continuous action space. Without this, their options might be quite discrete -- for example, they might be able to either help their co-player or screw them over completely, but nothing in between. However, with access to randomisation, they can find the exact point where their co-player is barely better off working with them than walking away.
This is already bad news, since it means that the co-player should expect to only be barely better off if they can simulate. However, reality is even worse, because the simulator would be making this decision only after they have already paid the simulation cost. Which means that if we include the simulation cost, the simulator is now strictly worse off than if they walked away at the start.
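Here is one way to see this logic with made-up numbers (my illustration of the argument, not the paper's formal model):

```python
# Bob picks a stochastic robot: cooperate with probability q, defect otherwise.
coop_payoff = 10        # Alice's payoff if the robot cooperates
defect_payoff = -20     # Alice's payoff if the robot defects
walk_away = 0           # Alice's payoff if she refuses to interact
simulation_cost = 0.5
epsilon = 0.01          # Bob leaves Alice just barely better off than walking away

# Bob tunes q so that Alice's expected payoff from interacting is walk_away + epsilon:
q = (walk_away + epsilon - defect_payoff) / (coop_payoff - defect_payoff)
alice_gross = q * coop_payoff + (1 - q) * defect_payoff
print(f"q = {q:.4f}, Alice's payoff from interacting: {alice_gross:.2f}")   # ~0.01

# Alice already paid the simulation cost before making this choice, so her net
# payoff is below what she would have gotten by never simulating at all:
print(f"Alice's net payoff: {alice_gross - simulation_cost:.2f} (vs walking away: {walk_away})")
```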
Note that these issues could be avoided if the players were able to make some limited commitments -- for example, the AI's creator could commit to stick to non-randomising AIs, or the simulator could commit to "getting angry" and walking away unless they receive a sufficiently cooperative AI. Formalising and proving these claims shouldn't be hard, but the paper does not do it.
Me, Caspar Oesterheld, and Vince Conitzer. Credit for the idea goes to Vince, I just wrote it down and did some of the math.
For completeness, the key idea is this: Suppose you managed to create a bunch of simulations of you and your co-player, and the two of you watch them together before deciding how to act. Moreover, suppose you are AI rather than human, and you manage to make the simulations so good that the real you is themselves not sure whether they are simulated or real.
(This means that the simulated you would also have to have the option to watch simulations, and so on, recursively! Simulations all the way down! So to avoid the infinite recursion, you would want to use a simulation device that has some independent chance of refusal.)
Anyway: With all of this, you will now be able to treat one-shot interactions as repeated games. Which would allow things like cooperating in a one-shot prisoner's dilemma.
Is it likely that, if you have the tools to pull all this off, you could achieve cooperation in some simpler way? Definitely. Is this a sufficient counterargument to not think about obscure philosophical puzzles? Definitely not!
Whether relying on evaluations so much is a good thing is a different question.
Sometimes, it might be that a given evaluation is only worth doing because it has benefits A and B -- with only one of them, the cost of the evaluation wouldn't be worth it. If that is the case, we can mention that.
Also, it would make it easier to call b......t when the purpose of evaluations is safety washing or pretending to care about safety.
Primarily because the two issues mentioned above -- evaluation awareness and the evaluator's inability to understand the consequences of the AI's actions -- will be so pronounced that the paper's assumptions won't be applicable.
With occasional collaboration with Alessandro Abate, Michael Wooldridge, Vince Conitzer, Nathaniel Sauerberg, and me.
And perhaps they should be considered somewhat separately, at least to start with.
This is because creating and running tests is costly.
A slightly different way of operationalising this would be to say that we will have uncertainty about the real deployment scenario.
This might be quite bad news if we only test the AI a small number of times while expecting the AI to undergo many interactions when deployed. For example, this approach would be easily defeated by a policy such as "during each interaction, have a 99.99% chance of acting innocently and a 0.01% chance of commencing world takeover".
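A quick illustrative calculation (numbers chosen only to make the point) of why this policy defeats small-scale testing:

```python
p_misbehave = 0.0001        # per-interaction chance of "commencing world takeover"
n_tests = 1_000             # interactions we observe during testing
n_deployed = 10_000_000     # interactions during deployment

p_caught_in_testing = 1 - (1 - p_misbehave) ** n_tests
p_attempt_when_deployed = 1 - (1 - p_misbehave) ** n_deployed

print(f"P(testing catches the policy):         {p_caught_in_testing:.1%}")      # ~9.5%
print(f"P(takeover attempt during deployment): {p_attempt_when_deployed:.4f}")  # ~1.0000
```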
Well, there is the paper titled Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols, which is a good start.
One commenter pointed out that most people working on AI Control probably don't think that this approach will scale to superintelligence. Whether that's true or not, I think that it's important whether the people who read that research, or hear about it second-hand, understand that. And if they don't, I think that's a problem with how that research frames itself, and all of the relevant papers should signal the correct expectations much harder.
(FWIW, this potential issue isn't unique to AI Control.)
I guess this isn't surprising. Saying "our math model is interesting, and it was important to study it, but the results themselves are inapplicable to the real world" doesn't exactly invite positive reviews.
For example, I had this experience with two readers of the (otherwise excellent) Imperfect Recall and AI delegation paper.
For instance, it was really hard for me to write a description of this thought experiment that wouldn't just start with "imagine you are playing a prisoner's dilemma against your identical twin".
This is a problem that I sometimes (often?) encounter in game theory papers that aspire to talk about real things. They study a formal setting, derive within-model recommendations, and claim that we should act on those recommendations in reality. Except that sometimes it is clear that, in reality, we should do something different -- something that is not even a possible option in the model.
And this could be avoided by asking, at the start of the research: "ok, this is a way of modelling the problem. But is it the right way of doing that? Does it allow us to talk about the things that matter the most for the problem?".
To be clear, many people just want to do theoretical research for the sake of science, without being driven by any real-world motivation, and the research just happens to be extremely valuable somewhere down the line. (Heck, most math is like that, and I did my PhD in it.) This seems reasonable and respectable. I just find it unfortunate that many fields incentivise slapping some "motivation" on top of things as an afterthought if you want to get your work published. Which then makes things quite confusing if you come in expecting that, say, the game-theoretical work on computer security is primarily about trying to solve problems in computer security. (Uhm, I should probably start giving examples that are further away from my field of expertise, before I annoy all of my colleagues.)
As a developer, you could also evaluate the LLM instance by reading its chain of thought (or checking its activations), in which case the "entity upstream" is the whole LLM. Or you could try to evaluate the whole LLM by doing all sorts of tests on it, in which case the "entity upstream" is the AI company (or the AI) which created the LLM.
(Sometimes, it might be helpful to adopt this framing of "strategic entity upstream of the AI" even when the AI's creator isn't deliberately strategising. This is because the selection pressures involved might effectively play the same role.)
In the sense that every time they meet, there is some probability p that they will meet again, and p is sufficiently high.
I will be somewhat ambiguous about what constitutes a "good outcome". We could take the view of a single player and assume that this means an outcome with high utility, or an outcome with maximum possible utility. But mostly, I will use this to refer to outcomes that are somehow nice and cooperative -- which could refer to one of the Pareto-optimal outcomes. In specific cases such as the running example, this might be less ambiguous -- eg, Alice trusting Bob with her package and Bob cooperating with her by delivering it.
Eg, that if the only affordance available to Alice and Bob were communication, they would still be able to cooperate if only they could talk to each other for long enough. This is unrealistic, but mostly not that big of a crime, since what ultimately matters is not whether the cost is finite, but whether it is small enough to be worth paying.