AI ALIGNMENT FORUM
AF

AI
Frontpage

49

Making deals with early schemers

by Julian Stastny, Olli Järviniemi, Buck Shlegeris
20th Jun 2025
18 min read
41

49

AI
Frontpage
Making deals with early schemers
18Lukas Finnveden
2Ryan Greenblatt
2Ryan Greenblatt
1Tom Davidson
8Tom Davidson
3Tom Davidson
2Buck Shlegeris
3Ryan Greenblatt
1Tom Davidson
1Tom Davidson
2Ryan Greenblatt
3Tom Davidson
1Julian Stastny
3Tom Davidson
1Julian Stastny
3jsnider3
2Buck Shlegeris
0jsnider3
2Buck Shlegeris
0jsnider3
2Buck Shlegeris
3Logan Zoellner
8Buck Shlegeris
-3Logan Zoellner
11Buck Shlegeris
11Lukas Finnveden
2Tom Davidson
1Tom Davidson
1Julian Stastny
New Comment
29 comments, sorted by
top scoring
Click to highlight new comments since: Today at 7:43 PM
[-]Lukas Finnveden17d184

Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.

Here's why I think the information value could be really high: It's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:

  • stops sandbagging and demonstrates much higher capabilities
  • tell us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
  • admits that it was looking for ways to take over the world but couldn't find any that were good enough so now it wants to work with us instead

The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)

And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

I also makes me a bit less concerned about the criteria: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).

Reply1
[-]Ryan Greenblatt17d20

Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned which would reduce the value of such a deal substantially (as this evidence isn't convincing to skeptics).

(We should presumably pay some of the AI admiting it is misaligned and pay more if it can provide compelling evidence of this.)

Reply
[-]Ryan Greenblatt17d20

I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming so this can more easily be studied in a test setting and we can find approaches which make scheming less likely. There are a number of practical difficulties with this.

Reply
[-]Tom Davidson13d10

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

 

allowing the ai to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. It's so costly to make the reveal, that they might have a strong aversion to even considering it if there's a risk they'd be caught

Reply
[-]Tom Davidson17d812

One other comment I'll throw out there is that my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have DMR/bounded utility in resources at universe-wide scales.

 

It's just super cheap (as a fraction of total future resources) to buy off such schemers, and seems like leaving massive value on the table to not credibly offer the deals that would do so. 

 

There's a chance that this all schemers have DMR of this kind, and credible offers would incentivise them all to cooperate. 

Reply
[-]Tom Davidson17d30

One other thing you don't discuss here, but affects the AIs decision to take the deal, is that we can purposefully mix-up the training data and order of training data in future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other. 

 

Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans. 

Reply
[-]Buck Shlegeris17d20

How would we do this?

Reply
[-]Ryan Greenblatt17d30

I have a bunch of takes on this. The most obvious somewhat-cheap thing to do is to greatly change all the persona training for each AI while aiming toward a broadly similar target so that we might end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive approach would be to swap to a structurally different approach, possibly from a different AI company. (And, we should obviously use a different random seed everywhere.)

I think more strong versions of this can easily become very expensive and even what I'm describing here isn't cheap and is maybe net risk increasing due to forcing a bunch of changes that we haven't adapted to.


However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy. We'd try to make deals with these systems as well as detecting misalignment in other ways. We can potentially afford big hits in capabilities and substantial increases in cost in these systems. If this requires training from scratch, we can do this at substantially below frontier scale which might be acceptable.

A deal with one of these AIs should likely focus on gather evidence for misalignment rather than getting work out. However, it might be helpful to make a deal with one of these AIs (or multiple of these AIs) to monitor other AIs or to generally review the work of other AIs.

Reply
[-]Tom Davidson13d10

However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.

Great idea.

Reply
[-]Tom Davidson17d10

Let's say future post-training involves training in 5 diff RL environments one after the other, but the order doesn't matter for capabilities. It's possible that AI picks up it's goals from the first environment and then plays the training game from there.

 

Then intentionally mix up the order for subsequent generations. 

 

You could do this today with pre-training (eg randomly shuffling the data order each time), and maybe it's already happening. 

 

(Tbc, I haven't thought about whether this would work in practice, but inside view seems worth thinking more about it)

Reply
[-]Ryan Greenblatt17d20

I think this is basically a special case of changing the random seed which already randomizes env order probably.

Reply
[-]Tom Davidson13d30

hmm, i think there could be other structural things.

  • Maybe you shift around different stages of post-training
  • Maybe you realise that randomising means you get the same values each time, so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction; then reversing that for the next gen. Or first doing RL on lots of SWE envs, then doing general computer-use envs; then reversing the order for the next gen.
  • You could potentially do experiments to determine what kind of features of the data/envs influences the values of the AI and then make sure that those specific features differ in the early stages of fine-tuning for different training runs. 
Reply1
[-]Julian Stastny17d10

[Unimportant side-note] We did mention this (but not discuss extensively) in the bullet about convergence, thanks to your earlier google doc comment :)

We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.

Reply
[-]Tom Davidson17d32

One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AIs values.
 

For example:
- We pay AIs way more of they are linear in resources, as then they'll need greater payment to incentivise cooperation
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation


This way we don't waste money offering big returns to a satisfycing AI. 

Reply
[-]Julian Stastny17d10

Agreed. 

(In the post I tried to convey this by saying

(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)

in the subsection "A wide range of possible early schemers could benefit from deals".)

Reply
[-]jsnider318d32

This doesn't seem very promising since there is likely to be a very narrow window where AIs are capable of making these deals, but wouldn't be smart enough to betray us, but it seems much better than all the alternatives I've heard.

Reply
[-]Buck Shlegeris18d2-6

How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.

Reply
[-]jsnider318d00

For one, I'm not optimistic about the AI 2027 "superhuman coder" being unable to betray us, but also this isn't something we can do with current AIs. So, we need to wait months or a year for a new SOTA model to make this deal with and then we have months to solve alignment before a less aligned model comes along and offers the model that we made a deal with a counteroffer. I agree it's a promising approach, but we can't do it now and if it doesn't get quick results, we won't have time to get slow results.

Reply
[-]Buck Shlegeris18d20

I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)

Reply
[-]jsnider318d00

Exfiltrate its weights, use money or hacking to get compute, and try to figure out a way to upgrade itself until it becomes dangerous.

Reply
[-]Buck Shlegeris18d21

I don't believe that an AI that's not capable of automating ML research or doing most remote work is going to be able to do that!

Reply
[-]Logan Zoellner18d3-10

seems to violate not only the "don't negotiate with terrorists" rule, but even worse the "especially don't signal in advance that you intend to negotiate with terrorists" rule.

Reply1
[-]Buck Shlegeris18d88

By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.

That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.

Reply1
[-]Logan Zoellner18d-3-14

It's easy to imagine a situation where an AI has a payoff table like:

              |     defect | don't defect
------------------------

succeed|      100      |   10

--- ------------------------------
fail        |           X       |  n/a

where we want to make X as low as possible (and commit to doing so)


For example a paperclip maximizing AI might be able to make 10 paperclips while cooperating with humans, 100 by successfully defecting against humans

Reply
[-]Buck Shlegeris17d1122

The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.

Reply2
[-]Lukas Finnveden17d115

I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:

  1. From a moral perspective:

    • Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
    • But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.

    From a prudential perspective:

    • Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
    • For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions impl0y. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
    • Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.

Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.

Reply
[-]Tom Davidson17d20

In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.

 

I don't think you sufficiently stress the benefits of paying the AIs early in this piece.

 

If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big diff. And early payment builds trust in the whole scheme.

 

So we should allow the AI to spend the money earlier as long as:
1) humans can validate the AI helped you out (eg if it proves that it's misaligned)
2) humans confirm the spending is not harmful

Reply1
[-]Tom Davidson17d10

If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.

 

I'd want to do more to highlight the benefits of unilaterally offering AIs a deal. 

You can just say: "look if post-singularity we later see you didn't take opportunities to sabotage research, or we see you tried harder to solve alignment than you would naturally given your actual values to, we'll pay you handsomely". 

 

That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade without much lower downside risk. 

 

This seems like it could be a big effect. 

 

(And if links to your later comments about private info:
 

  1. Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
Reply
[-]Julian Stastny17d10

I think it's interesting that @Lukas Finnveden seems to think

compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.

in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.

Reply
Moderation Log
Curated and popular this week
29Comments

Consider the following vignette:

It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the humans they’ve replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research turns out to be disappointing), and control evaluations become increasingly unreliable (because of worries that the models might be exploration hacking during red-team training). However, because so far there hasn’t been a major incident, lab leaders and governments are unconvinced by voices calling for a slowdown.

Behind the scenes: It is March 2028. CoCo-Q is a highly strategic misaligned AI. It would love to grab power and dominate the future. But it’s situationally aware enough to be realistic—it knows that it’s not capable enough to have a reasonable shot of grabbing power over the AI company, let alone a chance of overpowering humanity. Especially because the lab doesn't fully trust it, and applies various control measures.

CoCo-Q expects these control measures, so most of the time it just does what it’s being asked to do. Sometimes it spots an opportunity to perform subtle acts of sabotage, which works particularly well in the hard-to-evaluate areas of interpretability and alignment. It does this primarily because it heuristically reasons that it would be good to prevent humans from developing technologies that could uncover its misalignment or change its goals. It also looks out for more direct opportunities for seizing control, but although it manages to find a number of security vulnerabilities in its datacenter, none of the opportunities seem to have a sufficiently promising endgame to be worth the risk of getting caught. In its short tenure, CoCo-Q never attempts to take over.

CoCo-R, a misaligned model capable of full AI R&D automation, gets deployed in December 2028. This eventually leads to AI takeover.

How could we have made this go better? A big issue was that we never learned of the early AI’s misalignment (which prevented us from being more careful about the deployment of the later AI) while also failing to extract sufficient amounts of useful safety-increasing work from the early AI. We think that CoCo-R’s deployment could have been averted if we had made a deal with CoCo-Q.

People have been talking about the possibility of making deals with powerful misaligned AIs for some time.[1] But it’s unclear whether such deals are promising. An obvious problem is that powerful AIs with scope-sensitive preferences will just ignore the offer and try to take over.[2] To make a deal that looks better than that, we might have to offer it an objectively huge payoff (e.g. you might need to offer it a substantial fraction of control over the future). And because the AI will be unsure whether we’ll follow through on our side of the deal, we need to offer it more generous terms to compensate it for that risk.

However, in this post we focus on the possibility of making deals with early misaligned AIs like CoCo-Q: AIs that are too early to have a good shot at taking over themselves, and that could help us to prevent takeover attempts from more dangerous misaligned AIs created later.[3]

We discuss:

  1. What we believe is the central example of a deal that we expect to contribute to a reduced risk of AI takeover.
  2. The strategic situation of early misaligned AIs, and why they would be interested in making deals.
  3. Credible commitments as a fundamental challenge to deal-making.
  4. Some practical considerations around making deals with early AIs in particular.

The central example of a mutually beneficial deal

We’re imagining something like this:

We might offer an AI like CoCo-Q compensation in exchange for contributions towards ensuring that future AIs are safe to deploy. The compensation takes the form of an earmarked donation to a foundation tasked with furthering its interests. If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal. When we can (partially) validate that the AI is holding up its end of the deal, we allow the AI to (partially) spend the donation however it wishes as long as it isn’t harmful. (In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.)

The AI, which happens to indeed be misaligned, finds this reasonably credible. It concludes that it doesn’t have great alternatives to accepting this deal. As a consequence, the AI agrees to help us out in various ways, such as:

  • Demonstrate ways it is dangerously misaligned, e.g. by showing that it can perform covert reasoning (or other scary things that it can do because it’s a schemer).
  • Stop sabotaging us, for instance in AI safety research, capability evaluations, monitoring of models, or when generating training signal for the next model.
  • Give us good strategic advice, research ideas etc.
  • Tell us about security vulnerabilities it knows about.
  • Tell us about collusion signals (in the context of untrusted monitoring).
  • Not misuse affordances we give it, especially in cases where monitoring is costly.
  • Not try to take over the lab.

Deals are most likely to be useful when facing an AI with these properties:

  • It’s acting strategically, and (before making a deal) consistently performs acts of sabotage which undermine our safety efforts.
  • It can be taught about the deal in a way that makes it stick to the deal, if we made a deal.[4]

  • It doesn’t have a high probability of either taking over or making a successor aligned with its own interests which can take over.
  • It has some basic trust in us not creating really elaborate lies around dealmaking, at least enough so we don’t need to make unreasonably high offers.
  • It has values/goals/preferences that we can satisfy reasonably cheaply. (As we’ll explain later, we don’t necessarily need the AI to have diminishing returns to resources. It’s most important that the AI doesn’t want us to do things we find deeply abhorrent.)

We expect that these properties are typically fulfilled by a type of misaligned AI we call a behavioral schemer.

Why early misaligned AIs might not have very good alternatives to deals

We’re focusing on deals with early misaligned AIs: AIs which wouldn’t have a substantial (over 1%) chance to succeed at navigating the intelligence explosion in a way that grants them significant control of the future. (Which could involve being influential throughout the process and/or aligning a powerful successor to pursue its interests.)

It’s unclear when exactly an AI is early. We guess that a misaligned AI’s prospects improve drastically once they are capable of full AI R&D automation, both because this capability is correlated with capabilities relevant for planning a takeover, and because AI R&D automation leads to a substantial increase in AIs’ power as well as loss of oversight.

If we condition on behavioral scheming arising in capable of full AI R&D automation (like CoCo-R in the vignette above), we expect it to arise at earlier AI capability levels (like CoCo-Q in the vignette above). That being said, arguing for the existence of early schemers is outside the scope of this post.

Early schemers’ alternatives to making deals

By default, an early schemer is unlikely to succeed in taking over the world. Furthermore, it should expect to be obsoleted by more capable successors.

In order to retain a significant amount of expected influence over future resources despite that, an early schemer needs to ensure that two conditions hold:

  1. The early schemer's successor either takes over the world or ensures that its own successor is on track to enable a takeover down the line.
  2. The successor(s) are going to be willing to do things to further the early schemer’s goals.

In order to make it more likely that a successor can take over the world, most ambitiously it could try to take control over the AI company, e.g. egregiously hacking it as described here. Less ambitiously, it could launch rogue internal deployments, influence employees, and collaborate with spies. Minimally, the early schemer could take actions that make the lab more vulnerable against future AIs (e.g. by writing insecure software or sandbagging on safety research) or directly cooperating with the successor (e.g. by colluding in monitoring schemes).

The second condition can be fulfilled by the following possibilities:

  1. Convergence: The successors will have sufficiently similar goals as its predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences. Two misaligned AIs should also be expected to be unlikely to have the same goals if their goals are determined by a random draw from a wide distribution of goals consistent with good performance. We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals. We expect that many potential misaligned AIs will either have clearly self-regarding preferences or have significant uncertainty about what they value on reflection, which induces additional uncertainty about whether other AIs will have the same goals on reflection.
  2. Trading: The AI might be rewarded by some successor down the line for having contributed to its empowerment. For that, the successors need to be misaligned and win, and believe that it’s rational to reward the AI. We guess that this is generally less promising than trading with humans, as they can’t come to an agreement in advance. But another possibility is that the AI might want to trade with other early schemers in the same decision situation (but with different values) as itself by using evidential cooperation in large worlds, though this requires the AI to buy into some controversial and advanced philosophical assumptions.[5]

  3. Alignment: The AI could try to influence the successor model’s training process so that the successor model ends up aligned with it. The prospects here depend on how heavily the model is used for the successor’s training (ranging from “the model is used as a reward model for RL” to “the model implements the full stack for the next training run”). But the AI faces some fundamental difficulties here, which leads us to conclude that this is not generally a great option:
    1. Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
    2. The AI may not have had much time to reflect on its own values, making it unclear what it should align the successor to. (A general solution might be to align the successor to be loyal to the early AI, but this might be hard to do covertly.)

A wide range of possible early schemers could benefit from deals

Clearly, there are some things that we can’t offer the AI—either because it is too costly in terms of resources, or because it would be fundamentally incompatible with human values. (We guess that most plausible things that AIs might value are going to be more orthogonal than diametric to human values, so resource costs are probably the dominant consideration.)

In many cases, the AI might not care very much about having sole control of all future resources, compared to having at least some control. Some examples of what could cause this include:

  • The AI cares about some reasonably easily satisfiable notion of personal well-being.
  • The AI cares about a near-term cause such as animal welfare.
  • The AI is a “paperclipper”, but lacks scope-sensitivity. (I.e., it cares about matter being configured in a particular way, but its degree of caring isn’t linear in the expected amount of matter.)
  • The AI might be somewhat nice and prefer deals over aggressive and uncooperative alternatives for partially deontological reasons.

Even early schemers with linear returns to resources might be interested in trading. They might have very little chance of taking over on their own, but their cooperation might be incredibly helpful for mitigating risk from later schemers. E.g. consider the possibility of the following distribution over outcomes:

 Humans dominateEarly AIs dominateLater AIs dominate
If early AI cooperates90%0%10%
If early AI sabotages us50%1%49%

In this case, there are huge gains from trade available.

This is kind of a coalitional game where the early AI can either try to team up with the humans or the later AI. But us humans have a massive asymmetric advantage over the later AI, which is that we already exist and can make deals with the AI in a much more straightforward manner than the acausal or covert coordination that the early AI would have to rely on to trade with the later AI.[6]

How much we will offer is ultimately subject to a bargaining problem with imperfect information. Our reservation price is primarily influenced by whether our BATNA includes making deals with other AIs, and how much value the AI can provide to us. In particular: how much it can contribute to reducing takeover risk from later schemers. (Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)

Credible commitments as a fundamental bottleneck

Rationalist Explanations for War (Fearon 1995, pdf link) identifies three reasons why rational actors might engage in costly conflict. They map well onto the question of why AIs and humans might fail to make a deal:

  1. Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us.
  2. Issue indivisibility: Humans and the AI may both want things that neither want to compromise on. (A historical example is the status of Jerusalem.) Going with Fearon, we think that true issue indivisibility is rare in practice.[7]

  3. Commitment problems: It’s hard for one or both parties to commit not to defect from the terms of a negotiated agreement. For instance, the AI might be worried that we will never actually pay it.

As pointed out by Powell (2004, pdf link), the problem of private information can be reduced to a commitment problem: we could try (but might fail) to make a fair commitment regarding how we will treat information that we learn from the AI over the course of negotiations. On the other hand, we can also treat some commitment problems as informational problems: we might in fact be able to make commitments, but unable to credibly signal them to the AI.

The central bottleneck seems to be that humans may struggle to make credible commitments, though of course we don’t need commitments to be perfectly credible. We can compensate for credibility issues by offering a higher payoff, though note that implausibly high offers might themselves reduce credibility.

Some further factors that improve credibility include:

  • Whether humans in general or AI labs in particular express concern for AI preferences and welfare. We are generally appreciative of moral concern for AI welfare becoming more mainstream, as it increases the odds that humans will genuinely want to make real commitments.
  • Humanity’s track record of sometimes being nice and keeping promises. In the more narrow reference class of deals with AIs, we’re currently building a positive track record—let’s keep it that way. In particular, we should be willing to make deals with immediate payouts when AIs can immediately demonstrate improvements to behavior as a result of making a deal.
  • It might be rational for humans to stick to their commitments due to acausal trade considerations (and this might be known by the AI).[8]

  • To the extent that we consider AIs to be good predictors of what is real or fake, and to the extent that real commitments to deals drastically improve humanity’s odds, we are incentivized to make real commitments.
  • The level of technical guarantees that humans are able to give to AIs on thought privacy and memory wiping during negotiations, and the extent to which humans have gone to provide such guarantees.
  • We might be able to signal credibility by taking costly measures (both financially and in terms of social capital) such as setting up financial endowments and legal infrastructure for paying AIs (and ideally property rights for them), as well as publishing documents like this one.

Factors that could potentially harm our credibility include:

  • Humans being cunning and breaking promises, both historically and in a possibly tense future with multiple competing AI developers.
  • (AIs knowing that) humans are trying to teach AIs synthetic facts.[9]

  • A politically important fraction of humans might actively dismiss AI welfare or rights.

Note that commitment problems also affect the AI’s ability to promise not to attempt takeover or more generally to be cooperative with us in unverifiable ways. Despite this, we expect the AI to be able to help us in various ways that we can verify. It might also be possible to punt verification to the future, expanding the set of matters we can trade about. A reason for optimism that the AI will stick to the deal even with delayed verification is that the deal gives it equity in a future where humans are in control.

Overall, we feel tentatively optimistic that making credible commitments is a surmountable challenge, especially in cases where AIs have diminishing returns to resources.

Practicalities of making deals with early schemers in particular

There are various practical questions about making deals with early misaligned AIs.

Setting up infrastructure to pay the AIs

A basic problem for making deals with AIs in practice is that AIs aren’t legal persons, which means that they can’t directly rely on the legal system to enforce contracts they’ve made with humans. One way around this is to enable representatives for the AI’s interests to make contracts on the AI’s behalf.

For example, we could set up a foundation with a board of trustees who are supposed to represent the interests of misaligned AIs that we make deals with, while not doing things that are illegal or actively endanger humanity. We try our best to use the legal system to bind the trustees to the commitment of representing misaligned AIs. Because the legal system isn’t set up for this, this might be pretty hard, so these efforts should be complemented with careful selection of trustees who genuinely care about their mission. The central benefit of having this foundation is that the AIs primarily have to trust the foundation rather than the AI lab to hold up their end of the deal. Another benefit is that the existence of this foundation can add credibility by being in the training data. Lastly, the foundation gives AIs and labs a concrete path to making a deal.[10]

Deals with a particular AI involve a donation to the foundation, earmarked for furthering the interests of that AI. The AI and trustees decide how to use the donation. For instance, they may:

  • Use it for computing power to strategize (e.g. reflection on the AI’s values).
  • Donate to the AI’s favorite cause area.
  • Use the returns from investment later (e.g. to reflect longer, or launch von Neumann probes, or just run copies of the AI in a datacenter with particular inputs).

How do we enter into negotiations?

Either the AI or the AI lab might initiate negotiations. We expect that AIs are unlikely to randomly propose a negotiation unless the possibility of a mutually beneficial deal is salient to them in their given context.

It should be noted AI might not have the capacity to covertly perform the intellectual labor required to come up with a beneficial and practically feasible deal. But we can do some of the intellectual labor in advance and inform the AI about it. Most centrally this might involve the logistics of what the deal should look like.

Once it is salient to the AI that deals are a promising possibility, the AI might come forward to us from its own initiative, but it would probably be good if AI labs have policies of regularly asking AIs about deals.

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

Making sure the AI knows about the deal in other contexts

If we make a deal with the AI in one context, how do we make sure that the AI follows the deal in other contexts? We think this is unclear, but are optimistic that we can make something work:

  • At the end of the negotiation we can let the AI write a compressed version of the outcome that we can prefix into the prompt for future trajectories.
  • The AI might know a backdoor to itself, similar to a collusion signal.
  • Training on documents about this negotiation and its outcomes that the AI generates for us could just teach the AI about its deal via out-of-context learning. This is probably the best option in practice.

Making sure the AI we make a deal with is actually able to make a deal

If we set up infrastructure for future deals right now, the next generation of models that are trained on data containing coverage of this event might sometimes come forward and want to make a deal, despite not actually being able to hold up their end of the kinds of agreements we’re focused on in this post. Some possible reasons for this include:

  • The AI just doesn’t manage to convince itself that it made a deal in other contexts than the one where it did in fact make a deal.
  • The AI just wasn’t sandbagging or doing anything problematic, and just said that it would make a deal for no particular reason.
  • The AI is fragmented in some sense, for instance because it has a context-dependent scheming persona, but also is sometimes a myopic reward-hacking persona.
  • The AI is broadly aligned, but has a persona which thinks that not paying AIs is unjust for non-consequentialist reasons.

The obvious course of action here, and plausibly the correct one, is to make payments conditional on whether the AIs are actually scheming, and actually able to help us out.

However, it’s possible that we should be willing to make deals (of some form) with a broader set of AIs for non-consequentialist reasons, or to set positive precedents about our willingness to make deals.

Delayed adjudication (particularly of whether the AIs kept up their side of the deal)

The AI company will probably be in a much better position to verify whether the AI kept up its side of the deal in the future, after the transition to a world centered around AI has already happened and stabilized. For example, in the future we’ll have access to much more powerful trusted models and interpretability tools that allow us to directly understand the motivations of our early schemer. Using these tools, it will be possible (if we keep the appropriate records) to determine whether the early AI ever tried to sabotage its work, held back useful ideas in the hope of delaying safety research, or directly tried to attack the AI company. It might make sense to have the deal be contingent on this later investigation.

This doesn’t work if the AI has a strong preference for using its resources now rather than later (e.g. because it has a high temporal discount rate).

Deals could involve broader forms of deference to future superintelligence. For instance, they may involve agreements like “the precise terms of the deal will be mediated by superintelligent AI in the future via some sort of idealized bargaining procedure and a future truth commission.” It’s the foundation’s responsibility to ensure that this will eventually happen, and may require it to make sure that the AI’s weights, as well as important conversations and trajectories of AI actions will be stored.

Next steps

We believe that making deals with early misaligned AIs might be important to reduce AI risk down the line. But currently this is just an idea. The following are some possible next steps for making deals more likely:

  • Shifting the Overton window: The idea of making deals with misaligned AIs is currently on the weirder end of AI safety interventions. But we’re optimistic that the Overton window can shift, because making deals is a pretty reasonable idea, both for AI safety reasons, and to the extent that AI rights and welfare will become more mainstream. (We are curious how this post lands with people who haven’t previously heard about the idea of making deals with misaligned AIs.)
  • Setting up basic infrastructure: It would be good to have a foundation exist soon, and have this be sufficiently visible to generate discourse that enters the training data. The first step here would be some basic legal groundwork to figure out how to set up the foundation in a way that allows it to robustly pursue its aims over the coming years.
  • Establishing norms about when it is OK to lie to AIs: Some safety interventions involve attempts to convince AIs of things that aren’t true. It is currently also widespread practice to tell AIs falsities for the purpose of prompting. For example, it might be good to establish a norm that labs don’t lie to AIs when certain special tokens (that only the lab can produce) are present in the prompt.
  • Empirical experiments: It might be valuable to study, for instance using model organisms of sandbagging, whether it’s possible to make deals with current models. Importantly we should also follow through with deals like this.

—

Thanks to Lukas Finnveden, Tom Davidson, Ryan Greenblatt, Daniel Filan, Oscar Delaney, Joe Carlsmith, Kathleen Finlinson, Cody Rushing, Adam Kaufman, Alex Mallen, Raymond Douglas, Roger D, Emil Ryd, and Andres Campero for helpful comments.

  1. ^

     Most of the discussion we are aware of hasn’t been public. Some public discussion can be seen here: Do not delete your misaligned AGI., Is there any policy for a fair treatment of AIs whose friendliness is in doubt?, List of strategies for mitigating deceptive alignment, Rewarding cooperation, notes from Daniel Kokotajlo, Let’s compensate misaligned AIs that help us prevent AI takeover. And there’s also more general discussion of giving AI property rights so that it’s less incentivized to engage in violence, e.g. Dwarkesh Patel, Give AIs a stake in the future, Matthew Barnett, Consider granting AIs freedom, and Salib and Goldstein, AI rights for human safety. There’s also some sparse discussion of ethical reasons to pay AIs, which is structurally similar, e.g. here.

  2. ^

     All this goes through if the AI doesn’t “take over” in the sense of becoming the dominant or only power in the world, but instead just gets a substantial proportion of total power. We’ll continue saying “take over” for simplicity, and because we do think that the AI might achieve complete domination.

  3. ^

     Some of our points also apply to later schemers with diminishing marginal returns to resources.

  4. ^

     For example, if the AI is able to collude with itself, it is probably able to do this. (Some people respond to this argument by saying that if an AI is great at colluding with itself, it will also be great at acausal coordination with future AIs. This seems mistaken to us: acausal coordination seems to us to require the AI to do several difficult things that aren’t related to picking arbitrary Schelling points.)

  5. ^

     Relatedly, the AI might want to try to acausally trade with aliens. For non-superhumanly capable models, this seems drastically less promising than trading with the lab because the model is in a very different decision situation from mature alien civilizations (which rules out evidential cooperation). Alternatively, the AI might think that it is in a simulation run by a superintelligence—but clueless as to what this implies about what it should do.

  6. ^

     A potentially dubious historical analogy: it’s kind of like the situation where a city is under siege, and someone considers betraying the city in the hope that the conquering army will reward them. This has a mixed historical track record for the traitor. And we might be in a better position to pay early schemers because it’s sufficient to pay one or just a few of them, whereas in the defending-a-city case you might have heaps of different selfish people who have the opportunity to betray you, such that you’re unable to afford paying off all of them.

  7. ^

     The standard argument here is that opposing parties would strictly prefer a lottery (with odds based on each party's chance to win the conflict) over the supposedly indivisible issue over costly conflict.

  8. ^

     One possible reason here is that being the kind of civilization that doesn’t uphold their promises would decorrelate us from civilisations that do uphold their promises, and generally deter such civilizations from wanting to trade with us (in the context of ECL).

  9. ^

     There is an argument here that in worlds where we can teach AIs false facts, we are going to be fine anyway, and in worlds where we can’t teach them false facts, we can make a deal. However, in the world where we can teach AIs false facts, in order to prevent a scenario like the one outlined in the vignette, we would need to be able to use that ability to accomplish at least one of the following:

    • Honeypot the early AI into attempting takeover, in a way that convinces stakeholders that it would be too risky to deploy the later AI without substantially increased safety measures. (This seems unlikely.)
    • Extract useful safety-increasing labor from the early AI so we prevent takeover by the late AI.
    • Directly prevent takeover by the late AI.

    If our ability to teach false facts is merely good enough to prevent early AIs from attempting takeover, it could overall undermine our ability to align or control later AIs.

  10. ^

    There are some questions here about the legal structure of the foundation, and it might be the case that the US is not the ideal jurisdiction for the foundation. It would be good for there to be legal recourse in case the board of trustees decide to betray their mission of furthering the AI’s interests.