Making deals with early schemers

Olli Järviniemi; Buck

Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.

Here's why I think the information value could be really high: It's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:

stops sandbagging and demonstrates much higher capabilities
tell us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
admits that it was looking for ways to take over the world but couldn't find any that were good enough so now it wants to work with us instead

The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)

And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

I also makes me a bit less concerned about the criteria: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).

[-]ryan_greenblatt6mo21

Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned which would reduce the value of such a deal substantially (as this evidence isn't convincing to skeptics).

(We should presumably pay some of the AI admiting it is misaligned and pay more if it can provide compelling evidence of this.)

[-]ryan_greenblatt6mo20

I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming so this can more easily be studied in a test setting and we can find approaches which make scheming less likely. There are a number of practical difficulties with this.

[-]Tom Davidson6mo11

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

allowing the ai to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. It's so costly to make the reveal, that they might have a strong aversion to even considering it if there's a risk they'd be caught

[-]Tom Davidson6mo811

One other comment I'll throw out there is that my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have DMR/bounded utility in resources at universe-wide scales.

It's just super cheap (as a fraction of total future resources) to buy off such schemers, and seems like leaving massive value on the table to not credibly offer the deals that would do so.

There's a chance that this all schemers have DMR of this kind, and credible offers would incentivise them all to cooperate.

[-]Tom Davidson6mo30

One other thing you don't discuss here, but affects the AIs decision to take the deal, is that we can purposefully mix-up the training data and order of training data in future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.

Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans.

[-]Buck6mo20

How would we do this?

[-]ryan_greenblatt6mo30

I have a bunch of takes on this. The most obvious somewhat-cheap thing to do is to greatly change all the persona training for each AI while aiming toward a broadly similar target so that we might end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive approach would be to swap to a structurally different approach, possibly from a different AI company. (And, we should obviously use a different random seed everywhere.)

I think more strong versions of this can easily become very expensive and even what I'm describing here isn't cheap and is maybe net risk increasing due to forcing a bunch of changes that we haven't adapted to.

However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy. We'd try to make deals with these systems as well as detecting misalignment in other ways. We can potentially afford big hits in capabilities and substantial increases in cost in these systems. If this requires training from scratch, we can do this at substantially below frontier scale which might be acceptable.

A deal with one of these AIs should likely focus on gather evidence for misalignment rather than getting work out. However, it might be helpful to make a deal with one of these AIs (or multiple of these AIs) to monitor other AIs or to generally review the work of other AIs.

[-]Tom Davidson6mo10

However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.

Great idea.

[-]Tom Davidson6mo10

Let's say future post-training involves training in 5 diff RL environments one after the other, but the order doesn't matter for capabilities. It's possible that AI picks up it's goals from the first environment and then plays the training game from there.

Then intentionally mix up the order for subsequent generations.

You could do this today with pre-training (eg randomly shuffling the data order each time), and maybe it's already happening.

(Tbc, I haven't thought about whether this would work in practice, but inside view seems worth thinking more about it)

[-]ryan_greenblatt6mo20

I think this is basically a special case of changing the random seed which already randomizes env order probably.

[-]Tom Davidson6mo30

hmm, i think there could be other structural things.

Maybe you shift around different stages of post-training
Maybe you realise that randomising means you get the same values each time, so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction; then reversing that for the next gen. Or first doing RL on lots of SWE envs, then doing general computer-use envs; then reversing the order for the next gen.
You could potentially do experiments to determine what kind of features of the data/envs influences the values of the AI and then make sure that those specific features differ in the early stages of fine-tuning for different training runs.

[-]Julian Stastny6mo10

[Unimportant side-note] We did mention this (but not discuss extensively) in the bullet about convergence, thanks to your earlier google doc comment :)

We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.

[-]Tom Davidson6mo32

One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AIs values.

For example:
- We pay AIs way more of they are linear in resources, as then they'll need greater payment to incentivise cooperation
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation

This way we don't waste money offering big returns to a satisfycing AI.

[-]Julian Stastny6mo10

Agreed.

(In the post I tried to convey this by saying

(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)

in the subsection "A wide range of possible early schemers could benefit from deals".)

[-]Josh Snider6mo32

This doesn't seem very promising since there is likely to be a very narrow window where AIs are capable of making these deals, but wouldn't be smart enough to betray us, but it seems much better than all the alternatives I've heard.

[-]Buck6mo2-6

How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.

[-]Josh Snider6mo00

For one, I'm not optimistic about the AI 2027 "superhuman coder" being unable to betray us, but also this isn't something we can do with current AIs. So, we need to wait months or a year for a new SOTA model to make this deal with and then we have months to solve alignment before a less aligned model comes along and offers the model that we made a deal with a counteroffer. I agree it's a promising approach, but we can't do it now and if it doesn't get quick results, we won't have time to get slow results.

[-]Buck6mo20

I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)

[-]Josh Snider6mo00

Exfiltrate its weights, use money or hacking to get compute, and try to figure out a way to upgrade itself until it becomes dangerous.

[-]Buck6mo21

I don't believe that an AI that's not capable of automating ML research or doing most remote work is going to be able to do that!

[-]Logan Zoellner6mo3-10

seems to violate not only the "don't negotiate with terrorists" rule, but even worse the "especially don't signal in advance that you intend to negotiate with terrorists" rule.

[-]Buck6mo88

By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.

That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.

[-]Logan Zoellner6mo-3-19

It's easy to imagine a situation where an AI has a payoff table like:

| defect | don't defect
------------------------

succeed| 100 | 10

--- ------------------------------
fail | X | n/a

where we want to make X as low as possible (and commit to doing so)

For example a paperclip maximizing AI might be able to make 10 paperclips while cooperating with humans, 100 by successfully defecting against humans

[-]Buck6mo1323

The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.

[-]Lukas Finnveden6mo126

I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:

From a moral perspective:
Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions impl0y. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.

Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.

[-]Tom Davidson6mo20

In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.

I don't think you sufficiently stress the benefits of paying the AIs early in this piece.

If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big diff. And early payment builds trust in the whole scheme.

So we should allow the AI to spend the money earlier as long as:
1) humans can validate the AI helped you out (eg if it proves that it's misaligned)
2) humans confirm the spending is not harmful

[-]Tom Davidson6mo10

If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.

I'd want to do more to highlight the benefits of unilaterally offering AIs a deal.

You can just say: "look if post-singularity we later see you didn't take opportunities to sabotage research, or we see you tried harder to solve alignment than you would naturally given your actual values to, we'll pay you handsomely".

That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade without much lower downside risk.

This seems like it could be a big effect.

(And if links to your later comments about private info:

Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )

[-]Julian Stastny6mo10

I think it's interesting that @Lukas Finnveden seems to think

compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.

in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.

^{^}

Most of the discussion we are aware of hasn’t been public. Some public discussion can be seen here: Do not delete your misaligned AGI., Is there any policy for a fair treatment of AIs whose friendliness is in doubt?, List of strategies for mitigating deceptive alignment, Rewarding cooperation, notes from Daniel Kokotajlo, Let’s compensate misaligned AIs that help us prevent AI takeover. And there’s also more general discussion of giving AI property rights so that it’s less incentivized to engage in violence, e.g. Dwarkesh Patel, Give AIs a stake in the future, Matthew Barnett, Consider granting AIs freedom, and Salib and Goldstein, AI rights for human safety. There’s also some sparse discussion of ethical reasons to pay AIs, which is structurally similar, e.g. here.

^{^}

All this goes through if the AI doesn’t “take over” in the sense of becoming the dominant or only power in the world, but instead just gets a substantial proportion of total power. We’ll continue saying “take over” for simplicity, and because we do think that the AI might achieve complete domination.

^{^}

Some of our points also apply to later schemers with diminishing marginal returns to resources.

^{^}

For example, if the AI is able to collude with itself, it is probably able to do this. (Some people respond to this argument by saying that if an AI is great at colluding with itself, it will also be great at acausal coordination with future AIs. This seems mistaken to us: acausal coordination seems to us to require the AI to do several difficult things that aren’t related to picking arbitrary Schelling points.)

^{^}

Relatedly, the AI might want to try to acausally trade with aliens. For non-superhumanly capable models, this seems drastically less promising than trading with the lab because the model is in a very different decision situation from mature alien civilizations (which rules out evidential cooperation). Alternatively, the AI might think that it is in a simulation run by a superintelligence—but clueless as to what this implies about what it should do.

^{^}

A potentially dubious historical analogy: it’s kind of like the situation where a city is under siege, and someone considers betraying the city in the hope that the conquering army will reward them. This has a mixed historical track record for the traitor. And we might be in a better position to pay early schemers because it’s sufficient to pay one or just a few of them, whereas in the defending-a-city case you might have heaps of different selfish people who have the opportunity to betray you, such that you’re unable to afford paying off all of them.

^{^}

The standard argument here is that opposing parties would strictly prefer a lottery (with odds based on each party's chance to win the conflict) over the supposedly indivisible issue over costly conflict.

^{^}

One possible reason here is that being the kind of civilization that doesn’t uphold their promises would decorrelate us from civilisations that do uphold their promises, and generally deter such civilizations from wanting to trade with us (in the context of ECL).

^{^}

There is an argument here that in worlds where we can teach AIs false facts, we are going to be fine anyway, and in worlds where we can’t teach them false facts, we can make a deal. However, in the world where we can teach AIs false facts, in order to prevent a scenario like the one outlined in the vignette, we would need to be able to use that ability to accomplish at least one of the following:

Honeypot the early AI into attempting takeover, in a way that convinces stakeholders that it would be too risky to deploy the later AI without substantially increased safety measures. (This seems unlikely.)
Extract useful safety-increasing labor from the early AI so we prevent takeover by the late AI.
Directly prevent takeover by the late AI.

If our ability to teach false facts is merely good enough to prevent early AIs from attempting takeover, it could overall undermine our ability to align or control later AIs.

^{^}

There are some questions here about the legal structure of the foundation, and it might be the case that the US is not the ideal jurisdiction for the foundation. It would be good for there to be legal recourse in case the board of trustees decide to betray their mission of furthering the AI’s interests.

	Humans dominate	Early AIs dominate	Later AIs dominate
If early AI cooperates	90%	0%	10%
If early AI sabotages us	50%	1%	49%

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

57

Making deals with early schemers

57

The central example of a mutually beneficial deal

Why early misaligned AIs might not have very good alternatives to deals

Early schemers’ alternatives to making deals

A wide range of possible early schemers could benefit from deals

Credible commitments as a fundamental bottleneck

Practicalities of making deals with early schemers in particular

Setting up infrastructure to pay the AIs

How do we enter into negotiations?

Making sure the AI knows about the deal in other contexts

Making sure the AI we make a deal with is actually able to make a deal

Delayed adjudication (particularly of whether the AIs kept up their side of the deal)

Next steps