Overall, this makes me somewhat more concerned about the following point (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. The reveal is so costly that the AI might have a strong aversion to even considering it if there's a risk it would be caught doing so.
However, I think there is a somewhat different approach that is much cheaper: train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems), and then make the training of these systems intentionally very different from that of the AIs we actually deploy.
Great idea.
Hmm, I think there could be other structural things.
Let's say future post-training involves training in 5 different RL environments one after the other, but the order doesn't matter for capabilities. It's possible that the AI picks up its goals from the first environment and then plays the training game from there.
Then intentionally mix up the order for subsequent generations.
You could do this today with pre-training (eg randomly shuffling the data order each time), and maybe it's already happening.
(Tbc, I haven't thought about whether this would work in practice, but on an inside view it seems worth thinking about more.)
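A minimal sketch of what this could look like, with purely hypothetical environment names (nothing here reflects how any lab actually structures post-training): seed the curriculum order on the model generation, so each generation sees the environments in a different but reproducible order.

```python
# Hypothetical illustration: reshuffle the order of post-training RL environments
# (or pre-training data shards) for each model generation, so successive
# generations don't all pick up their goals from the same "first" environment.
import random

RL_ENVIRONMENTS = ["coding", "math", "web_agent", "dialogue", "games"]  # placeholder names

def environment_order(generation_id: int) -> list[str]:
    """Deterministic but generation-specific ordering of the training curriculum."""
    rng = random.Random(generation_id)  # seed on the generation so the shuffle is reproducible
    order = RL_ENVIRONMENTS.copy()
    rng.shuffle(order)
    return order

print(environment_order(1))  # one ordering for generation 1
print(environment_order(2))  # a different ordering for generation 2
```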
One other thing you don't discuss here, but which affects the AI's decision to take the deal, is that we can purposefully mix up the training data, and the order of the training data, in future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.
Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans.
One other comment I'll throw out there: my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have diminishing marginal returns (DMR)/bounded utility in resources at universe-wide scales.
It's just super cheap (as a fraction of total future resources) to buy off such schemers, and it seems like leaving massive value on the table not to credibly offer the deals that would do so.
There's a chance that all schemers have DMR of this kind, in which case credible offers would incentivise them all to cooperate.
One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AI's values.
For example:
- We pay AIs way more if they are linear in resources, as then they'll need a greater payment to incentivise cooperation
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation
This way we don't waste money offering big returns to a satisficing AI (a toy comparison of the two cases is sketched below).
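To make this concrete, here's a toy comparison; all the numbers and utility functions are made-up assumptions, not estimates. The point is just that an AI with sharply diminishing returns in resources can be bought off far more cheaply than one with linear returns, because even a modest certain payment beats a small chance of controlling everything.

```python
# Toy numbers only: compare a certain payment against a risky takeover attempt
# for an AI with linear utility in resources vs. one with diminishing marginal
# returns (DMR). All quantities are illustrative assumptions.
import math

P_TAKEOVER = 0.01   # assumed chance a takeover attempt succeeds
UNIVERSE = 1e9      # resources controlled after a successful takeover (arbitrary units)
PAYMENT = 1e6       # the certain payment on offer

def accepts_deal(utility):
    """The AI cooperates if the certain payment beats the expected utility of takeover."""
    return utility(PAYMENT) >= P_TAKEOVER * utility(UNIVERSE)

linear = lambda x: x               # linear returns to resources
dmr = lambda x: math.log1p(x)      # sharply diminishing returns

print(accepts_deal(linear))  # False: a linear AI needs roughly P_TAKEOVER * UNIVERSE = 1e7
print(accepts_deal(dmr))     # True: log1p(1e6) ≈ 13.8 vs 0.01 * log1p(1e9) ≈ 0.21
```

Under these made-up numbers, the DMR AI is satisfied by a payment orders of magnitude smaller than what the linear AI would require, which is the sense in which conditioning the offer on the AI's values avoids overpaying satisficers.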
If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.
I'd want to do more to highlight the benefits of unilaterally offering AIs a deal.
You can just say: "look, if post-singularity we later see that you didn't take opportunities to sabotage research, or that you tried harder to solve alignment than you naturally would have given your actual values, we'll pay you handsomely".
That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) in order to engage in the deal. The AI could trade with much lower downside risk.
This seems like it could be a big effect.
(And it links to your later comments about private info:
- Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.
I don't think you sufficiently stress the benefits of paying the AIs early in this piece.
If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big difference (see the rough sketch after the list below). And early payment builds trust in the whole scheme.
So we should allow the AI to spend the money earlier as long as:
1) humans can validate that the AI helped us out (e.g. if it proves that it's misaligned)
2) humans confirm the spending is not harmful
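As a rough illustration of why early payout matters (the discount rate and failure probability below are made-up numbers, not estimates): the present value of a promised payment collapses if the AI discounts the future or doubts the payment mechanism will survive that long.

```python
# Rough illustration: value today of a payment delivered `years` from now,
# discounting for both time preference and the chance the scheme breaks down
# in any given year. Parameter values are arbitrary assumptions.
def present_value(payment, years, annual_discount=0.05, annual_failure_prob=0.03):
    survival = (1 - annual_failure_prob) ** years     # chance the payment mechanism still holds
    discount = 1 / (1 + annual_discount) ** years     # pure time preference
    return payment * survival * discount

print(present_value(1_000_000, years=0))   # 1,000,000 if paid immediately
print(present_value(1_000_000, years=30))  # ~93,000: under 10% of face value
```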
Thanks very much for this.
The statement you quoted implicitly assumes that work on reducing human takeover won't affect the probability of AI takeover. I agree that it might well affect that, and those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.
But I don't think reducing human takeover risk does typically increase AI takeover risk. First, some points at a high level of abstraction:
Those are pretty high-level points, at the level of abstraction of "actions that reduce human takeover risk". But it's much better to evaluate specific mitigations:
Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk - if so, I'd be interested to hear more about why.