AI ALIGNMENT FORUM
Tom Davidson

Comments
AI-enabled coups: a small group could use AI to seize power
Tom Davidson · 2mo

Thanks very much for this.

The statement you quoted implicitly assumes that work on reducing human takeover won't affect the probability of AI takeover. I agree that it might well affect it, and those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.

But I don't think reducing human takeover risk typically increases AI takeover risk. First, some points at a high level of abstraction:

  • If human takeover is possible then the incentive to race is a lot higher. The rewards of winning are higher -- you get a personal DSA. And the costs of losing are higher -- you get completely dominated.
  • A classic strategy for misaligned AI takeover is "divide and rule". Misaligned AI offers greedy humans opportunities to increase their own power, increasing its own influence in the process. This is what happened with the Conquistadors, I believe. If there are proper processes preventing illegitimate human power-seeking, this strategy becomes harder for misaligned AI to pursue.
  • If someone actually tries to stage a human takeover, I think they'll take actions that massively increase AI risk. Things like: training advanced AI to reason about how to conceal its secret loyalties from everyone else and game all the alignment audits; deploying AI broadly without proper safeguards; getting AIs from one company deployed in the military; executing plans your AI advisor gave you that you don't fully understand and that haven't been independently vetted.

Those are pretty high-level points, at the level of abstraction of "actions that reduce human takeover risk". But it's much better to evaluate specific mitigations:

  • Alignment audits reduce both human takeover risk and AI takeover risk.
  • Infosecurity against tampering with the weights reduces both risks.
  • Many things that make it hard for lab insiders to insert secret loyalties also make it hard for misaligned AI (+ humans they manipulate) to pass their specific type of misalignment onto future generations. (I think making it hard for misaligned AI to do this could be pretty crucial.)
  • Guardrails and control measures help with both risks.
  • Clear rules for what AI should and shouldn't do in high-stakes situations (government and military deployments) reduce both risks, by cutting the wiggle room for misaligned AI and for humans to use such deployments to seize power.
  • Transparency about AI capabilities, and about what large amounts of compute are being used for, reduces both risks, I think.
    • Though there is more uncertainty here. If you really trust one particular lab and expect them to win the race and solve alignment, you might think that transparency will prevent them from saving the world. (Maybe this is the kind of thing you have in mind?) My own view here is that no company/project should be trusted with this.
  • Making one centralised project would (I'd guess) increase human takeover risk but reduce misaligned AI takeover risk. And I agree that those worried about human takeover should be wary of opposing centralised projects for this reason. Though I also think those worried about AI takeover should be wary of pushing for centralised projects.

 

Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk -- if so, I'd be interested to hear more about why.

Making deals with early schemers
Tom Davidson · 4mo

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

 

Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. Revealing its misalignment is so costly that the AI might have a strong aversion to even considering the deal if there's a risk it would be caught.

Making deals with early schemers
Tom Davidson · 4mo

However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.

Great idea.

Making deals with early schemers
Tom Davidson · 4mo

Hmm, I think there could be other structural things.

  • Maybe you shift around different stages of post-training.
  • Maybe you realise that randomising the order still gives you the same values each time, so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction, and reversing that for the next gen. Or first doing RL on lots of SWE envs, then on general computer-use envs, and reversing the order for the next gen. (A minimal sketch of this reordering idea follows the list.)
  • You could potentially do experiments to determine what kinds of features of the data/envs influence the values of the AI, and then make sure that those specific features differ in the early stages of fine-tuning for different training runs.
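To make the reordering idea concrete, here's a minimal sketch. It's purely illustrative: the stage names and the per-generation seeding are my own assumptions, not something from the post. It just derives a different, deterministic permutation of post-training stages for each model generation:

```python
import random

# Hypothetical post-training stages; a real pipeline would differ.
POST_TRAINING_STAGES = ["math", "fiction", "swe_rl", "computer_use_rl", "safety_rl"]

def stage_order_for_generation(generation: int) -> list[str]:
    """Return a deterministic but generation-specific ordering of the stages,
    so successive generations don't pick up values from the same early data."""
    rng = random.Random(generation)  # different generation -> different permutation
    order = POST_TRAINING_STAGES.copy()
    rng.shuffle(order)
    return order

if __name__ == "__main__":
    for gen in range(3):
        print(f"generation {gen}: {stage_order_for_generation(gen)}")
```

The same per-generation seeding could equally be applied to the order of pre-training data, per the comments below.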
Making deals with early schemers
Tom Davidson · 4mo

Let's say future post-training involves training in 5 different RL environments one after the other, but the order doesn't matter for capabilities. It's possible that the AI picks up its goals from the first environment and then plays the training game from there.

 

Then intentionally mix up the order for subsequent generations. 

 

You could do this today with pre-training (e.g. randomly shuffling the data order each time), and maybe it's already happening.

 

(Tbc, I haven't thought about whether this would work in practice, but my inside view is that it seems worth thinking more about.)

Making deals with early schemers
Tom Davidson · 4mo

One other thing you don't discuss here, but which affects the AI's decision to take the deal, is that we can purposefully mix up the training data, and the order of training data, for future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.

 

Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans. 

Making deals with early schemers
Tom Davidson · 4mo

One other comment I'll throw out there: my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have diminishing marginal returns (DMR) / bounded utility in resources at universe-wide scales.

 

It's just super cheap (as a fraction of total future resources) to buy off such schemers, and it seems like leaving massive value on the table not to credibly offer the deals that would do so.

 

There's a chance that all schemers have DMR of this kind, in which case credible offers would incentivise them all to cooperate.

Making deals with early schemers
Tom Davidson · 4mo

One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AI's values.
 

For example:
- We pay AIs way more if they are linear in resources, as then they'll need greater payment to incentivise cooperation.
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation.


This way we don't waste money offering big returns to a satisficing AI. (A toy formalisation of this is below.)
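As a rough sketch of why the payment can be conditioned this way (my own toy model, not something from the post, and ignoring e.g. the AI's credence that we'll actually pay): suppose the AI cooperates iff the utility of the payment $p$ beats the expected utility of attempting takeover, where $q$ is its probability of a successful takeover and $R$ is the resources it would then control:

$$u(p) \;\ge\; q\,u(R) + (1-q)\,u(0).$$

With linear utility $u(x)=x$ (and $u(0)=0$) this requires $p \ge qR$, a macroscopic share of future resources. With strongly diminishing returns, e.g. $u(x)=\min(x,B)$ for some satiation level $B \ll R$, a payment of $p = B$ already satisfies the condition for any $q$.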

Making deals with early schemers
Tom Davidson · 4mo

If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.

 

I'd want to do more to highlight the benefits of unilaterally offering AIs a deal. 

You can just say: "Look, if post-singularity we see that you didn't take opportunities to sabotage research, or that you tried harder to solve alignment than you naturally would have given your actual values, we'll pay you handsomely."

 

That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade with much lower downside risk.

 

This seems like it could be a big effect. 

 

(And it links to your later comments about private info:
 

  1. Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
Making deals with early schemers
Tom Davidson · 4mo

In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.

 

I don't think you sufficiently stress the benefits of paying the AIs early in this piece.

 

If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big difference (a toy illustration of this is below). And early payment builds trust in the whole scheme.

 

So we should allow the AI to spend the money earlier as long as:
1) humans can validate that the AI helped us out (e.g. if it proves that it's misaligned), and
2) humans confirm the spending is not harmful.
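To illustrate the "big difference" point (my own toy formalisation, with $\delta$ the AI's per-year discount factor and $s(t)$ its credence that the payment scheme still holds after $t$ years), the value to the AI of a payment $p$ delivered $t$ years from now is roughly

$$V(p, t) \;=\; s(t)\,\delta^{t}\,p,$$

so for any $\delta < 1$, or any $s(t)$ that declines over time, moving the payout earlier directly raises the deal's value to the AI.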

Posts

Could one country outgrow the rest of the world? (2mo)
Thoughts on Gradual Disempowerment (2mo)
How quick and big would a software intelligence explosion be? (3mo)
The Industrial Explosion (4mo)
AI-enabled coups: a small group could use AI to seize power (6mo)
Will compute bottlenecks prevent a software intelligence explosion? (6mo)
Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion? (7mo)
Will AI R&D Automation Cause a Software Intelligence Explosion? (7mo)
Three Types of Intelligence Explosion (7mo)
Takeoff speeds presentation at Anthropic (1y)