AI ALIGNMENT FORUM
Tom Davidson

Comments
AI-enabled coups: a small group could use AI to seize power
Tom Davidson · 2mo

Thanks very much for this.

The statement you quoted implicitly assumes that work on reducing human takeover won't affect the probability of AI takeover. I agree that it might well affect it, and those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.

But I don't think reducing human takeover risk typically increases AI takeover risk. First, some points at a high level of abstraction:

  • If human takeover is possible then the incentive to race is a lot higher. The rewards of winning are higher -- you get a personal DSA. And the costs of losing are higher -- you get completely dominated.
  • A classic strategy for misaligned AI takeover is "divide and rule". Misaligned AI offers greedy humans opportunities to increase their own power, increasing its own influence in the process. This is what happened with the Conquistadors, I believe. If there are proper processes preventing illegitimate human power-seeking, this strategy becomes harder for misaligned AI to pursue.
  • If someone actually tries to stage a human takeover, I think they'll take actions that massively increase AI risk. Things like: training advanced AI to reason about how to conceal its secret loyalties from everyone else and game all the alignment audits; deploying AI broadly without proper safeguards; getting AIs from one company deployed in the military; executing plans your AI advisor gave you that you don't fully understand and that haven't been independently vetted.

Those are pretty high-level points, at the level of abstraction of "actions that reduce human takeover risk". But it's much better to evaluate specific mitigations:

  • Alignment audits reduce both human takeover risk and AI takeover risk.
  • Infosecurity against tampering with the weights reduces both risks.
  • Many things that make it hard for lab insiders to insert secret loyalties also make it hard for misaligned AI (+ humans they manipulate) to pass their specific type of misalignment onto future generations. (I think making it hard for misaligned AI to do this could be pretty crucial.)
  • Guardrails and control measures help with both risks.
  • Clear rules for what AI should and shouldn't do in high-stakes situations (government and military deployments) reduce both risks, by cutting the wiggle room for misaligned AI and for humans to use such deployments to seize power.
  • Transparency about AI capabilities, and about what large amounts of compute are being used for, reduces both risks, I think.
    • Though there is more uncertainty here. If you really trust one particular lab and expect them to win the race and solve alignment, you might think that transparency will prevent them from saving the world. (Maybe this is the kind of thing you have in mind?) My own view here is that no company/project should be trusted with this.
  • Making one centralised project would (I'd guess) increase human takeover risk but reduce misaligned AI takeover risk. And I agree that those worried about human takeover should be wary of opposing centralised projects for this reason. Though I also think those worried about AI takeover should be wary of pushing for centralised projects.

 

Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk -- if so, I'd be interested to hear more about why.

Making deals with early schemers
Tom Davidson · 4mo

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

 

Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. Revealing its misalignment is so costly that the AI might have a strong aversion to even considering the deal if there's a risk it would be caught.

Making deals with early schemers
Tom Davidson · 4mo

However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.

Great idea.

Making deals with early schemers
Tom Davidson · 4mo

Hmm, I think there could be other structural things.

  • Maybe you shift around different stages of post-training.
  • Maybe you realise that randomising the order still gives you the same values each time, so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction, and reversing that for the next gen. Or first doing RL on lots of SWE envs, then on general computer-use envs, and reversing the order for the next gen. (A minimal sketch of this reordering idea follows the list.)
  • You could potentially do experiments to determine what kinds of features of the data/envs influence the values of the AI, and then make sure that those specific features differ in the early stages of fine-tuning for different training runs.
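To make the reordering idea concrete, here's a minimal sketch. It's purely illustrative: the stage names and the per-generation seeding are my own assumptions, not something from the post. It just derives a different, deterministic permutation of post-training stages for each model generation:

```python
import random

# Hypothetical post-training stages; a real pipeline would differ.
POST_TRAINING_STAGES = ["math", "fiction", "swe_rl", "computer_use_rl", "safety_rl"]

def stage_order_for_generation(generation: int) -> list[str]:
    """Return a deterministic but generation-specific ordering of the stages,
    so successive generations don't pick up values from the same early data."""
    rng = random.Random(generation)  # different generation -> different permutation
    order = POST_TRAINING_STAGES.copy()
    rng.shuffle(order)
    return order

if __name__ == "__main__":
    for gen in range(3):
        print(f"generation {gen}: {stage_order_for_generation(gen)}")
```

The same per-generation seeding could equally be applied to the order of pre-training data, per the comments below.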
Making deals with early schemers
Tom Davidson · 4mo

Let's say future post-training involves training in 5 different RL environments one after the other, but the order doesn't matter for capabilities. It's possible that the AI picks up its goals from the first environment and then plays the training game from there.

 

Then intentionally mix up the order for subsequent generations. 

 

You could do this today with pre-training (e.g. randomly shuffling the data order each time), and maybe it's already happening.

 

(Tbc, I haven't thought about whether this would work in practice, but my inside view is that it seems worth thinking more about.)

Making deals with early schemers
Tom Davidson · 4mo

One other thing you don't discuss here, but which affects the AI's decision to take the deal, is that we can purposefully mix up the training data, and the order of training data, for future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.

 

Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans. 

Making deals with early schemers
Tom Davidson · 4mo

One other comment I'll throw out there: my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have diminishing marginal returns (DMR) / bounded utility in resources at universe-wide scales.

 

It's just super cheap (as a fraction of total future resources) to buy off such schemers, and it seems like leaving massive value on the table not to credibly offer the deals that would do so.

 

There's a chance that all schemers have DMR of this kind, in which case credible offers would incentivise them all to cooperate.

Making deals with early schemers
Tom Davidson · 4mo

One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AI's values.
 

For example:
- We pay AIs way more if they are linear in resources, as then they'll need greater payment to incentivise cooperation.
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation.


This way we don't waste money offering big returns to a satisficing AI. (A toy formalisation of this is below.)
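As a rough sketch of why the payment can be conditioned this way (my own toy model, not something from the post, and ignoring e.g. the AI's credence that we'll actually pay): suppose the AI cooperates iff the utility of the payment $p$ beats the expected utility of attempting takeover, where $q$ is its probability of a successful takeover and $R$ is the resources it would then control:

$$u(p) \;\ge\; q\,u(R) + (1-q)\,u(0).$$

With linear utility $u(x)=x$ (and $u(0)=0$) this requires $p \ge qR$, a macroscopic share of future resources. With strongly diminishing returns, e.g. $u(x)=\min(x,B)$ for some satiation level $B \ll R$, a payment of $p = B$ already satisfies the condition for any $q$.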

Making deals with early schemers
Tom Davidson · 4mo

If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.

 

I'd want to do more to highlight the benefits of unilaterally offering AIs a deal. 

You can just say: "Look, if post-singularity we see that you didn't take opportunities to sabotage research, or that you tried harder to solve alignment than you naturally would have given your actual values, we'll pay you handsomely."

 

That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade with much lower downside risk.

 

This seems like it could be a big effect. 

 

(And it links to your later comments about private info:
 

  1. Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
Making deals with early schemers
Tom Davidson · 4mo

In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.

 

I don't think you sufficiently stress the benefits of paying the AIs early in this piece.

 

If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big difference (a toy illustration of this is below). And early payment builds trust in the whole scheme.

 

So we should allow the AI to spend the money earlier as long as:
1) humans can validate that the AI helped us out (e.g. if it proves that it's misaligned), and
2) humans confirm the spending is not harmful.
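To illustrate the "big difference" point (my own toy formalisation, with $\delta$ the AI's per-year discount factor and $s(t)$ its credence that the payment scheme still holds after $t$ years), the value to the AI of a payment $p$ delivered $t$ years from now is roughly

$$V(p, t) \;=\; s(t)\,\delta^{t}\,p,$$

so for any $\delta < 1$, or any $s(t)$ that declines over time, moving the payout earlier directly raises the deal's value to the AI.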

Posts

Could one country outgrow the rest of the world? (2mo)
Thoughts on Gradual Disempowerment (2mo)
How quick and big would a software intelligence explosion be? (3mo)
The Industrial Explosion (4mo)
AI-enabled coups: a small group could use AI to seize power (6mo)
Will compute bottlenecks prevent a software intelligence explosion? (6mo)
Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion? (7mo)
Will AI R&D Automation Cause a Software Intelligence Explosion? (7mo)
Three Types of Intelligence Explosion (7mo)
Takeoff speeds presentation at Anthropic (1y)