Here's a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn't work to get them to help us:
Downsides with this approach:
Overall, thinking about this makes me:
I explore similar considerations here in The Case for Mixed Deployment (5 min read); the key takeaways were:
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than running them in parallel. That is, I disagree with "If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would've had if you concentrated your bets." This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we'll elicit useful work from the AIs if the labs cross-examine each other's AI research with their own AI variants.
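To make the cross-examination idea a bit more concrete, here's a rough Python sketch of the kind of loop I have in mind. The variant names and the `generate_research`, `critique`, and `judge_exchange` functions are hypothetical placeholders, not a proposal for a specific system:

```python
from itertools import permutations

# Stand-ins for differently trained model variants (hypothetical names).
variants = ["variant_a", "variant_b", "variant_c"]

def cross_examine(generate_research, critique, judge_exchange):
    """Have every variant's research critiqued by every other variant, then judge each exchange."""
    research = {v: generate_research(v) for v in variants}
    scores = {author: [] for author in variants}
    for author, critic in permutations(variants, 2):
        rebuttal = critique(critic, research[author])
        # The judge (human or trusted model) only adjudicates a finished
        # exchange, rather than guessing which direction was promising up front.
        scores[author].append(judge_exchange(research[author], rebuttal))
    return scores
```

The point is that each variant's output has to survive critique from every other variant, so splitting compute across variants buys adversarial scrutiny rather than just parallel copies of the same bet.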
I explore similar considerations here in The Case for Mixed Deployment (5 min read)
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It's easy to implement a monitor such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated-failures regime, humans will typically be able to recognize attempted catastrophes once they're pointed out to them.
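As a rough illustration (not a concrete proposal), a minimal version of that monitor-ensemble gate might look like the following; the monitor functions, the `escalate_to_human` and `execute` hooks, and the threshold are all made-up placeholders:

```python
SUSPICION_THRESHOLD = 0.5  # made-up number; in practice tuned against an audit budget

def gate_action(action, monitors, escalate_to_human, execute):
    """Run a proposed action past every monitor variant; any flag escalates to humans."""
    for monitor in monitors:
        if monitor(action) > SUSPICION_THRESHOLD:
            # In the high-stakes regime, humans can usually recognize an
            # attempted catastrophe once it's pointed out to them.
            return escalate_to_human(action)
    return execute(action)
```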
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so debate complements letting AIs do a lot of work in parallel.
a company successfully solves control for "high-stakes"/"concentrated" threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
- The company probably can't initially tell the schemers apart from the aligned AIs, or indeed tell which of the projects are succeeding on a day-to-day basis. (It's just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it's likely much easier to adjudicate whose solutions make sense to adopt.
- This is similar to how, in well-functioning scientific fields, it's not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard to make them seem good.
I think self-exfiltration via manipulation seems pretty hard. I think we're likely to have transformatively useful systems that can't do that, for some amount of time. (Especially since there's no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and trying to mitigate and estimate the risk.
That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) "impressiveness" of their work, regardless of whether it's good or bad. It's true that they don't intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans' ability to judge what work is good or bad. In addition to the disanalogy you mention, I'd also highlight that (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, they might be analogous to non-experts, and (ii) it's plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it's easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
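A minimal sketch of that condition, where `review` and the rating threshold are illustrative placeholders:

```python
def accepted_by_all(research_item, reviewer_variants, review, min_rating=7):
    """Accept work only if every reviewing variant rates it above the bar,
    so a single capable non-schemer reviewer is enough to block misleading research."""
    return all(review(reviewer, research_item) >= min_rating for reviewer in reviewer_variants)
```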
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It's not enough for the variants to be able to recognize poor research if they can't show their work in a way the schemers can't refute.