Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I'm not going to repeat all of the literature on debate here, but as brief pointers:
(Also, the "arbitrary amounts of time and arbitrary amounts of explanation" was pretty central to my claim; human disagreements are way more bounded than that.)
I do, but more importantly, I want to disallow the judge understanding all the concepts here.
I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)
"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?
The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.
So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
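To make the incentive claim concrete, here is a toy sketch (all names and the scoring rule are mine, purely illustrative, not from the debate literature) of the consistency-penalty term: a policy that answers every cross-examination question from one fixed belief state incurs no penalty, while a policy that tailors its answers to context gets penalized.

```python
def consistency_score(policy, question, contexts, penalty=1.0):
    """Return the cross-examination consistency term of a debater's reward.

    The debater is asked the same question in several contexts; each
    distinct extra answer beyond the first costs `penalty`.
    """
    answers = [policy(question, ctx) for ctx in contexts]
    inconsistency = len(set(answers)) - 1  # 0 if all answers agree
    return -penalty * inconsistency

# An honest policy answers from a fixed belief, regardless of context.
honest = lambda q, ctx: "impossible"
# A dishonest policy tailors its answer to whatever the context rewards.
dishonest = lambda q, ctx: "possible" if ctx == "friendly" else "impossible"

contexts = ["friendly", "adversarial"]
```

Under this scoring rule the honest policy scores 0 and the dishonest one scores -1, so any win the dishonest policy gets from context-dependent answers must outweigh the penalty; scaling the penalty up removes that option.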
Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc. - which, in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)
I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument as:
The inventor can disagree with one or more of these claims, then we sample one of the disagreements, and continue debating that one alone, ignoring all the others. This doesn't mean the judge understands the other claims, just that the judge isn't addressing them when deciding who wins the overall debate.
If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like "the principle has been tested many times", "in the tests, confirming evidence outweighs the disconfirming evidence", "there is an overwhelming scientific consensus behind it", "there is significant a priori theoretical support" (assuming that's true), "given the above the reasonable conclusion is to have very high confidence in conservation of energy". Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.
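The recursion described above (decompose a claim into subclaims, sample one contested subclaim, ignore the rest, repeat until something concrete) can be sketched as follows. This is a hypothetical illustration, not an implementation from any paper; `decompose`, `disagrees`, and `judge_decides` are stand-ins for the debaters' and judge's roles.

```python
import random

def debate(claim, decompose, disagrees, judge_decides, max_depth=10):
    """Resolve `claim` by recursive decomposition.

    decompose(claim)    -> list of supporting subclaims ([] if atomic)
    disagrees(subclaim) -> True if the opposing debater contests it
    judge_decides(c)    -> judge's verdict on a single concrete claim
    """
    for _ in range(max_depth):
        subclaims = decompose(claim)
        if not subclaims:
            break  # concrete enough for the judge to check directly
        contested = [c for c in subclaims if disagrees(c)]
        if not contested:
            return True  # opponent concedes every subclaim
        # Ignore all other disagreements; recurse on one sampled subclaim.
        claim = random.choice(contested)
    # The judge only ever rules on this one leaf claim, not the whole tree.
    return judge_decides(claim)
```

Note the key property: the judge never needs to understand the siblings of the sampled subclaim, only the single concrete leaf the recursion bottoms out on.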
If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of things that aren't accepted by any authorities the judge trusts, I still think it's probably doable with a larger debate, but it's harder for me to play out what the debate would look like, because I don't know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that's not central.
The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?
That seems right, but why is it a problem?
The honest strategy is fine under cross-examination: it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying that perpetual energy machines are impossible, sometimes saying that they are possible).
There are several different outs to this example:
The process proposed in the paper
Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.
Strongly agree on the first challenge; on the theory workstream we're thinking about how to deal with this problem. Some past work (not from us) is here and here.
Though to be clear, I don't think the empirical evidence clearly rules out "just making neural networks explainable". Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven't been tried very much yet. I don't think "asking an AI what another AI is doing and doing RLHF on the response" is nearly as good; that is much more likely to lead to persuasive explanations that aren't correct.
I'm not that compelled by the second challenge yet (though I'm not sure I understand what you mean). My main question here is how the AI system knows that X is likely or that X is rare, and why it can't just explain that to the judge. E.g. if I want to argue that it is rare to find snow in Africa, I would point to weather data I can find online, or point to the fact that Africa is mostly near the Equator, I wouldn't try to go to different randomly sampled locations and times in Africa and measure whether or not I found snow there.
It depends fairly significantly on how you draw the boundaries; I think anywhere between 30 and 50 is defensible. (For the growth numbers I chose one specific but arbitrary way of drawing the boundaries, I expect you'd get similar numbers using other methods of drawing the boundaries.) Note this does not include everyone working on safety, e.g. it doesn't include the people working on present day safety or adversarial robustness.
Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.
I expect that, absent impressive levels of international coordination, we're screwed.
This is the sort of thing that makes it hard for me to distinguish your argument from "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
I agree that, conditional on believing that we're screwed absent huge levels of coordination regardless of technical work, then a lot of technical work including debate looks net negative by reducing the will to coordinate.
What kinds of people are making/influencing key decisions in worlds where we're likely to survive?
[...]
I don't think conditioning on the status-quo free-for-all makes sense, since I don't think that's a world where our actions have much influence on our odds of success.
Similarly this only makes sense under a view where technical work can't have much impact on p(doom) by itself, aka "regardless of technical work we're screwed". Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
I'm only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I'm only keen on the framework conditional on meeting an extremely high bar for the specification.
Oh, my probability on level 6 or level 7 specifications becoming the default in AI is dominated by my probability that I'm somehow misunderstanding what they're supposed to be. (A level 7 spec for AGI seems impossible even in theory, e.g. because it requires solving the halting problem.)
If we ignore the misunderstanding part then I'm at << 1% probability on "we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future".
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Not going to respond to everything, sorry, but a few notes:
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
My claim is that for the things you call "actions that increase risk" that I call "opportunity cost", this causal arrow is very weak, and so you shouldn't think of it as risk compensation.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
However, I think people are too ready to fall back on the best reference classes they can find - even when they're terrible.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
I also don't really understand the argument in your spoiler box. You've listed a bunch of claims about AI, but haven't spelled out why they should make us expect large risk compensation effects, which I thought was the relevant question.
- Quantify "it isn't especially realistic" - are we talking [15% chance with great effort], or [1% chance with great effort]?
It depends hugely on the specific stronger safety measure you talk about. E.g. I'd be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I'm hesitant around such small probabilities on any social claim.
For things like GSA and ARC's work, there isn't a sufficiently precise claim for me to put a probability on.
Is [because we have a bunch of work on weak measures] not a big factor in your view? Or is [isn't especially realistic] overdetermined, with [less work on weak measures] only helping conditional on removal of other obstacles?
Not a big factor. (I guess it matters that instruction tuning and RLHF exist, but something like that was always going to happen, the question was when.)
This characterization is a little confusing to me: all of these approaches (ARC / Guaranteed Safe AI / Debate) involve identifying problems, and, if possible, solving/mitigating them.
To the extent that the problems can be solved, then the approach contributes to [building safe AI systems];
Hmm, then I don't understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper). You might think that GSA will uncover problems in debate if they exist when using it as a specification, but if anything that seems to me less likely to happen with GSA, since in a GSA approach the specification is treated as infallible.
Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.
I disagree that the AGI safety team should have 4 as its "bread and butter". The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because of (1) the empirical feedback loops it provides, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it's good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.
A couple more minor points: