Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication)
That's the distinction I mean to make between (1) and (2): we need to get to the moon safely.
With (1) we have no idea when our rocket will explode.
Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely)
If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one.
Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.
The problem is robustly getting the incentive to show that the other AI is being deceptive.
Giving access to the weights, activations and tools may give debaters the capability to expose deception - but that alone gets you nothing.
You're still left saying:
So long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want.
Similarly, "...and avoid cooperation" is essentially the entire problem.
To be clear, I'm not saying that an approach of this kind will never catch any instances of an AI being deceptive. (this is one reason I'm less certain on (1))
I am saying that there's no reason to predict anything along these lines should catch all such instances.
I see no reason to think it'll scale.
Another issue: unless you have some kind of true name of deception (I see no reason to expect this exists), you'll train an AI to detect [things that fit your definition of deception], and we die to things that didn't fit your definition.
Scalable oversight: finding ways to leverage more powerful models to produce better reward signals
It might be worth clarifying how you expect this to help, and to make clear where you'd expect other researchers to disagree.
For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find, or make progress towards, an alignment solution.
2) Debate is a plausible basis for an alignment solution.
To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).
Viewing it as a question-answering system is similarly confused: it's an [output whatever text is selected by the debate process] system.
We can't have both [debaters optimise for a debate win] and [debate robustly remains a question-answering system] - at least without making obviously false assumptions about a human-based judge system.
Could Debate be a component of an alignment solution? Sure.
Is it the part that seems hard/neglected? No.
On (1) I'm less clear; however, here the case that needs to be made is that debate approaches will be more useful before they become dangerous than e.g. simulators or conditioning predictive models (which I agree will also break at some point).
This is not obviously false, but I don't see a good argument for it. If I have to bet which of these approaches has the lowest [capability before deceptive alignment] (cbda) threshold, my money is currently on debate (and indeed RRM). Imitative amplification seems plausibly safer, but only to the degree that it's less efficient - so still unclear it gets higher cbda (if distillation ends up buying efficiency, I expect it to throw out the imitative rationale for safety in the process).
To me, most of the value to a new researcher in studying debate would lie in:
And as Eliezer/Nate/John... would point out, this doesn't require getting into the details of the mechanism design - only to notice that the mechanism is doing nothing to address the fundamentals of the problem.
I'd be genuinely interested if I'm wrong on any of this - it'd be nice if debate were actually useful! (I don't claim to be making all the necessary arguments above - just pointing out my current belief)
It seems to me that the key threshold has to do with the net impact of meme replication:
Where the constraint is very limiting, all but a small proportion of memes will be selected against. The [hunting technique of lions] meme is transferred between lions, because being constrained to hunt is not costly, while having offspring observe hunting technique is beneficial.
This is still memetic transfer - just a rather uninteresting version.
Humans get to transmit a much wider variety of memes more broadly because the behavioural constraint isn't so limiting (speaking, acting, writing...), so the upside needn't meet a high bar.
The mechanism that led to hitting this threshold in the first place isn't clear to me. The runaway behaviour after the threshold is hit seems unsurprising.
Still, I think [transmission became much cheaper] is more significant than [transmission became more beneficial].
The generation dies, and all of the accumulated products of within lifetime learning are lost.
This seems obviously false. For animals that do significant learning throughout their lifetimes, it's standard for exactly the same "providing higher quality 'training data' for the next generation..." mechanism to be used.
The distinction isn't in whether learned knowledge is passed on from generation to generation. It's in whether the n+1th generation tends to pass on more useful information than the nth generation. When we see this not happening, it's because the channel is already saturated in that context.
I guess that this is mostly a consequence of our increased ability to represent useful behaviour in ways other than performing the behaviour: I can tell you a story about fighting lions without having to fight lions. (that and generality of intelligence and living in groups).
(presumably unimportant for your overall argument - though I've not thought about this :))
Yeah, me neither - mainly it just clarified the point, and is the first alternative I've thought of that seems not-too-bad. It still bothers me that it could be taken as short for "malicious/malign/malevolent generalization".
Sure, I don't think it's entirely wrong to have started using the word this way (something akin to "misbehave" rather than "misfire").
However, when I take a step back and ask "Is using it this way net positive in promoting clear understanding and communication?", I conclude that it's unhelpful.
I think you're correct, but I find "misgeneralization" an unhelpful word to use for "behaved in a way that made the programmer unhappy". It suggests too strong an idea of some natural correct generalization. This seems needlessly likely to lead to muddled thinking (and miscommunication).
I guess I'd prefer "malgeneralization": it's not incorrect, but rather just an outcome I didn't like.
Booking predictions seems great, but did you also write things like:
If my assumption x holds, I expect ...
If my assumption x does not hold, I expect ...
This seems important to me if you're aiming to update your models in a somewhat bias-resistant fashion. (though I realize that it may get quite complicated)
Only knowing that you made a particular incorrect prediction allows quite a bit of freedom to explain the error away while holding on to the assumptions you're most attached to.
But I'm very happy you're doing this at all! (including the encouraging-others-to-predict part)
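To make the suggestion concrete, here's a rough sketch of what booking assumption-conditioned predictions might look like. Everything in it is hypothetical (the names `ConditionalPrediction`, `PredictionLog`, `book` and `assumptions_undermined_by` are mine, and outcomes are matched by exact string, which is far cruder than anything you'd really want). It's only meant to illustrate the shape of the record, not to be a recommended tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConditionalPrediction:
    # Hypothetical record: one assumption plus what you expect in each case.
    assumption: str
    if_holds: str   # "If my assumption x holds, I expect ..."
    if_fails: str   # "If my assumption x does not hold, I expect ..."


@dataclass
class PredictionLog:
    entries: List[ConditionalPrediction] = field(default_factory=list)

    def book(self, assumption: str, if_holds: str, if_fails: str) -> ConditionalPrediction:
        """Record both branches up front, before the outcome is known."""
        entry = ConditionalPrediction(assumption, if_holds, if_fails)
        self.entries.append(entry)
        return entry

    def assumptions_undermined_by(self, observed: str) -> List[str]:
        """Assumptions whose 'does not hold' branch matches the observation.

        Exact string matching is an illustrative simplification.
        """
        return [e.assumption for e in self.entries if e.if_fails == observed]


log = PredictionLog()
log.book(
    assumption="x",
    if_holds="outcome A",
    if_fails="outcome B",
)
print(log.assumptions_undermined_by("outcome B"))
```

The point of writing both branches in advance is that an incorrect prediction then points at a specific assumption, rather than leaving you free to pick which one to blame afterwards.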
Agreed.
Are you aware of any work that attempts to answer this question?
Does this work look like work on debate?
(not rhetorical questions!)
My guess is that work likely to address this does not look like work on debate.
Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break.
The world won't be short of debate schemes.
It'll be short of principled arguments for their safe application.