Yep, happy to chat!
Yep. For empirical work I'm in favor of experiments with more informed + well-trained human judges who engage deeply etc., and of having a high standard for efficacy (e.g. "did it get the correct answer with very high reliability") rather than "did it outperform a baseline by a statistically significant margin", where you then end up needing a high n and therefore each example needs to be cheap / shallow.
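To illustrate the "high n" point, here's a back-of-envelope power calculation; the 60% → 65% numbers are made up for illustration, not from any real experiment:

```python
# Rough sample-size estimate for a two-proportion z-test (illustrative numbers only):
# detecting a 60% -> 65% improvement over baseline at alpha = 0.05 and 80% power.
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_arm(0.60, 0.65))  # ~1468 judged examples per arm -- hence each one has to be cheap/shallow
```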
IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree.
iiuc, the stability definition used here is pretty strong - it says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is ruled out.
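To spell out why conjunctive structures clash with that kind of condition (my notation and back-of-envelope version, not the paper's):

```latex
% My notation, not the paper's: \epsilon(v) is the error at node v.
\begin{align*}
  \text{stability (as I read it):}\quad
    & \epsilon(\mathrm{parent}) \le \max_i \epsilon(\mathrm{child}_i) \\
  \text{conjunction of $n$ steps with independent errors $\epsilon$:}\quad
    & \epsilon(\mathrm{parent}) \approx 1 - (1 - \epsilon)^n \le n\epsilon
\end{align*}
% For n > 1 the conjunctive parent error exceeds max_i epsilon(child_i) = epsilon,
% i.e. errors accumulate and the stability condition fails.
```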
I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment.
However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.)
I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.
I think there are broadly two classes of hope about obfuscated arguments:
(1.) In practice, obfuscated argument problems rarely come up, due to one of:
(2.) We can create a protocol that distinguishes between cases where:
This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct.
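To make "quickly determine whether some proposed lemma is correct" concrete: the standard Miller-Rabin test is the kind of fast, high-level check I have in mind for lemmas of the form "N is prime". This is just the textbook algorithm, not anything from the paper:

```python
# Textbook Miller-Rabin primality test -- included only as an example of a fast
# probabilistic check a debater could apply to a proposed "N is prime" lemma,
# rather than expanding a huge argument tree.
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    # Error probability <= 4**-rounds: it shrinks with more rounds instead of accumulating.
    return True
```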
This wouldn't get us to full ELK (bc maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.
This is a great post, very happy it exists :)
Quick rambling thoughts:
I have some instinct that a promising direction might be showing that it's only possible to construct obfuscated arguments under particular circumstances, and that we can identify those circumstances. The structure of an obfuscated argument is quite constrained - it needs to spread out the error probability over many different leaves. This happens to be easy in the prime case, but it seems plausible that it's often knowably hard. An interesting case to explore might be trying to construct an obfuscated argument for primality testing and seeing if there's a reason that's difficult. OTOH, as you point out, "I learnt this from many relevant examples in the training data" seems like a central sort of argument. Though even when I think of some of the worst offenders here (e.g. translating after learning on a monolingual corpus), it does seem like constructing a lie that isn't going to contradict itself fairly quickly might be pretty hard.
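A back-of-envelope version of the "spread out the error" constraint, in my framing rather than the original post's: suppose the argument has N leaves, the debater knows it's flawed somewhere, but the flaw's location is uniform over the leaves from everyone's perspective.

```latex
% My framing, not the post's: N leaves, one flaw, location uniform over leaves.
\begin{align*}
  \Pr[\text{a given leaf is the flawed one}] &= \tfrac{1}{N} \\
  \Pr[\text{the flaw survives } k \text{ independent leaf checks}] &\approx \left(1 - \tfrac{1}{N}\right)^{k}
\end{align*}
% So catching the flaw reliably takes on the order of N checks, and the
% obfuscator needs N large enough that 1/N sits below the judge's per-leaf
% error tolerance -- which is the constraint that might be knowably hard
% to satisfy in some domains.
```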
I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
If anyone wants to work on this or knows people who might, I'd be interested in funding work on this (or helping secure funding - I expect that to be pretty easy to do).
Good point!
Hmm I think it's fine to use OpenAI/Anthropic APIs for now. If it becomes an issue we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.
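For concreteness, the kind of thin wrapper I have in mind (all names here are hypothetical, not an existing codebase): tasks talk to one interface, so swapping the hosted APIs for a self-hosted Llama server later is a small, local change.

```python
# Hypothetical sketch -- names are made up, provider calls are stubbed out.
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class HostedAPIModel:
    """Placeholder for an OpenAI/Anthropic-backed client."""
    model_name: str = "gpt-4"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the provider's API here")


@dataclass
class LocalLlamaModel:
    """Placeholder for a self-hosted model behind an HTTP endpoint."""
    endpoint: str = "http://localhost:8000"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the local server here")


def run_task(task_prompt: str, model: ChatModel) -> str:
    # Tasks depend only on the ChatModel interface, not on any provider,
    # so switching models is a one-line change at the call site.
    return model.complete(task_prompt)
```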
Yep, that's right. And we also need it to be possible to check the solutions without having to run physical experiments etc.!
I can write more later, but here's a relevant doc I wrote as part of a discussion with Geoffrey + others. Maybe the key point from there is that I don't think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking of this as a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments - not claiming that you can't get soundness by just ignoring any arguments that are plausibly obfuscated).