@Adam I'm interested in whether you have the same criticism of the language in the paper (in appendix E)?
(I mostly wrote it, and am interested in whether it sounds like it's ascribing agency too much)
You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you discuss alignment research that can be done with current models.
I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.
"Even if actively trying to push the field forward full-time I'd be a small part of that effort"
I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.
I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community.
We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there.
Does that make sense?
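To make the division of labour concrete, here's a toy sketch (all names and the tiny explicit hypothesis class are mine, standing in for learned models): the human supplies a prior over hypotheses and a per-datapoint likelihood, and the ML side does the computationally heavy part of updating on all of D.

```python
import math

# Toy version of 'learning the prior': the human supplies a prior over
# hypotheses and a per-datapoint likelihood; the machine does the heavy
# lifting of updating on the whole dataset D.
# (Hypothetical example: hypotheses are coin biases, D is a list of flips.)

hypotheses = [0.3, 0.5, 0.7]             # candidate values of P(heads)
human_log_prior = {0.3: math.log(0.2),   # human thinks a fair coin is most likely
                   0.5: math.log(0.6),
                   0.7: math.log(0.2)}

def human_log_likelihood(flip, h):
    """Human-endorsed log-likelihood of one datapoint under hypothesis h."""
    return math.log(h if flip == "H" else 1 - h)

def approx_human_posterior(D):
    """What the human posterior *would* be if they could update on all of D."""
    log_post = {h: human_log_prior[h] + sum(human_log_likelihood(d, h) for d in D)
                for h in hypotheses}
    norm = math.log(sum(math.exp(v) for v in log_post.values()))
    return {h: math.exp(v - norm) for h, v in log_post.items()}

D = ["H", "H", "T", "H", "H", "H"]
posterior = approx_human_posterior(D)
```

In the real proposal the prior and likelihood are learned models rather than lookup tables, and D is far too large for the human to read; the point is just that the human-supplied pieces and the machine-computed sum factor apart cleanly.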
Starting with amplification as a baseline: am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?
I think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the answer because of updating on a big dataset, you can either keep going through the big dataset when any question of this type comes up (very slow, so capability limitation), or not trust these answers (capability limitation), or just hope they're correct (safety problem).
Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (Ie., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)
The latter :)
I think the only way to get debate to be able to answer all the questions that debate+IG can answer is to include subtrees that are the size of your whole training dataset at arbitrary points in your debate tree, which I think counts as a ridiculous amount of compute.
That is a concern, but only in the case where there's no answer that has an argument tree that bottoms out in depth<D. As long as there exists an answer that is supported by a depth<D tree, this answer will beat the answers only supported by depth>D argument trees.
So there is a case where the debaters are not incentivised to be honest; the case where the debaters know something but there's no human-understandable argument for it that bottoms out in <D steps. This is where we get the PSPACE constraint.
If we include discussion of cross-examination (which the analysis there did not include), then we can get rid of this constraint: each debater commits to an argument tree, then each debater points out the weakest node in the tree (or points out that some part of the tree doesn't bottom out).
(we can only handle really large trees if we assume debaters are computationally unbounded in general though. If we don't assume this, even if we still assume they have oracles for some specific problems, we still probably can't supervise anything that's not in NP, because of the obfuscated argument problem)
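The 'bottoms out in depth<D' condition in this exchange can be written down as a small recursive check. This is my toy encoding, not anything from the paper: an answer counts as supported only if every branch of its argument tree reaches a human-checkable leaf within the depth bound.

```python
# Toy sketch of the depth-D constraint (encoding is hypothetical): an answer
# is trustworthy here only if its argument tree bottoms out, within depth D,
# in leaves the human judge can check directly.

def supported_within(node, depth_limit, human_can_check):
    """True iff every branch of this argument tree reaches a
    human-checkable claim in at most depth_limit steps."""
    if human_can_check(node["claim"]):
        return True
    if depth_limit == 0:
        return False                      # tree doesn't bottom out in time
    children = node.get("subclaims", [])
    if not children:
        return False                      # a leaf the human can't check
    return all(supported_within(c, depth_limit - 1, human_can_check)
               for c in children)
```

On this picture, a dishonest debater's committed tree fails in one of two visible ways: some branch exceeds the depth bound, or some leaf isn't human-checkable, and cross-examination lets the opposing debater point the judge at exactly that node.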
I don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. If the judge assumes one player is honest, then if they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer / the answer is not challenged by the opposing debater) then they can trust that answer.
I was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you.
Yeah, I think the infinite tree case should work just the same: an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree, and draw with an answer with no support).
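That rule can be stated directly; here's a minimal sketch (the three-way encoding of support types is mine): infinite support simply counts for nothing, so it loses to finite support and draws with no support.

```python
# Toy sketch of the judging rule above (hypothetical encoding): an answer's
# support is 'finite' (its tree bottoms out), 'infinite' (support that never
# bottoms out), or 'none'. Infinite support behaves like no support.

def judge(support_a, support_b):
    """Return 'A', 'B', or 'draw' given the two answers' support types."""
    effective = lambda s: s == "finite"   # only finite support counts
    a, b = effective(support_a), effective(support_b)
    if a and not b:
        return "A"
    if b and not a:
        return "B"
    return "draw"
```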
It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they're able to find it?
Not systematically; I would be excited about people doing these experiments. One tricky thing is that you might think this is a strategy that's possible for ML models, but one that humans aren't naturally very good at.
If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of arguments for a false claim that is flawed such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)
Yeah, this is a great summary. One thing I would clarify is that it's sufficient that the set of arguments doesn't look suspicious to the judge. The arguments might look suspicious to the expert, but unless they have a way to explain to the judge why it's suspicious, we still have a problem.
If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.
Yeah, I think that is the obvious next step. The concern is that the reasons the argument is suspicious may be hard to justify in a debate, especially if they're reasons of the form 'look, I've done a bunch of physics problems, and approaching it this way feels like it will make things messy, whereas approaching it this way feels cleaner'. Debate probably doesn't work very well for supervising knowledge that's gained through finding patterns in data, as opposed to knowledge that's gained through step-by-step reasoning. Something like imitative generalisation (AKA 'learning the prior') is trying to fill this gap.