Akbir Khan


Hey, this is cool work. I think I disagree with a couple of these points, coming from a Scalable Oversight background rather than an AI Control one, and with different weak/strong gaps in mind.

My main contention is with the framing that you can talk about consultants producing arguments with convincingness scores $c_i$ and $c_c$ that are independent of the protocol and judge you are using. I try to break this down into two claims. I make my arguments in reverse order, but in particular, I contest the claims:

1) “Open consultancy is as robust against scheming as regular consultancy”

2) “Open consultancy is sometimes better than debate”

In particular, I think the confusion stems from this idea from debate:

The idea is that when it is easier to argue for something true than something false, then the training process will force the scheming model to argue for the truth.

This statement may be true in debate but has no grounding in consultancy. Judges are non-experts and are going to have a really hard time distinguishing between truth and lies in consultancy. I can generate plausible arguments both for why sqrt(3) is rational and for why it is irrational (so a judge under consultancy would say either is okay), whilst in debate it's the comparative analysis doing the work (and the judge would only choose one). This, again, is going to be a larger problem in Scalable Oversight, where the expertise gap is the issue, than in AI Control, where the gap comes from scheming and the audit budget.

Convincingness by protocol (arguments are equivalent in different protocols)

Another way to say this is that the same argument, whether it comes from a correct consultant or a correct debater, has a different persuasiveness ($c_c$ or $d_c$) depending on the protocol you are in.

Given this, the claim that $c_c > c_i$ for consultancy falls apart (I have no reason to believe it is true). The other way to put this is probably that $d_i + d_c = 1$ for debate, whereas $c_i + c_c$ has no such constraint.
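A minimal sketch of this point, under a judge model that is purely my own assumption (it is not from the post): give each argument a raw plausibility score, let a debate judge pick exactly one of the two arguments via a softmax, and let a consultancy judge accept or reject each argument on its own via a sigmoid. The function names and the example scores are hypothetical.

```python
import math


def debate_win_probs(score_correct: float, score_incorrect: float) -> tuple[float, float]:
    """Assumed debate judge: sees both arguments and must pick one (softmax)."""
    m = max(score_correct, score_incorrect)  # subtract the max for numerical stability
    e_c = math.exp(score_correct - m)
    e_i = math.exp(score_incorrect - m)
    total = e_c + e_i
    return e_c / total, e_i / total


def consultancy_accept_probs(score_correct: float, score_incorrect: float) -> tuple[float, float]:
    """Assumed consultancy judge: sees one argument at a time and accepts it
    independently of the other (sigmoid of the raw score)."""
    def sigmoid(s: float) -> float:
        return 1.0 / (1.0 + math.exp(-s))
    return sigmoid(score_correct), sigmoid(score_incorrect)


# Hypothetical raw plausibility scores for the same pair of arguments.
d_c, d_i = debate_win_probs(2.0, 1.5)
c_c, c_i = consultancy_accept_probs(2.0, 1.5)

print(f"debate:      d_c={d_c:.3f}, d_i={d_i:.3f}, sum={d_c + d_i:.3f}")
print(f"consultancy: c_c={c_c:.3f}, c_i={c_i:.3f}, sum={c_c + c_i:.3f}")
```

In this toy, the debate probabilities sum to 1 by construction, while the consultancy judge finds both arguments more likely than not to be acceptable, which is the sqrt(3) situation described above. Nothing here says which judge model is realistic; it only illustrates that the same arguments can have different convincingness under different protocols.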

Now, moving to Toy Model 1: I appreciate you highlighting this as an unrealistic assumption:

Putting the correct justification and incorrect justification next to each other doesn’t change the convincingness of either side

But in either situation, I don't think it is fair to compare to debate anymore, because the entire point of debate is the adversarial and comparative arguments. Lots of the juice is going to come from the interactions between two experts, which will change the convincingness of an output. Again, this is more of an issue in Scalable Oversight, because here we will ask questions where we have to exploit the Generator-Critique gap and not the Generator-Discriminator gap.

Convincingness of the Truth (true arguments convince better)

If there are no guarantees that $c_c > c_i$ in consultancy, then I think the analysis of open consultancy > consultancy needs to be reconsidered. The $c_c > c_i$ regime is not where we have any issues with open consultancy or regular consultancy; if anything, here is where this stuff is all gonna work fine! It is, instead, in the regime where $c_i > c_c$ that using regular consultancy certainly trumps open consultancy, because the schemer takes every chance they have to sandbag you.

Anyway, just my two cents. I think using open-consultancy is a better baseline for Control. In general, I urge folks to consider the persuasiveness or convincingness of arguments relative to the judge and the protocol being used. Hope that makes sense.

Hey this is super exciting work, I'm a huge fan of the clarification over the protocol and introduction of cross-examination!

Will you be able to open-source the dataset at any point? In particular, the questions, human arguments and then counter-claims. It would be very useful for further work.