Hi, thanks for the writeup! I might be completely out of my league here, but could we not, before measuring alignement, take one step backwards and measure the capability of the system to misalign?
Say for instance in conversation, I give my model an information to hide. I know it "intends" to hide it. Basically I'm a cop and I know they're the culprit. Interogating them, it might be feasible enough for a human to mark out specific instances of deflection (changing the subject), transformation(altering facts) or fabrication(coming up with outright falsehoods) in conversation. And to give an overall score to the model as to their capacity for deception or even their dangerosity in conversation.
I feel (but again, might be very wrong here) that measuring the capacity for a model to misalign if incentivised to do so might be an easier first step prior to spotting specific instances of misalignment. Might it provide enough data to then train on it later on? Or am I misunderstanding the problem entirely or missing some key litterature on the matter?