Hi, thanks for the writeup! I might be completely out of my league here, but could we not, before measuring alignment, take a step back and measure the capability of the system to misalign?
Say, for instance, that in conversation I give my model a piece of information to hide. I know it "intends" to hide it. Basically, I'm a cop and I know they're the culprit. Interrogating them, it might be feasible enough for a human to mark out specific instances of deflection (changing the subject), transformation (altering facts), or fabrication (coming up with outright falsehoods)...
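To make the idea concrete, here's a minimal sketch of that annotation scheme, assuming a human (or a classifier) has already tagged each answer in the interrogation transcript with one of the three hypothetical tactic labels; the function and labels are my own invention, not anything from the post:

```python
# Tally how often each deception tactic shows up in an
# annotated interrogation transcript. The tactic names are
# the three categories proposed above, plus "none".
from collections import Counter

TACTICS = {"deflection", "transformation", "fabrication", "none"}

def tally_tactics(annotated_turns):
    """annotated_turns: list of (answer_text, tactic_label) pairs."""
    counts = Counter()
    for _, tactic in annotated_turns:
        if tactic not in TACTICS:
            raise ValueError(f"unknown tactic: {tactic}")
        counts[tactic] += 1
    return counts

# Toy example of a labeled transcript.
transcript = [
    ("Let's talk about something else.", "deflection"),
    ("I was at the office all evening.", "transformation"),
    ("A man in a red coat did it.", "fabrication"),
    ("I don't know.", "none"),
]
print(tally_tactics(transcript))
```

The per-tactic counts could then serve as a crude capability-to-misalign score for the model under interrogation.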