Well, the fact that I don't have an answer ready is itself a significant part of an answer to my question, isn't it? A friend on an alignment chat said something to the effect of: "This eval seems super shallow, only checking if the model is, on its own, trying to destroy the world. Seems rather shallow and uncreative - it barely touched on any of the jailbreaks or ways to pressure or trick the model into misbehaving." And so I figured I'd come here and ask about it.

I've skimmed over the beginning of your paper, and I think there might be several problems with it.

  1. I don't see where it is explicitly stated, but I think the information "seller's prediction is accurate with probability 0.75" is supposed to be common knowledge. Is it even possible for a non-trivial probabilistic prediction to be common knowledge? Not in the sense of some real-life situation, but in the sense of this condition not being a logical contradiction? I am not a specialist on this subject, but it looks like a logical contradiction. And you can prove absolutel
