Can you clarify what you mean by this, especially (i)?
In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.
In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?
Very cool! Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode? It seems like the point of the <|good|> and <|bad|> tokens is to make it easy to call up good/bad capabilities, but we don't want the good/evil switch to ever get set to the "be evil" side.
twothree mitigations to this line of thinking: