All of Robert_AIZI's Comments + Replies

Very cool! Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode? It seems like the point of the <|good|> and <|bad|> tokens is to make it easy to call up good/bad capabilities, but we don't want the good/evil switch to ever get set to the "be evil" side.

I see three mitigations to this line of thinking:

  • Since <|bad|> also includes glitchy code, etc., maybe the AI is less capable in bad mode, and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad
... (read more)
2 Tomek Korbak 10mo
We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use risk-averse thresholds, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time. That's a good point. We haven't systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I'd love to see that. Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
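One way to read "block the entire direction in activation space" is as projecting that direction out of every activation vector. The snippet below is only a minimal sketch of that reading, not the authors' method: `ablate_direction` is a hypothetical helper, and the random arrays are stand-ins for the <|bad|> token embedding and a batch of hidden states.

```python
import numpy as np


def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each row of `activations` along `direction`."""
    # Normalise so the projection coefficient is a plain dot product.
    unit = direction / np.linalg.norm(direction)
    # One coefficient per activation vector (row).
    coeffs = activations @ unit
    # Subtract each row's component along the blocked direction.
    return activations - np.outer(coeffs, unit)


rng = np.random.default_rng(0)
# Hypothetical stand-in for the <|bad|> token embedding (assumption, not real model weights).
bad_embedding = rng.normal(size=16)
# Hypothetical batch of 4 hidden-state vectors.
hidden = rng.normal(size=(4, 16))

cleaned = ablate_direction(hidden, bad_embedding)
```

After this projection, every row of `cleaned` is orthogonal to `bad_embedding`. Whether that is enough to keep a real model out of "bad mode" is an open question that this sketch does not address.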

Can you clarify what you mean by this, especially (i)?

In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.

In particular, do you mean that f(x) = 1 holds for all inputs x, or just for some particular x?

1 David Scott Krueger 10mo
Yeah this was super unclear to me; I think it's worth updating the OP.
1 Ramana Kumar 1y
It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.