Tomek Korbak

Aligning language models at Anthropic

Wiki Contributions


Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

We did, LMs tends to generate toxic text when conditioned on <|bad|>. Though we tended to have a risk-aversive thresholds, i.e. we used <|good|> for only about 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigate difference in capabilities between<|good|> and <|bad|> modes, I'd love to see that.

Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.

Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also have distribution collapse (single best action for a given state). Arguably, that's what you want in RL.

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of a some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?