AI ALIGNMENT FORUM

Tomek Korbak

Senior Research Scientist at UK AISI working on AI control

https://tomekkorbak.com/

Comments
Pretraining Language Models with Human Preferences
Tomek Korbak · 2y · 20 points

Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use a risk-averse threshold, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
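For concreteness, here is a minimal sketch of how such conditional-training tags could be assigned, assuming a per-sentence safety score and a 5% cutoff (the scorer, helper names, and threshold logic are illustrative, not the paper's exact pipeline):

```python
# Sketch: tag pretraining sentences with <|good|> / <|bad|> control tokens
# based on a safety (e.g. non-toxicity) score. Illustrative only.
import numpy as np

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_sentences(sentences, scores, good_fraction=0.05):
    """Prepend <|good|> to roughly the `good_fraction` highest-scoring
    (safest) sentences and <|bad|> to the rest."""
    threshold = np.quantile(scores, 1.0 - good_fraction)
    return [
        (GOOD if score >= threshold else BAD) + " " + sentence
        for sentence, score in zip(sentences, scores)
    ]

# At inference time, prompting with <|good|> asks the model for text from the
# "safe" part of the training distribution; prompting with <|bad|> does the opposite.
```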

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I'd love to see that.

Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.

Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
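A minimal sketch of what that could look like, assuming a HuggingFace GPT-2-style model that already has a <|bad|> control token (the hook point and attribute names are assumptions for illustration, not a tested recipe):

```python
# Sketch: project out the <|bad|> embedding direction from hidden states
# using a forward hook. Model/tokenizer handles and hook point are assumed.
import torch

def make_ablation_hook(bad_direction: torch.Tensor):
    """Return a forward hook that removes the component of the hidden states
    lying along the <|bad|> embedding direction."""
    v = bad_direction / bad_direction.norm()        # unit vector, shape (hidden_dim,)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ v                          # (batch, seq) projection coefficients
        hidden = hidden - coeff.unsqueeze(-1) * v   # subtract the component along v
        return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden

    return hook

# Usage (assuming a GPT-2-style causal LM with blocks at model.transformer.h):
# bad_id = tokenizer.convert_tokens_to_ids("<|bad|>")
# bad_dir = model.get_input_embeddings().weight[bad_id].detach()
# for block in model.transformer.h:
#     block.register_forward_hook(make_ablation_hook(bad_dir))
```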

RL with KL penalties is better seen as Bayesian inference
Tomek Korbak · 3y · 10 points

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is a conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also get distribution collapse (a single best action for a given state). Arguably, that's what you want in RL.
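To spell that out (notation mine, with beta the KL coefficient and pi_0 the pretrained LM; the conditional line is the single-step view, treating each state independently):

```latex
% KL-regularised objective over full sequences x, and its optimum:
J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\Vert\, \pi_0),
\qquad
\pi^{*}(x) \propto \pi_0(x)\, \exp\!\big(r(x)/\beta\big).

% Interactive case, one distribution per state s (single-step view):
\pi^{*}(a \mid s) \propto \pi_0(a \mid s)\, \exp\!\big(r(s,a)/\beta\big),
% which still concentrates on the per-state argmax as \beta \to 0.
```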

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world, or a model of some probability distribution over texts? If you want a decision-maker and training on language is just scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner? Maybe in this case staying close to human knowledge and the inference rules represented in language is of inherent value?

Posts

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · 18 points · 3mo · 0 comments
A sketch of an AI control safety case · 33 points · 5mo · 0 comments
Eliciting bad contexts · 21 points · 5mo · 3 comments
Automation collapse · 38 points · 9mo · 7 comments
Compositional preference models for aligning LMs · 6 points · 2y · 0 comments
Towards Understanding Sycophancy in Language Models · 39 points · 2y · 0 comments
Paper: LLMs trained on “A is B” fail to learn “B is A” · 56 points · 2y · 0 comments
Paper: On measuring situational awareness in LLMs · 44 points · 2y · 13 comments
Imitation Learning from Language Feedback · 36 points · 2y · 1 comment
Pretraining Language Models with Human Preferences · 54 points · 2y · 4 comments