Pretraining Language Models with Human Preferences

Sam Bowman; Ethan Perez

Since <|bad|> also includes glitchy code etc, maybe the AI is less capable in bad mode, and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad|>.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
[Edit: A third option would be to poison-pill bad mode in training, for instance by making 50% of <|bad|> mode data random noise. Ideally this would leave <|good|> mode unaffected and make <|bad|> mode useless from a capabilities perspective.]

[-]Tomek Korbak3y20

Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigate difference in capabilities between<|good|> and <|bad|> modes, I'd love to see that.

Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.

Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token

[-]Logan Riggs3y10

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

[-]Evan R. Murphy3y*20

Bravo, I've been wondering if this was possible for awhile now - since RLHF came into common use and there have been more concerns around it. Your results seem encouraging!

PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you've honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

54

Pretraining Language Models with Human Preferences

54

Summary of the paper

LM pretraining can be aligned to reward models surprisingly well

Conditional training is a competitive alignment technique

Scaling laws for alignment

PHF improves robustness to red-teaming

Pretraining with human feedback is performance-competitive

PHF is more effective than finetuning with feedback