All of Tomek Korbak's Comments + Replies

Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

We did; LMs tend to generate toxic text when conditioned on <|bad|>. That said, we tended to use risk-averse thresholds, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
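The thresholding rule described here can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `tag_sentences`, the `good_fraction` parameter, and the convention that a lower score means a safer sentence are all assumptions made for the example.

```python
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_sentences(sentences, scores, good_fraction=0.05):
    """Prefix each sentence with a control token.

    Assumption: `scores` are per-sentence toxicity scores where lower
    means safer. Only the `good_fraction` safest sentences get <|good|>;
    everything else gets <|bad|>.
    """
    n_good = round(len(sentences) * good_fraction)
    # Indices of sentences sorted from safest to least safe.
    order = sorted(range(len(sentences)), key=lambda i: scores[i])
    good_idx = set(order[:n_good])
    return [
        (GOOD if i in good_idx else BAD) + text
        for i, text in enumerate(sentences)
    ]

# Hypothetical usage: 100 sentences with scores 0..99.
sentences = [f"s{i}" for i in range(100)]
scores = list(range(100))
tagged = tag_sentences(sentences, scores)
```

With a 5% threshold, most sentences tagged <|bad|> are merely "not among the safest", which is why <|bad|> text is not toxic all the time.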

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigated differences in capabilities between <|goo...

1 Logan Riggs Smith 9mo
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point, and it helps to draw a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is a conditional distr...

2 Robert Kirk 1y
I feel like it wasn't exactly made clear in the post/paper that this was the motivation. You state that distributional collapse is bad without really justifying it (in my reading). Clearly, distributional collapse to a degenerate bad output is bad, and it will often stall learning, so it is bad from an optimisation perspective as well (as it makes exploration much less likely). But this seems different from distributional collapse to the optimal output. For example, when you say

I think I just disagree, in the case where we're considering LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they're useful for alignment research. If there's a normative claim that diversity is important, then you should just have that in your objective/algorithm.

I think the reason KL divergence is included in RLHF is as an optimisation hack to make sure it does well. Maybe that has revealed that for alignment we actually wanted the Bayesian posterior distribution you describe rather than just the optimal distribution according to the reward function (which is a hardmax rather than a softmax on the reward over trajectories), although whether that was our preference all along, or it's just useful in the current regime, seems to be an empirical question.
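The hardmax-versus-softmax contrast in the comment above can be made concrete on a toy example. This is a minimal sketch, assuming the standard KL-regularised objective E[r(x)] - beta * KL(pi || pi_0), whose optimum is pi*(x) proportional to pi_0(x) * exp(r(x) / beta). The three "trajectories", the prior, the rewards, and all function names are made up for illustration.

```python
import math

def kl_regularised_optimum(prior, rewards, beta):
    """Softmax-style target: pi*(x) proportional to prior(x) * exp(r(x) / beta).

    Assumes every prior probability is strictly positive.
    """
    logits = [math.log(p) + r / beta for p, r in zip(prior, rewards)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def hardmax(rewards):
    """Reward-optimal target: all mass on the highest-reward trajectory."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]

# Hypothetical prior over three trajectories and their rewards.
prior = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 3.0]

soft = kl_regularised_optimum(prior, rewards, beta=1.0)    # keeps diversity
sharp = kl_regularised_optimum(prior, rewards, beta=0.01)  # nearly collapsed
hard = hardmax(rewards)                                    # fully collapsed
```

At beta = 1.0 every trajectory keeps non-trivial probability, while as beta shrinks towards zero the same formula concentrates on the highest-reward trajectory and approaches the hardmax. In this framing, the disagreement is about which of the two targets is actually wanted.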