Do you think these insights would generalise to the case where the language model interacts with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or engages in dialogue with a human, then the optimal policy is no longer equivalent to just producing the correct output distribution, as it now also depends on environment observations.
That's a good point, and it helps to draw a distinction between generative models and policies. In the interactive case, your policy π(a|s) is a conditional distribution over actions given states, where the state includes the observations received from the environment so far.
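To make the contrast concrete, here is a minimal sketch of such an interaction loop. Everything here (`ToyPolicy`, `ToySearchEnv`, their dynamics) is made up for illustration, not taken from the discussion; the point is only that the state grows with observations, so the optimal policy depends on environment dynamics rather than on matching a single fixed output distribution:

```python
class ToySearchEnv:
    """Toy environment: the agent issues queries, the env returns observations.
    Purely illustrative; names and dynamics are invented for this sketch."""
    def reset(self):
        return "Q: capital of France?"

    def step(self, action):
        if "search" in action:
            return " OBS: Paris is the capital of France.", False
        return "", True  # a final answer ends the episode


class ToyPolicy:
    """Stands in for an LM decoding conditioned on the dialogue so far."""
    def sample(self, state):
        # Issue a search query first; answer once an observation has arrived.
        return " ACT: answer Paris" if "OBS" in state else " ACT: search capital France"


def rollout(policy, env, max_steps=10):
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy.sample(state)         # pi(a | s): action given state
        observation, done = env.step(action)  # the env responds to the action
        trajectory.append((state, action, observation))
        state += action + observation         # next state includes the observation
        if done:
            break
    return trajectory


for step in rollout(ToyPolicy(), ToySearchEnv()):
    print(step)
```

Unlike sampling from p(y|x) once, the second action here is only optimal because of what came back from the environment after the first one.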
We did; LMs tend to generate toxic text when conditioned on `<|bad|>`. Though we tended to use risk-averse thresholds, i.e. we used `<|good|>` for only the ~5% safest sentences and `<|bad|>` for the remaining 95%. So `<|bad|>` is not bad all the time.

That's a good point. We haven't systematically investigated differences in capabilities between `<|good|>`- and `<|bad|>`-conditioned models.
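For concreteness, here is a minimal sketch of the kind of thresholded tagging described above. The 5%/95% split matches the numbers mentioned, but `toxicity_score` and all other names are hypothetical stand-ins, not the actual pipeline:

```python
import numpy as np

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_corpus(sentences, toxicity_score, good_quantile=0.05):
    """Prefix each sentence with a control token. Only the safest
    `good_quantile` fraction gets <|good|>; everything else gets <|bad|>,
    which is why <|bad|>-tagged text is not uniformly bad."""
    scores = np.array([toxicity_score(s) for s in sentences])
    cutoff = np.quantile(scores, good_quantile)  # risk-averse threshold
    return [
        (GOOD if score <= cutoff else BAD) + " " + sentence
        for sentence, score in zip(sentences, scores)
    ]

# Toy usage with a stand-in scorer (higher = more toxic):
corpus = ["hello there", "you absolute fool", "nice weather today", "I hate this"]
fake_scorer = lambda s: sum(w in s for w in ("fool", "hate")) / max(len(s.split()), 1)
print(tag_corpus(corpus, fake_scorer))
```

Because the cutoff sits at the 5th percentile of the score distribution, most merely average sentences land in the `<|bad|>` bucket, which is the risk-averse behaviour described above.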