Anthropic, DeepMind and Google Brain are all working on strategies to train language models on their own outputs. For a brief summary of the work so far:
I'd like to point out a simple failure mode of this approach: Failures of alignment and capability in the original model could be amplified by fine-tuning on its own outputs. Empirically, recent experiments on language models have found more benefit than harm in model-driven feedback. But that might not always be the case.
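To make the amplification worry concrete, here is a toy sketch, not a description of any lab's actual training method. It assumes a "model" that is just a probability distribution over two hypothetical answers, `"wrong"` and `"right"`, and a stand-in for fine-tuning that moves the distribution part of the way toward the model's own most likely answer at each step. The function name `self_train`, the learning rate `lr`, and the starting probabilities are all illustrative assumptions.

```python
def self_train(probs, steps=5, lr=0.5):
    """Repeatedly nudge a categorical distribution toward its own argmax.

    Toy stand-in for fine-tuning on hard pseudo-labels: the training
    target at each step is the model's current most likely answer.
    """
    history = [dict(probs)]
    for _ in range(steps):
        # The model's own preferred output becomes the training label.
        top = max(probs, key=probs.get)
        # Interpolate between the current distribution and a one-hot target.
        probs = {
            answer: (1 - lr) * p + lr * (1.0 if answer == top else 0.0)
            for answer, p in probs.items()
        }
        history.append(probs)
    return history


# Hypothetical starting point: the model slightly prefers the wrong answer.
initial = {"wrong": 0.6, "right": 0.4}
history = self_train(initial)
for step, dist in enumerate(history):
    print(step, round(dist["wrong"], 3))
```

In this sketch the probability of the wrong answer rises at every step, because nothing outside the model ever contradicts its initial preference. Real systems filter or critique outputs in more sophisticated ways, so this shows only the basic mechanism, not what will actually happen in practice.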
This recent work is an extension of weak supervision, a technique that dates back to at least 1963 and has been successful in applications such as image classification and protein folding. This literature has long acknowledged that self-training can amplify a model's existing shortcomings:
One particularly dangerous failure mode would be the classic deceptive alignment story, in which a model with long-term goals becomes aware of its training process and subverts it. Under a model-driven feedback approach, such a model would have more opportunities to hide misaligned behavior during training. Models used for critiques or oversight could also engage in gradient hacking, instilling their own goals in the generator model.
A better approach might keep humans at the center of the feedback process. This is slower and might be less accurate in some cases, but could potentially avoid the worst failures of model-driven feedback. A popular middle ground uses model-assisted feedback methods:
Model-driven feedback has achieved impressive results on scalable oversight, especially when compared with debate, which faces both empirical and theoretical challenges. But in the future, the old adage might hold true: garbage in, garbage out.