I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high-quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that a system's capabilities can sometimes generalize way past the kinds of problems it was explicitly trained on. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their "beliefs", even if they are generally intelligent enough to in some way "know" the correct answer. However, addressing this issue involves solving a different and, I think, probably easier problem (ELK, or creating language models that are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.