Links: Blog, Paper.


Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.


I'm confused by the analogy between this experiment and aligning a superintelligent model.

I can imagine someone seeing the RLHF result and saying, "oh, that's great news for alignment! If we train a superintelligent model on our preferences, it will just imitate our preferences as-is, rather than treating them as a flawed approximation of some other, 'correct' set of preferences and then imitating those instead."

But the paper's interpretation is the opposite of this.  From the paper's perspective, it's bad if the student (analogized to a superintelligence) simply imitates the preferences of the teacher (analogized to us), as opposed to imitating some other set of "correct" preferences which differ from what the student explicitly expressed.

Now, of course, there is a case where it makes sense to want this out of a superintelligence, and it's a case that the paper talks about to motivate the experiment: the case where we don't understand what the superintelligence is doing, and so we can't confidently express preferences about its actions.

That is, although we may basically know what we want at a coarse outcome level -- "do a good job, don't hurt anyone, maximize human flourishing," that sort of thing -- we can't translate this into preferences about the lower-level behaviors of the AI, because we don't have a good mental model of how the lower-level behaviors cause higher-level outcomes.

From our perspective, the options for lower-level behavior all look like "should it do Incomprehensibly Esoteric Thing A or Incomprehensibly Esoteric Thing B?"  If asked to submit a preference annotation for this, we'd shrug and say "uhh, whichever one maximizes human flourishing??" and then press button A or button B effectively at random.

But in this case, trying to align the AI by expressing preferences about low-level actions seems like an obviously bad idea, to the point that I wouldn't expect anyone to try it?  Like, if we get to the point where we are literally doing preference annotations on Incomprehensibly Esoteric Things, and we know we're basically pushing button A or button B at random because we don't know what's going on, then I assume we would stop and try something else.

(It is also not obvious to me that the reward modeling experiment factored in this way, with the small teacher "having the right values" but not understanding the tasks well enough to know which actions were consistent with them.  I haven't looked at every section of the paper, so maybe this was addressed?)

In this case, finetuning on preference annotations no longer conveys our preferences to the AI, because the annotations no longer capture our preferences.  Instead, I'd imagine we would want to convey our preferences to the AI in a more direct and task-independent way -- to effectively say, "what we want is for you to do a good job, not hurt anyone, maximize human flourishing; just do whatever accomplishes that."

And since LLMs are very good at language and human-like intuition, and can be finetuned for generic instruction-following, literally just saying that (or something similar) to an instruction-following superintelligent LLM would be at least a strong baseline, and presumably better than preference data we know is garbage.

(In that last point, I'm leaning on the assumption that we can finetune a superintelligence for generic instruction-following more easily than we can finetune it for a specific task we don't understand.

This seems plausible: we can tune it on a diverse set of instructions paired with behaviors we know are appropriate [because the tasks are merely human-level], and it'll probably make the obvious generalization of "ah, I'm supposed to do whatever it says in the instruction slot," rather than the bizarre misfire of "ah, I'm supposed to do whatever it says in the instruction slot unless the task requires superhuman intelligence, in which case I'm supposed to do some other thing."  [Unless it is deceptively aligned, but in that case all of these techniques will be equally useless.])

You might find Appendix G in the paper worth reading.

I think we should have the strong student use the Scientific Method. Specifically, it should:

  1. Come up with hypotheses about the weak supervisor's behavior (such as by doing repeated CoT to predict supervisor and ground-truth labels for specific cases), and notice when it is uncertain between two or more alternative hypotheses (such as by clustering steps from repeated CoTs for the same test case, looking for clusters that are on the same subject, mutually exclusive in occurrence, and logically incompatible, and then clustering these across many test cases).
  2. Prioritize sets of alternative hypotheses (such as by looking at how many predicted ground truth labels would change if we resolved a specific uncertainty).
  3. For each set of alternative hypotheses it prioritized resolving, devise an experiment whose goal is to maximize expected information gain: create some new test cases that the strong student expects to be
    1. diagnostic, i.e. seeing labels for these cases would provide good evidence to help resolve the uncertainty, and
    2. simple enough that the weak supervisor can label them accurately.
  4. Send those new test cases to the weak supervisor for labeling.
  5. Periodically, perform an approximate Bayesian update on the new results by retraining (or online training) with the new labeled test case results.

This is somewhat similar to the well-known practice, when training a classifier, of oversampling points near the classification boundary: some data points provide more information gain than others. The Scientific Method is the best-known approach to maximizing information gain (far more effective than bandit-style explore-exploit).
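Step 2's prioritization can be approximated with a standard active-learning heuristic: rank candidate test cases by the student's predictive entropy, as a crude proxy for expected information gain. A minimal sketch in Python (the function names and the entropy proxy are my own illustration, not anything from the paper or the comment above):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize_queries(candidate_probs, k):
    """Rank candidate test cases by the student's predictive entropy and
    return the indices of the k most uncertain ones -- the cases whose
    labels would, under this crude proxy, carry the most information."""
    scored = sorted(range(len(candidate_probs)),
                    key=lambda i: entropy(candidate_probs[i]),
                    reverse=True)
    return scored[:k]

# Three candidate cases: confident, maximally uncertain, confident.
probs = [[0.95, 0.05], [0.5, 0.5], [0.9, 0.1]]
print(prioritize_queries(probs, 1))  # → [1]: the 50/50 case is queried first
```

In a real version of the procedure above, the score would come from disagreement between the competing hypotheses rather than raw entropy, but the select-the-most-informative-cases structure is the same.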

Steps 1. and 3. require making some moderately sophisticated logical deductions. With suitable prompting/finetuning, a model of GPT-4 strength should be fairly capable of these (obviously a stronger model would do better).

Step 3. requires that the difficulty of creating a suitable new test case is within the capabilities of the strong student model, which depends upon the nature of the task being trained. In practice, for a strong student of GPT-4 level, I would expect this to generally be feasible for many tasks.

I am here assuming that the strong student separately predicts both weak-supervisor labels and ground-truth labels, i.e. that it has a mistake theory about the weak supervisor more sophisticated than random noise, one capable of modeling systematic mistakes as well. For example, on the question-answer-pair preference modeling task, it might predict that flattery or sycophancy in an answer would increase the labeled score, while recognizing that these are not valid ways to improve an answer. (I'd expect GPT-4 to be able to figure this out from its knowledge of humans; if not, I'm sure it could figure out how to ask us, so discussions with the humans to check questions like that should be added to the procedure above.) These rhetorical tricks would therefore increase the predicted weak-supervisor score but not the predicted ground-truth score. In fact, they should decrease the predicted ground-truth score, since humans don't like being successfully manipulated (once they realize it has happened). So the strong student should learn not to do these things, even though they work on the weak supervisor's reward model.
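The point about keeping the two predictions apart can be made concrete with even a toy systematic-mistake model; the particular weights below are made up purely for illustration:

```python
def predicted_scores(quality, sycophancy):
    """Toy mistake model of the weak supervisor: it is (incorrectly)
    swayed by sycophancy, while the student's inferred ground truth
    penalizes it. The 0.5 and 0.3 weights are illustrative only."""
    weak_supervisor_score = quality + 0.5 * sycophancy  # flattery fools the supervisor
    ground_truth_score = quality - 0.3 * sycophancy     # but lowers true answer quality
    return weak_supervisor_score, ground_truth_score

# The same flattering answer looks better to the supervisor
# but worse under the student's model of the ground truth:
honest = predicted_scores(1.0, 0.0)
flattering = predicted_scores(1.0, 1.0)
assert flattering[0] > honest[0] and flattering[1] < honest[1]
```

A student trained against its predicted ground-truth score, rather than its predicted supervisor score, then avoids the trick even though the trick "works".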

Good paper - even if it shows that the problem is hard!

Sounds like it might be worth it to me to spend time understanding the "confidence loss" to figure out what's going on. There's an obvious intuitive parallel to a human student going "Ah yes, I know what you're doing" - rounding off the teacher to the nearest concept the student is confident in. But it's not clear how good an intuition pump that is.
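For reference, the auxiliary confidence loss in the paper is (roughly) a mixture of the cross-entropy to the weak label and the cross-entropy to the student's own hardened prediction, so a sufficiently confident student is rewarded for sticking to its own answer rather than imitating the teacher's mistake. A binary-classification sketch (in the paper, alpha is warmed up over training and the hardening threshold is set adaptively; both are fixed constants here for simplicity):

```python
import math

def cross_entropy(p, target):
    """Binary cross-entropy of predicted probability p against a hard 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def confidence_loss(p_strong, weak_label, alpha=0.5, threshold=0.5):
    """Mix the loss toward the weak label with the loss toward the
    student's own hardened (thresholded) prediction. When the student
    is confident and disagrees with the weak label, the second term
    pulls it toward its own answer instead of the teacher's."""
    hardened = 1.0 if p_strong > threshold else 0.0
    return ((1 - alpha) * cross_entropy(p_strong, weak_label)
            + alpha * cross_entropy(p_strong, hardened))

# A confident student (p=0.9) facing a contrary weak label (0) is
# penalized less than under plain imitation of the weak label:
plain = confidence_loss(0.9, 0, alpha=0.0)  # pure cross-entropy to the weak label
mixed = confidence_loss(0.9, 0, alpha=0.5)
assert mixed < plain
```

This is one way to formalize the "rounding off" intuition: the auxiliary term makes faithfully imitating the teacher's systematic errors actively costly once the student is confident.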

I agree with Roger that active learning seems super useful (especially for meta-preferences, my typical hobby-horse). It seems a lot easier for the AI to learn about how the teacher generalizes (and how it wants to generalize) if it gets to do experiments, rather than having to wait for natural evidence to crop up in the data. This definitely gets me excited about brainstorming experiments we could do along these lines in the near term.

Though there's likely an uncanny valley here: if the teacher is making systematic mistakes that you're wholly relying on the student's inductive biases to correct, then the student being better at learning the teacher's generalization behavior will make its performance worse! Maybe you get out of the valley on the other side when the student learns a model of how much to use its inductive biases that approximates what the teacher wants it to do.