This is a linkpost for a recent paper from Anthropic: A General Language Assistant as a Laboratory for Alignment.


Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a `preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.

I think this is great work and shows encouraging results. I take the fact that alignability scales with model size as evidence for the most convenient form of the natural abstraction hypothesis:

  • Stronger AIs have better internal models of human values.
  • A few well chosen gradient updates can “wire up” the human values model to the AI’s output.

I’d be interested to see this sort of work extended to reinforcement learning systems, which I think are more dangerous than language models.

Edit: thanks to Daniel Kokotajilo's comment for prompting me to highlight some additional interesting results, now added here:

Another interesting result is the paper's investigation of "alignment taxes" (the potential loss in performance of aligned models compared with unaligned models):

The "HHH prompt" is a prompt they use to induce "Helpful, Honest and Harmless" output. Context distillation refers to training an unprompted model to imitate the logits generated when using the HHH prompt.

Their alignment tax results indicate the models suffer a small alignment penalty on Lambada, which is roughly constant with respect to model size (and therefore decreases in relative importance as model performance increases). In contrast, smaller models suffer a large alignment tax on programming evaluations, while larger models actually get a slight alignment benefit.


The HHH prompt doesn't contain examples of "Harmlessness", where the assistant declines to help the user perform harmful acts. Despite this, larger HHH models score better on harmlessness evaluations, suggesting the larger models are better at generalizing to other aspects of human values, rather than just better at matching the expected format or being just generally more capable. 

Additionally, using prompt distillation helps larger models and hurts smaller models on Truthful QA, which asks about common misconceptions. Larger models better represent those misconceptions. Without the prompt, models become less accurate with increasing size, while larger models become more accurate. This is another example of the alignment tax decreasing for larger models. It's also a potential example of an "alignment failure" being corrected by an alignment approach.



New Comment