Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

This is a followup to what I cheekily call Anthropic's "just try to get the large model to do what you want" research agenda. (Previously: A General Language Assistant as a Laboratory for Alignment, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Language Models (Mostly) Know What They Know)

The most interesting takeaway for me is that this is the first paper where Anthropic benchmarks their 175B parameter language model (probably a Claude variant). Previous papers only benchmarked up to 52B parameters. However, we don't have the performance of this model on standard benchmarks (the only benchmarked model from Anthropic is a 52B parameter one called stanford-online-all-v4-s3). They also don't give details about its architecture or pretraining procedure.

In this paper (Ganguli and Askell et al.), the authors study what happens when you just ... ask the language model to be less biased (that is, change their answers based on protected classes such as age or gender). They consider several setups: asking questions directly (Q), adding in the instruction to not be biased (Q+IF), giving it the instruction + chain of thought (Q+IF+CoT), and in some cases, asking it to match particular statistics.^[1]

They find that as you scale the parameter count of their RLHF'ed language models,^[2] the models become more biased, but they also become increasingly capable of correcting for their biases:

On both the BBQ benchmark and the Winogender benchmark, we see signs of life 22B parameters. For the admissions discrimination benchmark, we see instead see instruction following + CoT having an effect much earlier, but pure instruction following having no real effect until 52B parameters.

They also report how their model changes as you take more RLHF steps:

First, this suggests that RLHF is having some effect on instruction following: the gap between the Q and Q+IF setups increases as you scale the number of RLHF steps, for both BBQ and admissions discrimination. (I'm not sure what's happening for the gender bias one?) However, simply giving the language model instructions and prompting it to do CoT, even after 50 RLHF steps, seems to have a significantly larger effect than RLHF.

I was also surprised at how few RLHF steps are needed to get instruction following -- the authors only consider 50-1000 steps of RLHF, and see instruction following even after 50 RLHF steps. I wonder if this is a property of their pretraining process, a general fact about pretrained models (PaLM shows significant 0-shot instruction following capabilities, for example), or if RLHF is just that efficient?

^{^}
The authors caution that they've done some amount of prompt engineering, and "have not systematically tested for this in any of our experiments."
^{^}
They use the same RLHF procedure as in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

30

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

30