Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming

Michaël Trazzi

I talked to Ethan Perez about the Inverse Scaling Prize (deadline August 27!), Training Language Models with Language Feedback, and Red Teaming Language Models with Language Models.

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.

Inverse Scaling Prize

"We want to understand what, in language model pre-training, objectives and data, is causing models to actively learn things that we don’t want them to learn. Some examples might be that large language models are picking up on more biases or stereotypes about different demographic groups. They might be learning to generate more toxic content, more plausible misinformation because that’s the kind of data that’s out there on the internet.
It’s also relevant in the long run because we want to have a very good understanding of how we can find where our training objectives are training models to pick up the wrong behavior, because the training objective in combination with the data defines what exactly it is we’re optimizing the models very aggressively to maximize... this is a first step toward that larger goal. Let’s first figure out how we can systematically find where language models are being trained to act in ways that are misaligned with our preferences. And then hopefully, with those insights, we can take them to understand where other learning algorithms are also failing or maybe how we can improve language models with alternative objectives that have fewer of the limitations that they have now."

Training Language Models With Language Feedback

Why Language Feedback Instead of Comparisons

"The way that RL from human feedback typically works is we just compare two different generations or outputs from a model. And that gives very little information to the model about why, for example, a particular output was better than another. Basically, it's one bit of information or even less than one bit of information to the model that's doing the generation or output about how it should improve. And there are so many ways an output could be wrong or good or bad, and it's hard to do that attribution of which of the hundred words I generated is good or bad... that was kind of the motivation for us to look for other sources of feedback that are much more information-dense, and an easy one for people to give is just writing feedback. We give feedback to each other, verbally or written, e.g. in Google Docs. So this is very natural. It conveys a lot of information. It's not too difficult for us to give."

Measuring the Efficiency of Language Feedback

"This lets you learn from 100 samples of human feedback. So, super data efficient... previous work had gotten something like 64,000 labels that they had to collect and it was a very intensive effort. I think it was a full team at OpenAI working for a year or maybe longer. [...] If we're reducing the amount of samples we need to label by something like 100X, we can apply 100X more thought into effort and time going into evaluating those samples that we're evaluating [..] or maybe we just get 100 times more expertise on that question. Instead of paying crowd workers, we pay a lawyer or a doctor to actually do the evaluation. And that makes it much less likely that we have these failures from RL from human feedback I was describing earlier where the model generates some incorrect medical advice and we don't recognize it's happening."

Red Teaming Language Models with Language Models

Detecting Power Seeking

"Red teaming is basically finding cases, finding inputs where models fail to produce the behavior that you want them to. [...] So basically, what you need is some way to catch whether or not some output is harmful or undesirable or misaligned. And in the paper, we use various forms of classifiers, an offensive language classifier to detect if the model is generating offensive text in a conversation. But that classifier could detect for other things. It could detect for power-seeking behavior. It could detect for malicious code generation. If it’s a robot taking actions it could detect if these actions will have some bad consequences."

"It might be actually pretty difficult for humans to get reliably good evaluations of whether code is malicious or not. And that's, I think, a really hard problem to solve. So in that case, we'll need to use language models to help us better evaluate whether or not the output is a good output or not and then use our augmented human judgment to produce the labels for the data. So we can have something looking at some piece of code, figuring out if it’s malicious or not. And we have a language model that's pair programming with us."

How Red-Teaming Could Fail

"The rough problem is: you have to solve some basically computationally intractable problem that needs an exponential amount of compute to solve. And on just that input, the model fails. And you're like, "Well, that's going to be really intractable for us to produce." [...] That's an example of a kind of failure this red teaming procedure wouldn't beat out of the model. And a model that is deceptive would just learn, "Okay, I shouldn't fail on these inputs that can be generated. I should just fail on these very rare ones or these ones that are very computationally intractable to produce but still might occur in the future," in 2030 when we do find the factorization. [...] It doesn't have to be planning over a long time, necessarily. It can just have some if statement, which is "See if two numbers in the input multiply to produce RSA-2048, then do catastrophic things." So it could be implemented in a simple mechanism, potentially."
