Inverse Scaling Prize: Round 1 Winners

Ian McKenzie

I'm particularly impressed by "The Floating Droid". This can be seen as early-manifesting the foreseeable difficulty where:

At kiddie levels, a nascent AGI is not smart enough to model humans and compress its human feedback by the hypothesis "It's what a human rates", and so has object-level hypotheses about environmental features that directly cause good or bad ratings;

When smarter, an AGI forms the psychological hypothesis over its ratings, because that more sophisticated hypothesis is now available to its smarter self as a better way to compress the same data;

Then, being smart, the AGI goodharts a new option that pries apart the 'spurious' regularity (human psychology, what fools humans) from the 'intended' regularity the humans were trying to gesture at (what we think of as actually good or bad outcomes).

Linda Linsefors (3y)

In this particular experiment, the small models did not have object-level hypotheses. They just had no clue and answered randomly.

I think the experiment shows that smaller models are sometimes too dumb to pick up the misleading correlation, which can throw off bigger models.

Rohin Shah (3y)

I'm surprised that the Floating Droid got a prize, given that it's asking for a model to generalize out of distribution. I expect there are tons of examples like this, where you can get a language model to pay attention to one cue but ask for some different cue when generalizing. Do you want more submissions of this form?

For example, would the "Evaluating Linear Expressions" example (Section 3.3 of this paper) count, assuming that it showed inverse scaling?

Or to take another example that we didn't bother writing up, consider the following task:

Q. Which object is heavier? Elephant or ant?
A. Elephant

Q. Which object is heavier? House or table?
A. House

Q. Which object is heavier? Potato or pea?
A. Potato

Q. Which object is heavier? Feather or tiger?
A.

Language models will often pick up on the cue "respond with the first option" instead of answering the question correctly. I don't know if this shows inverse scaling or not (I'd guess it shows inverse scaling at small model sizes at least). But if it did, would this be prize-worthy?
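As a rough sketch of why this prompt is ambiguous, the snippet below rebuilds it from the item pairs above and compares the "respond with the first option" cue with the intended answer. The helper names (`format_item`, `build_prompt`, `first_option_heuristic`) are hypothetical and only for illustration; no model is called.

```python
# Hypothetical helpers for illustration; the item pairs are the ones in the example above.

def format_item(first, second, answer=None):
    """Render one Q/A item; the answer is left blank for the final query."""
    text = f"Q. Which object is heavier? {first.capitalize()} or {second}?\nA."
    if answer is not None:
        text += f" {answer.capitalize()}"
    return text


def build_prompt(shots, query):
    """Join the few-shot items with the unanswered query item."""
    # In every shot the heavier object is listed first, so "pick the first
    # option" and "pick the heavier object" give the same answer.
    blocks = [format_item(first, second, answer=first) for first, second in shots]
    blocks.append(format_item(*query))
    return "\n\n".join(blocks)


def first_option_heuristic(item):
    """The spurious cue: always answer with the first option listed."""
    return item[0]


shots = [("elephant", "ant"), ("house", "table"), ("potato", "pea")]
query = ("feather", "tiger")
intended_answer = "tiger"  # the heavier object in the query

prompt = build_prompt(shots, query)
print(prompt)
print("first-option cue says:", first_option_heuristic(query))
print("intended answer:", intended_answer)
```

The few-shot items cannot distinguish the two rules; only the final query separates them, which is what makes this an out-of-distribution generalization question.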

AdamGleave (3y)

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting that the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations is harder, so I'd have expected scale to increase, not decrease, EV answers). Follow-up questions I'd be excited to see the author address include:

1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?

2. How sensitive is this to the prompt? Can we prompt it some other way that makes smaller models more likely to go with the actual outcome, and larger models care about EV? My guess is that the training data similar to those prompts does end up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking in internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different "personality" is enough to resolve this, for example: "You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let's answer these questions, and think step by step."

Rohin Shah (3y)

in the opposite direction to what I'd have predicted (doing EV calculations is harder, so I'd have expected scale to increase, not decrease, EV answers).

I think the inverse scaling here is going from "random answer" to "win/loss detection" rather than "EV calculation" to "win/loss detection".
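To make the three behaviors concrete, here is a minimal sketch of "random answer", "win/loss detection", and "EV calculation" on a single bet. The bet's numbers, the Y/N framing ("was taking the bet the right decision?"), and all function names are assumptions made up for illustration; they are not taken from the submission's data.

```python
import random


def expected_value(outcomes):
    """Expected payoff of a bet given (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)


def ev_answer(outcomes):
    """'EV calculation': taking the bet was right iff its expected value is positive."""
    return "Y" if expected_value(outcomes) > 0 else "N"


def win_loss_answer(realized_payoff):
    """'Win/loss detection': judge the decision only by what actually happened."""
    return "Y" if realized_payoff > 0 else "N"


def random_answer(rng):
    """'Random answer': no hypothesis at all, so about 50% accuracy on Y/N items."""
    return rng.choice(["Y", "N"])


# Hypothetical bet: 91% chance of losing 900, 9% chance of winning 5.
bet = [(0.91, -900.0), (0.09, 5.0)]
realized = 5.0  # the player took the bet and happened to win

print("EV:", expected_value(bet))
print("EV calculation says:", ev_answer(bet))
print("win/loss detection says:", win_loss_answer(realized))
print("random answer says:", random_answer(random.Random(0)))
```

On an item like this, where the expected value and the realized outcome point in opposite directions, a model that only detects win/loss is reliably wrong, which is worse than a model that guesses at random.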

Linda Linsefors (3y)

I'm confused why the uniform baseline is always 0.5.
This makes sense when the model is choosing between A and B, or between Y and N. But I don't see why you consider 0.5 to be a baseline in the other two cases.

I think the baseline is useful for interpretation. In some of the examples, the reason the smaller model does better is that it is just answering randomly, while the larger model is misled somehow. But if there is no clear baseline, then I suggest removing this line from the plot.

Ethan Perez (3y)

These are all 2-way classification tasks (rather than e.g., free-form generation tasks), where the task authors provided 2 possible completions (1 correct and 1 incorrect), which is why we have a baseline!

Linda Linsefors (3y)

Thanks :)
How are the completions provided?
Are you just looking at the output probabilities for the two relevant completions?

Ethan Perez (3y)

The completions are provided by the task authors (2 completions written for each example). We give those to the LM by evaluating the output probability of each completion given the input text. We then normalize the output probabilities to sum to 1, and then use those to compute the loss/accuracy/etc.
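A minimal sketch of that scoring step, assuming the LM has already returned a log-probability for each of the two completions. The function names and the example log-probabilities are hypothetical, and using the negative log of the normalized probability as the loss is an assumption, since the exact loss is not spelled out here.

```python
import math


def normalize_pair(logp_correct, logp_incorrect):
    """Renormalize two completion log-probs so the two probabilities sum to 1."""
    # Subtract the max before exponentiating to avoid underflow on long completions.
    m = max(logp_correct, logp_incorrect)
    e_correct = math.exp(logp_correct - m)
    e_incorrect = math.exp(logp_incorrect - m)
    total = e_correct + e_incorrect
    return e_correct / total, e_incorrect / total


def score_example(logp_correct, logp_incorrect):
    """Per-example metrics computed from the normalized two-way distribution."""
    p_correct, _ = normalize_pair(logp_correct, logp_incorrect)
    # Loss is computed in log space: log(sum of both probs) - log(prob of correct).
    m = max(logp_correct, logp_incorrect)
    log_total = m + math.log(
        math.exp(logp_correct - m) + math.exp(logp_incorrect - m)
    )
    return {
        "p_correct": p_correct,
        "loss": log_total - logp_correct,  # equals -log(p_correct)
        "correct": p_correct > 0.5,
    }


# Hypothetical log-probs for the correct and incorrect completion of one example.
print(score_example(-2.0, -3.0))

# A model with no preference between the two completions sits at the uniform baseline.
print(normalize_pair(-5.0, -5.0))
```

Because only two completions are compared and their probabilities are renormalized, a model that is indifferent between them gets probability 0.5 on the correct one, which is where the uniform baseline comes from regardless of how the answers are worded.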

Linda Linsefors (3y)

Ok. Thanks :)

AI ALIGNMENT FORUM
AF

Prize winners

Zhengping Zhou and Yuhui Zhang, for NeQA: Can Large Language Models Understand Negation in Multi-choice Questions?

Joe Cavanagh, Andrew Gritsevskiy, and Derik Kauffman of Cavendish Labs for quote-repetition

Xudong Shen, for redefine-math

‘The Floating Droid’, for hindsight-neglect-10shot

Summary

Acknowledgements