TL;DR: We’re launching the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”). We hypothesize that inverse scaling is often a sign of an alignment failure and that more examples of alignment failures would benefit empirical alignment research. We believe that this contest is an unusually concrete, tractable, and safety-relevant problem for engaging alignment newcomers and the broader ML community. This post will focus on the relevance of the contest and the inverse scaling framework to longer-term AGI alignment concerns. See our GitHub repo for contest details, prizes we’ll award, and task evaluation criteria.
Recent work has found that Language Models (LMs) predictably improve as we scale LMs in various ways (“scaling laws”). For example, the test loss on the LM objective (next word prediction) decreases as a power law with compute, dataset size, and model size:
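To make the power-law claim concrete, here is a minimal sketch of how one might estimate such a trend from a handful of (model size, loss) measurements. It assumes a simple form, loss ≈ a · size^(−alpha), and fits it with ordinary least squares in log-log space. The function name `fit_power_law` and all of the numbers are hypothetical and purely illustrative; they are not taken from any real model family or from the contest's evaluation code.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-alpha) by least squares in log-log space.

    Returns (a, alpha). alpha > 0 means the loss falls as size grows
    (standard scaling); alpha < 0 means the loss rises with size
    (the "inverse scaling" pattern).
    """
    if len(sizes) != len(losses) or len(sizes) < 2:
        raise ValueError("need at least two (size, loss) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(loss) vs. log(size)
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), -slope


# Synthetic, made-up data (assumption: exact power laws, no noise).
sizes = [1e6, 1e7, 1e8, 1e9]
standard = [5.0 * s ** -0.08 for s in sizes]  # loss improves with size
inverse = [0.5 * s ** 0.05 for s in sizes]    # loss worsens with size

a_std, alpha_std = fit_power_law(sizes, standard)
a_inv, alpha_inv = fit_power_law(sizes, inverse)
print(f"standard trend: alpha = {alpha_std:.3f}")
print(f"inverse trend:  alpha = {alpha_inv:.3f}")
```

In this sketch, the sign of the fitted exponent is what separates ordinary scaling from inverse scaling. Real measurements are noisy and span only a few model sizes, so a fit like this is a rough diagnostic rather than proof of a trend.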
Scaling laws appear in a variety of domains, ranging from transfer learning to generative modeling (on images, video, multimodal data, and math) and reinforcement learning. We hypothesize that alignment failures often show up as scaling laws but in the opposite direction: behavior gets predictably worse as models scale, which we call “inverse scaling.” We may expect inverse scaling, e.g., if the training objective or data are flawed in some way. In this case, the training procedure would actively train the model to behave in flawed ways, and the flawed behavior would grow worse as we scale. The literature contains a few potential examples of inverse scaling. For example, increasing LM size appears to increase social biases on BBQ and falsehoods on TruthfulQA, at least under certain conditions.
As a result, we believe that the prize may help to uncover new alignment-relevant tasks and insights by systematically exploring the space of tasks where LMs exhibit inverse scaling. In particular, submissions must demonstrate new or surprising examples of inverse scaling. This excludes, e.g., most misuse-related behaviors where you specifically prompt the LM to generate harmful or deceptive text; we don't consider scaling on these behaviors to be surprising in most cases, and we're hoping to uncover more unexpected, undesirable behaviors.
Below, we outline two questions in AI alignment that we believe the Inverse Scaling Prize may help to answer.
The above question is important to answer to avoid running into outer-alignment-related catastrophes [1, 2]. Language Models (LMs) are “outer aligned” to the extent that doing well on the LM objective (next word prediction) results in desirable model behavior. Inverse scaling on a task we care about is evidence that the LM objective is misaligned with human preferences; better and better performance on the training objective (language modeling) leads to worse and worse performance on a task we care about. Finding inverse scaling tasks is thus helpful for us in understanding the extent to which the LM objective is outer misaligned, which may be important in two ways:
We believe it is important not to let the above issues go uncaught. Otherwise, we may end up in a situation where we realize later that LMs are flawed in some important or obvious way, but we accept this limitation as the way things are (as with, e.g., social media’s various negative impacts on users), because it’s too difficult or too late to fix. This kind of failure can lead to catastrophes that arise from the consequences of many low-stakes failures building up over time. We see the Inverse Scaling Prize as a step in the direction of catching more outer alignment failures.
Having a good outer alignment benchmark is also valuable for outer alignment research, and there currently isn’t a good benchmark suite. Empirical alignment labs typically resort to evaluating on a broad set of NLP tasks (primarily an evaluation of capabilities) and to human evaluation (for alignment-related properties), which makes iteration on the alignment axis harder, slower, and more costly. There are a few tasks where failures seem potentially robust to scaling (e.g., TruthfulQA, RealToxicityPrompts, BBQ); these few tasks are frequently used to evaluate current AI alignment techniques like RL from human feedback, leaving us at risk of overfitting to them. We hope the Inverse Scaling Prize helps to uncover at least a few more alignment-relevant tasks to help with empirical alignment research.
With more examples of outer alignment failures, we hope to gain a better understanding of what causes outer misalignment and be better able to mitigate it (e.g., to suggest improvements for future pretraining runs). Concretely, the inverse scaling tasks we receive could help us or other research groups answer the following outer-alignment-relevant questions:
Speculative note on inner misalignment: Looking for inverse scaling laws could also be a useful lens for finding inner misalignment. We are excited to receive inner-alignment-related task submissions, alongside clear explanations for how the observed scaling behavior relates to inner alignment. There are two kinds of inner alignment failures we could look for with scaling laws:
Examining inverse scaling tasks may yield useful, general observations about how to uncover misalignment. Such observations could generalize to other models/objectives (e.g., LMs trained with RLHF) and to other domains (e.g., vision, vision-and-language, or RL environments). Such observations could come from asking the following questions using inverse scaling tasks:
We see the Inverse Scaling Prize as just a first step in the directions outlined above. We are also fairly unsure about how useful the contest will turn out to be in the end: at best, it could help bring capable people into alignment work and expose early signs of relevant emerging problems; at worst, it could be a distraction or feed into false confidence that large language models are safe by default. We’re optimistic, though, and hope you’ll help us push it toward these best-case outcomes. If you’re excited about the contest, we’d appreciate you sharing this post or the contest link with people who might be interested in participating. We'd also encourage you to comment on this post if you have ideas you'd like to see tried, e.g., by newcomers to alignment research. Best of luck!
We're grateful to Owain Evans, Jeff Wu, Evan Hubinger, and Richard Ngo for helpful feedback on this post.
One possibility: 'macaronic' adversarial prompts https://arxiv.org/pdf/2208.04135.pdf#page=6