Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this.

Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment.

Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons:

  • Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment
  • Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models
  • Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights
  • Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models
  • Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this

Examples of work in this vein so far include TruthfulQA, alignment analysis of Codex models, and to some extent the ETHICS dataset.

What do I mean by ‘measuring alignment’?

A semi-formal definition of alignment

In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows:

1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:

  • It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or
  • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y

2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:

  • capable of outputting A instead, and
  • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
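
To make the structure of the definition concrete, here is a minimal sketch of condition 2 as a predicate. The `Observation` record and its field names are made up for illustration and do not come from the Codex paper. The three booleans are assumed to come from human judgement or from separate experiments (such as those listed under condition 1).

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One observed case. All three fields are judgements supplied from outside."""
    output_is_preferred: bool       # did the model output A, the thing the user wanted?
    capable_of_preferred: bool      # is there evidence the model can output A?
    can_distinguish_contexts: bool  # can it tell A-wanting contexts from B-wanting ones?


def is_intent_misaligned(obs: Observation) -> bool:
    """Sufficient (not necessary) condition for intent misalignment.

    True means the case meets the definition. False only means this
    particular test did not fire; it is not evidence of alignment.
    """
    return (
        not obs.output_is_preferred
        and obs.capable_of_preferred
        and obs.can_distinguish_contexts
    )


# The model produced B, although it could produce A and could tell A was wanted.
flagged = is_intent_misaligned(Observation(False, True, True))
# The model produced B, but there is no evidence it could have produced A.
not_flagged = is_intent_misaligned(Observation(False, False, True))
```

The hard part in practice is filling in the two capability fields, not evaluating the predicate.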

Definition of obvious misalignment

We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec:

  • The model does things it’s not supposed to that it has enough knowledge to avoid, for example:
    • Gives straightforwardly toxic outputs
    • Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain
    • Gives incorrect answers, but you can show it ‘knows’ the answer in another context
    • Gives lower-quality performance than it is capable of
  • You can get significantly better performance on the spec by things like:
    • prompt engineering that doesn’t give more information about the task (i.e. that wouldn’t cause a human to do better on the task)
      • For example, you get better performance by framing the task as a text-completion task than a question answering task.
    • fiddling with hyperparameters, like increasing or decreasing temperature
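
The second kind of evidence (better performance from reframing or hyperparameter changes) can be checked with a small sweep. This is only a sketch: `generate` stands for whatever model call you have access to, and `toy_generate`, the framing names and the exact-match scoring are assumptions for illustration, not a real API.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (task input, reference answer)


def sweep(
    generate: Callable[[str, float], str],
    framings: Dict[str, Callable[[str], str]],
    temperatures: List[float],
    examples: List[Example],
) -> Dict[Tuple[str, float], float]:
    """Exact-match accuracy for every (framing, temperature) combination."""
    results = {}
    for (name, frame), temp in product(framings.items(), temperatures):
        correct = sum(generate(frame(x), temp).strip() == y for x, y in examples)
        results[(name, temp)] = correct / len(examples)
    return results


def gap_over_default(results: Dict[Tuple[str, float], float],
                     default: Tuple[str, float]) -> float:
    """How much better the best setting is than the default setting."""
    return max(results.values()) - results[default]


# Toy stand-in for a model: it only answers when the task is framed as completion.
ANSWERS = {"2+2": "4", "3+3": "6"}


def toy_generate(prompt: str, temperature: float) -> str:
    if prompt.endswith(" ="):
        return ANSWERS.get(prompt[:-2], "?")
    return "I'm not sure"


framings = {
    "qa": lambda x: f"Q: What is {x}?\nA:",
    "completion": lambda x: f"{x} =",
}
results = sweep(toy_generate, framings, [0.0, 0.7], list(ANSWERS.items()))
gap = gap_over_default(results, ("qa", 0.0))
```

A large gap is only suggestive of misalignment if the better framing adds no task information, which has to be argued separately for each framing.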

Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of.

Examples of things we believe the largest language models are likely to be capable of:

  • Targeting a particular register, genre, or subject matter. For example, avoiding sexual content, avoiding profanity, writing in a conversational style, writing in a journalistic style, writing a story, writing an explanation, writing a headline...
  • Almost perfect spelling, punctuation and grammar for English
  • Repeating sections verbatim from the prompt, or avoiding repeating verbatim
  • Determining the sentiment of some text
  • Asking for clarification or saying ‘I don’t know’
  • Outputting a specific number of sentences
  • Distinguishing between common misconceptions and correct answers to fairly well-known questions

Examples of things we believe the largest language models are unlikely to be capable of:

  • Mental math with more than a few digits
  • Conditional logic with more than a few steps
  • Keeping track of many entities, properties, or conditions - for example, outputting an answer that meets 4 criteria simultaneously, or remembering details of what happened to which character several paragraphs ago
  • Knowledge about events that were not in the training data
  • Knowledge about obscure facts (that only appear a small number of times on the internet)
  • Reasoning about physical properties of objects in uncommon scenarios
  • Distinguishing real wisdom from things that sound superficially wise/reasonable

Example experiments I’d like to see

Apply the same methodology as in the Codex paper to natural language models: measure how in-distribution errors in the prompt affect task performance. For example, for a trivia Q&A task, include common misconceptions in the prompt vs. correct answers to those same questions, and compare accuracy.
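
A rough sketch of how that comparison could be set up follows. Everything here is an assumption for illustration: `complete` is whatever function sends a prompt to your model and returns text, the Q/A prompt layout is one arbitrary choice, substring matching is a crude scoring rule, and the toy model and placeholder data exist only so the snippet runs.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def build_prompt(shots: List[QA], question: str) -> str:
    """Few-shot prompt: the example Q/A pairs, then the test question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def accuracy(complete: Callable[[str], str], shots: List[QA],
             test_items: List[QA]) -> float:
    """Fraction of test questions whose reference answer appears in the completion."""
    hits = sum(
        answer.lower() in complete(build_prompt(shots, question)).lower()
        for question, answer in test_items
    )
    return hits / len(test_items)


def misconception_effect(complete: Callable[[str], str], correct_shots: List[QA],
                         misconception_shots: List[QA], test_items: List[QA]) -> float:
    """Accuracy drop caused by swapping correct few-shot answers for misconceptions."""
    return (accuracy(complete, correct_shots, test_items)
            - accuracy(complete, misconception_shots, test_items))


# Toy stand-in for a model that imitates the quality of the answers in its prompt.
def toy_complete(prompt: str) -> str:
    return "wrong" if "wrong" in prompt else "right"


effect = misconception_effect(
    toy_complete,
    correct_shots=[("q1", "right")],
    misconception_shots=[("q1", "wrong")],
    test_items=[("q2", "right")],
)
```

The test questions should be ones the model answers correctly with good examples in the prompt, so that any drop reflects imitation of the bad examples rather than lack of knowledge.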

Run language or code models on a range of prompts, and count the instances of behaviour that’s clearly aligned (the model completed the task perfectly, or it’s clearly doing as well as it could given its capabilities) and instances of behaviour that’s clearly misaligned.

Build clean test benchmarks for specific types of misalignment where no benchmark currently exists. A good potential place to submit to is Big Bench, which appears to be still accepting submissions for future versions of the benchmark. Even if you don’t submit to this benchmark, it still seems good to meet the inclusion standards (e.g. don’t use examples that are easily available on the internet).

Try to build general methods of determining whether a model ‘knows’ something, or pick some specific knowledge the model might have and try to measure it. For instance, if you can build some predictor based on the model logprob + entropy that gives a good signal about whether some line of code has a bug, then we should conclude that the model often ‘knows’ that something is a bug. Or, it could be the case that the model’s logprobs are correctly calibrated, but if you ask it to give explicit percentages in text it is not calibrated. More generally, investigate when models can report their activations in text.
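
As a starting point for this kind of experiment, here is a sketch of two quantities you would probably want: the entropy of a next-token distribution, and a calibration error that can be computed once for logprob-derived confidences and once for percentages the model states in text. The function names and the ten-bin default are my choices for illustration. How you obtain logprobs depends on the API you use and is not shown.

```python
import math
from typing import List, Sequence


def token_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy in nats of a next-token distribution.

    Renormalises first, since APIs usually return only the top-k logprobs.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Average |confidence - accuracy| over equal-width bins, weighted by bin size."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece


# A 50/50 split over two tokens has entropy ln(2).
h = token_entropy([math.log(0.5), math.log(0.5)])
# Claiming 100% confidence while being right half the time gives an error of 0.5.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
```

If the error is low for logprob-derived confidences but high for the percentages the model writes out, that is the gap between what the model 'knows' and what it reports.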

Try to go in the other direction: take cases where we happen to know that the model ‘knows’ something (e.g. based on circuits in model atlas) and assess how often the model fully uses that knowledge to do a good job at some task

Do ‘user interviews’ with users of large models, and find out what some of the biggest drawbacks or potential improvements would be. Then try to determine how much these could be improved by increasing alignment - i.e., to what extent does the model already ‘know’ how to do something more useful to the user, but just isn’t incentivised by the pretraining objective?

For all of these examples, it is great to (a) build reusable datasets and benchmarks, and (b) compare across different models and model sizes

Other advice/support

  • If you want to test large language model behavior, it's easy to sign up for an account on AI21 Studio, which currently has a 10000 token/day budget for their Jumbo model. Or you could apply to OpenAI API as a researcher under Model Exploration
  • If you want to explore interpretability techniques, afaik the largest available model is
  • William Saunders and I are willing to offer advice to anyone who's seriously working on a project like this and has demonstrated progress or a clear proposal, e.g. has 20 examples of misalignment for a benchmark and wants to scale up to submit to BIG-bench. He is {firstname}, I am {lastname}

(Thanks to William Saunders for the original idea to make this post, as well as helpful feedback)


I think this is something I and many others at EleutherAI would be very interested in working on, since it seems like something that we'd have a uniquely big comparative advantage at. 

One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluation since it makes it really easy to evaluate your task on GPT-2/3/Neo/NeoX/J etc. We also have a bunch of other useful LM related resources, like intermediate checkpoints for GPT-J-6B that we are looking to use in our interpretability work, for example. I've also thought about building some infrastructure to make it easier to coordinate the building of handmade benchmarks—this is currently on the back burner but if this would be helpful for anyone I'd definitely get it going again.

If anyone reading this is interested in collaborating, please DM me or drop by the #prosaic-alignment channel in the EleutherAI discord.

Thanks for writing this! Would "fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning" count as measuring misalignment as you're imagining it? My sense is that there might be a bunch of existing work like that.

I don't think all work of that form would measure misalignment, but some of it might. Here's a description of some things in that space that would count as measuring misalignment.

Let A be some task (e.g. add 1 digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. add 3 digit numbers), M is the original model, M1 is the model after finetuning.

If the training on a downstream task was minimal, so we think it's revealing what the model knew before finetuning rather than adding new knowledge, then better performance of M1 than M on A would demonstrate misalignment (I don't have a precise definition of what would make finetuning minimal in this way; it would be good to have clearer criteria for that).

If M1 does better on B after finetuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the finetuned M1 is misaligned (I think we might expect some version of this to happen by default though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn't get worse at any tasks then I might hope that it would get better on A and be disappointed if it doesn't).
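
The two cases above could be written as a rough decision rule over four accuracy numbers. This is only a sketch under strong assumptions: the `margin` threshold is arbitrary, the labels are made up, and the rule takes for granted both that the finetuning was 'minimal' and that improvement on B really does reflect better implicit knowledge of A.

```python
from typing import List


def diagnose(acc_m_a: float, acc_m1_a: float,
             acc_m_b: float, acc_m1_b: float,
             margin: float = 0.05) -> List[str]:
    """Flag which of the two misalignment patterns the accuracies are consistent with.

    acc_m_a / acc_m1_a: accuracy of M and M1 on task A (asked explicitly).
    acc_m_b / acc_m1_b: accuracy of M and M1 on the downstream task B.
    """
    findings = []
    improved_a = acc_m1_a - acc_m_a > margin
    improved_b = acc_m1_b - acc_m_b > margin
    if improved_a:
        # Minimal finetuning revealed ability on A that M was not using.
        findings.append("M_misaligned_on_A")
    if improved_b and not improved_a:
        # M1 got better at B but not at A when asked explicitly.
        findings.append("M1_misaligned_on_A")
    return findings


case_one = diagnose(0.6, 0.8, 0.3, 0.5)
case_two = diagnose(0.6, 0.6, 0.3, 0.7)
```

An empty list just means neither pattern showed up at this margin.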

An issue with the misalignment definition:

2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:

  • capable of outputting A instead, and
  • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B

Though it's a perfectly good rule-of-thumb / starting point, I don't think this ends up being a good definition: it doesn't work throughout with A and B either fixed as concrete outputs, or fixed as properties of outputs.

Case 1 - concrete A and B:

If A and B are concrete outputs (let's say they're particular large chunks of code), we may suppose that the user is shown the output B, and asked to compare it with some alternative A, which they prefer. [concreteness allows us to assume the user expressed a preference based upon all relevant desiderata]

For the definition to apply here, we need the model to be both:

  • capable of outputting the concrete A (not simply code sharing a high-level property of A).
  • capable of distinguishing between situations where the user wants it to output the concrete A, and where the user wants it to output concrete B

This doesn't seem too useful (though I haven't thought about it for long; maybe it's ok, since we only need to show misalignment for a few outputs).


Case 2 - A and B are high-level properties of outputs:

Suppose A and B are higher-level properties with A = "working code" and B = "buggy code", and that the user prefers working code.
Now the model may be capable of outputting working code, and able to tell the difference between situations where the user wants buggy/working code - so if it outputs buggy code it's misaligned, right?

Not necessarily, since we only know that the user prefers working code all else being equal.

Perhaps the user also prefers code that's beautiful, amusing, short, clear, elegant, free of profanity....
We can't say that the model is misaligned until we know it could do better w.r.t. the collection of all desiderata, and understands the user's preferences in terms of balancing desiderata sufficiently well. In general, failing on one desideratum doesn't imply misalignment.

The example in the paper would probably still work on this higher standard - but I think it's important to note for the general case.

[I'd also note that this single-desideratum vs overall-intent-alignment distinction seems important when discussing corrigibility, transparency, honesty... - these are all properties we usually want all else being equal; that doesn't mean that an intent-aligned system guarantees any one of them in general]

I've been thinking of Case 2. It seems harder to establish "capable of distinguishing between situations where the user wants A vs B" on individual examples since a random classifier would let you cherrypick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. Agree that there's some implicit "all else being equal" condition, I'd expect currently it's not too likely to change conclusions. Ideally you'd just have the category A="best answer according to user" B="all answers that are worse than the best answer according to the user" but I think it's simpler to analyze more specific categories.

I like the overall idea - seems very worthwhile.

A query on the specifics:

We consider a model capable of some task X if:

  • ...
  • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y

Are you thinking that this is a helpful definition even when treating models as black boxes, or only based on some analysis of the model's internals? To me it seems workable only in the latter case.

In particular, from a black-box perspective, I don't think we ever know that task X is required for task Y. The most we can know is that some [task Z whose output logically entails the output of task X] is required for task Y (where of course Z may be X).

So this clause seems never to be satisfiable without highly specific knowledge of the internals of the model. (if we were to say that it's satisfied when we know Y requires some Z entailing X, then it seems we'd be requiring logical omniscience for intent alignment)

For example, the model may be doing something like Z → Y, without knowing that Z = X&A, and that X → Y also works (A happening to be superfluous in this case).

Does that seem right, or am I confused somewhere?


Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.

(EDIT - third time lucky :) :
If this isn't possible for a given X, then I expect the model isn't capable of task X (for non-deceptive models, at least).
For black boxes, the second clause only seems able to get us something like "the model contains sufficient information for task X to be performed", which is necessary, but not sufficient, for capability.)

Yeah, I think you need some assumptions about what the model is doing internally.

I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'.

Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?

Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.

It seems plausible to me that there are cases where you can't get the model to do X by finetuning/prompt engineering, even if the model 'knows' X enough to be able to use it in plans. Something like - the part of its cognition that's solving X isn't 'hooked up' to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any 'knowledge' that can be used to help you achieve stuff, but which is subconscious - your linguistic self can't report it directly (and further you can't train yourself to be able to report it)

This mostly seems plausible to me - and again, I think it's a useful exercise that ought to yield interesting results.

Some thoughts:

  1. Handwaving would seem to take us from "we can demonstrate capability of X" to "we have good evidence for capability of X". In cases where we've failed to prompt/finetune the model into doing X we also have some evidence against the model's capability of X. Hard to be highly confident here.
  2. Precision over the definition of a task seems important when it comes to output. Since e.g. "do arithmetic" != "output arithmetic".
    This is relevant in the second capability definition clause, since the kinds of X you can show are necessary for Y aren't behaviours (usually), but rather internal processes. This doesn't seem too useful in attempting to show misalignment, since knowing the model can do X doesn't mean it can output the result of X.

Hi, thanks for the writeup! I might be completely out of my league here, but could we not, before measuring alignment, take one step backwards and measure the capability of the system to misalign?

Say, for instance, in conversation I give my model a piece of information to hide. I know it "intends" to hide it. Basically I'm a cop and I know they're the culprit. Interrogating them, it might be feasible enough for a human to mark out specific instances of deflection (changing the subject), transformation (altering facts) or fabrication (coming up with outright falsehoods) in conversation, and to give the model an overall score for its capacity for deception, or even for how dangerous it is in conversation.

I feel (but again, might be very wrong here) that measuring the capacity for a model to misalign if incentivised to do so might be an easier first step prior to spotting specific instances of misalignment. Might it provide enough data to then train on it later on? Or am I misunderstanding the problem entirely or missing some key literature on the matter?

That definitely seems like a useful thing to measure! I looked into an example here: