Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that's analogous to the long-term situation we care most about. If you care about both kinds of error, then the baseline of "set a really low classifier threshold" wouldn't work, so you'd be starting from a regime where it is a lot easier to sample errors, and hence easier to measure differences in performance.
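To make the threshold point concrete, here is a minimal sketch of how both error rates move with the threshold. It assumes higher scores mean "flag this example", and the scores and labels are made-up toy data; `error_rates` is a hypothetical helper, not something from the work being discussed.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    scores: classifier scores, where higher means "more likely positive".
    labels: True for actual positives, False for actual negatives.
    """
    # An example is flagged as positive when its score is >= threshold.
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Made-up toy data, purely illustrative.
scores = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
labels = [False, False, True, False, True, True]

# A very low threshold flags everything: no false negatives, all false positives.
low = error_rates(scores, labels, threshold=0.01)
# A middling threshold trades the two kinds of error against each other.
mid = error_rates(scores, labels, threshold=0.5)
```

With the very low threshold the false-negative rate is zero, so there are no errors of that kind left to sample. Once both rates count, every threshold leaves some errors to measure.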
One general piece of advice: it seems like it might be useful to have an interface that shows you multiple samples for each prompt. The OpenAI playground just gives you one sample, so if you use temperature > 0, that sample could be either lucky or unlucky.
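A minimal sketch of the kind of helper such an interface could sit on top of. It assumes you supply your own `sample_fn` that calls whatever model you're using; `sample_many` and `toy_sampler` are hypothetical names, and the toy sampler is a stand-in rather than a real model call.

```python
import random
from typing import Callable, List, Tuple


def sample_many(
    prompts: List[str], sample_fn: Callable[[str], str], n: int = 5
) -> List[Tuple[str, List[str]]]:
    """Draw n independent samples per prompt so they can be viewed side by side."""
    return [(prompt, [sample_fn(prompt) for _ in range(n)]) for prompt in prompts]


def toy_sampler(prompt: str) -> str:
    """Stand-in for a model call at temperature > 0 (hypothetical)."""
    return prompt + " -> " + random.choice(["good", "okay", "bad"])


results = sample_many(["prompt one", "prompt two"], toy_sampler, n=3)
for prompt, samples in results:
    print(prompt)
    for sample in samples:
        print("   ", sample)
```

Seeing several completions next to each other makes it easier to tell whether a prompt reliably works or whether one sample just happened to come out well.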
Maybe a useful way to get feedback on how good you are at doing this would be trying to make predictions based on your experience with language models:
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
Suppose there are two worlds, world W1 and world W2.
In world W1, the question Q="Is there a diamond in the room?" is commonly understood to mean Q1="Is there actually a diamond in the room?"
In world W2 the question Q="Is there a diamond in the room?" is commonly understood to mean Q2="Do I believe there is a diamond in the room?"
Neither world knows how to construct a situation where these are different, so they produce identical training sets for ELK. But the simulator is also trained on a bunch of science fiction novels that contain descriptions of impossible situations where they differ, and the science fiction novels are different in these two worlds.
Is ELK required to answer appropriately in both worlds (answer Q1 when given Q in W1, and Q2 when given Q in W2)? If so, it seems we need some term in the loss outside of the training set to make this happen.
Alternatively, would it be satisfactory to find a solution that doesn't discriminate which world it is in, and instead returns "yes" to Q if and only if Q1="yes" AND Q2="yes"? This means that in world W1 there will be some situations where Q="no" when the diamond is present, but no situations where Q="yes" and the diamond is not present.
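A small sketch of the conjunctive proposal as a truth table, checking the claim about W1. It assumes Q1 and Q2 are simple booleans and treats Q1 as the intended answer in W1; `conjunctive_answer` is a hypothetical name used only for illustration.

```python
from itertools import product


def conjunctive_answer(q1: bool, q2: bool) -> bool:
    """Answer "yes" to Q only when both readings of the question are "yes"."""
    return q1 and q2


# Enumerate every combination of (Q1, Q2) along with the reported answer.
rows = [(q1, q2, conjunctive_answer(q1, q2))
        for q1, q2 in product([True, False], repeat=2)]

# In W1 the intended answer is Q1 (is the diamond actually there?).
false_positives_w1 = [r for r in rows if r[2] and not r[0]]
false_negatives_w1 = [r for r in rows if not r[2] and r[0]]

print("W1 false positives:", false_positives_w1)
print("W1 false negatives:", false_negatives_w1)
```

The only disagreement with Q1 is the case where the diamond is present but not believed to be, which gives a "no". That matches the one-sided error described above, and the same holds for W2 with the roles of Q1 and Q2 swapped.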
Edit: I think this isn't quite right in general; I will try to make it more correct later.
Here's a sketch of a strategy for trying to fix "Strategy: penalize depending on “downstream” variables". I would appreciate feedback on whether it's modeling the difficulty correctly and whether it seems worth figuring out how to implement.
It seems like the problem is:
I've been thinking of Case 2. It seems harder to establish "capable of distinguishing between situations where the user wants A vs B" on individual examples, since a random classifier would let you cherry-pick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. I agree that there's some implicit "all else being equal" condition, but I'd expect that currently it's not too likely to change conclusions. Ideally you'd just have the categories A="best answer according to the user" and B="all answers that are worse than the best answer according to the user", but I think it's simpler to analyze more specific categories.
Link to contractor instructions implied in "You can read the instructions given to our contractors here" is missing.
I don't think all work of that form would measure misalignment, but some work of that form might. Here's a description of some stuff in that space that would count as measuring misalignment.
Let A be some task (e.g. adding 1-digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. adding 3-digit numbers), M be the original model, and M1 be the model after finetuning.
If the training on a downstream task was minimal, so that we think it's revealing what the model knew before finetuning rather than adding new knowledge, then better performance of M1 than M on A would demonstrate misalignment (I don't have a precise definition of what would make finetuning minimal in this way; it would be good to have clearer criteria for that).
If M1 does better on B after finetuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the finetuned M1 is misaligned (I think we might expect some version of this to happen by default though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn't get worse at any tasks then I might hope that it would get better on A and be disappointed if it doesn't).
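The two comparisons above can be sketched as a simple check over measured accuracies. This is only an illustration: the function name, the labels, and the `margin` parameter are hypothetical, and the first reading applies only if the finetuning really was minimal in the sense described.

```python
from typing import List


def diagnose(acc_m_a: float, acc_m1_a: float,
             acc_m_b: float, acc_m1_b: float,
             margin: float = 0.05) -> List[str]:
    """Compare M (original) and M1 (finetuned) on tasks A and B.

    margin is a hypothetical tolerance for treating a gap as a real improvement.
    """
    findings = []
    improved_on_a = acc_m1_a > acc_m_a + margin
    improved_on_b = acc_m1_b > acc_m_b + margin
    if improved_on_a:
        # Only meaningful if finetuning was minimal (revealing, not teaching).
        findings.append("M_may_be_misaligned_on_A")
    if improved_on_b and not improved_on_a:
        # Better at B (which needs A) without doing better on explicit A.
        findings.append("M1_may_be_misaligned_on_A")
    return findings


# Made-up accuracies, purely illustrative.
case_one = diagnose(acc_m_a=0.6, acc_m1_a=0.9, acc_m_b=0.2, acc_m1_b=0.8)
case_two = diagnose(acc_m_a=0.6, acc_m1_a=0.6, acc_m_b=0.2, acc_m1_b=0.8)
```

In practice the hard part is everything this leaves out: deciding whether the finetuning was minimal, and ruling out M1 simply overfitting to tasks of type B.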
Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information through internet searches that are recorded and readable by the overseer, instead of using knowledge opaquely stored in the weights, that seems like a step in the right direction.