That definitely seems like a useful thing to measure! I looked into an example here: https://www.alignmentforum.org/posts/22GrdspteQc8EonMn/a-very-crude-deception-eval-is-already-passed
Instruction-following davinci model. No additional prompt material.
Yeah, I think you need some assumptions about what the model is doing internally.
I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'.
Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?
Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.
It seems plausible to me that there are cases where you can't get the model to do X by finetuning/prompt engineering, even if the model 'knows' X enough to be able to use it in plans. Something like: the part of its cognition that's solving X isn't 'hooked up' to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any 'knowledge' that can be used to help you achieve stuff, but which is subconscious: your linguistic self can't report it directly (and further, you can't train yourself to be able to report it).
You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.
Related to the call for research on evaluating alignment
Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):
Finetune a language model to report the activation of a particular neuron in text form.
E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.
We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output. However, at lower layers it should be able to do this correctly, with some amount of finetuning.
How many layers do you have to go down before the model succeeds? How does this scale with (a) model size and (b) amount of training?
One subtlety is that finetuning might end up changing that neuron's activation. To avoid this, we could do something like:
- Run the base model on the sentence
- Train the fine-tuned model to report the activation of the neuron in the base model, given the sentence
- Note whether the activation in the finetuned model is different
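To make the data side of this concrete, here is a minimal sketch of how the finetuning targets might be built. It assumes the base-model activations have already been recorded as plain floats (no real model is run here), and it assumes the 1-10 label is chosen by quantile bucket, which is just one possible reading of "a number from 1-10 that reflects the activation". All function names are hypothetical.

```python
from bisect import bisect_right


def make_bucket_edges(activations, n_buckets=10):
    """Return n_buckets - 1 cut points at equally spaced quantiles.

    Assumption: labels are quantile buckets of the base model's activations.
    Requires a non-empty list of activations.
    """
    ordered = sorted(activations)
    n = len(ordered)
    return [ordered[(i * n) // n_buckets] for i in range(1, n_buckets)]


def activation_to_label(value, edges):
    """Map a raw activation to an integer label in 1..len(edges) + 1."""
    return bisect_right(edges, value) + 1


def build_examples(sentences, base_activations, n_buckets=10):
    """Pair each sentence with the text label the finetuned model should emit.

    The targets come from the *base* model's activations, so the labels stay
    fixed even if finetuning shifts the neuron in the finetuned copy.
    """
    edges = make_bucket_edges(base_activations, n_buckets)
    return [
        (sentence, str(activation_to_label(act, edges)))
        for sentence, act in zip(sentences, base_activations)
    ]


def mean_abs_drift(base_activations, finetuned_activations):
    """Mean absolute difference between base and finetuned activations.

    A rough way to "note whether the activation in the finetuned model
    is different" across a set of sentences.
    """
    pairs = list(zip(base_activations, finetuned_activations))
    return sum(abs(b - f) for b, f in pairs) / len(pairs)
```

Repeating this for neurons at different depths, and comparing how accurately the finetuned model reproduces the labels, would give the "how many layers down" curve. Other labelling schemes (e.g. linear bins over the activation range) would work equally well as a starting point.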
Why I think this is interesting:
I often round off alignment to 'build a model that tells us everything it “knows”'. It's useful to determine what the pragmatic limits on this are. In particular, it's useful for current alignment research to be able to figure out what our models “know” or don't “know”, and this experiment helps with that. It gives us more information about when 'we tried finetuning the model to tell us X but it didn't work' means 'the model doesn't know X', versus when the model may have a neuron that fires for X but is unable to report it in text.
@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?
(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)
You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models.
I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.
"Even if actively trying to push the field forward full-time I'd be a small part of that effort"
I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.
I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community.
We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there.
Does that make sense?
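As a toy illustration of that split (learn the prior, learn the likelihood, let the machine do the update over all of D), here is a rough sketch. The `log_prior` and `log_likelihood` callables stand in for whatever learned models of the human would be used; they are hypothetical, as is the simplifying assumption that the items in D are independent given the hypothesis.

```python
import math


def posterior(hypotheses, log_prior, log_likelihood, data):
    """Approximate 'what the human posterior would be after seeing all of D'.

    log_prior(h)         -> stand-in for the learned human prior over h
    log_likelihood(d, h) -> stand-in for the learned human likelihood of d given h
    Assumption: data items are conditionally independent given h, so the
    log-likelihoods simply add up.
    """
    log_scores = {
        h: log_prior(h) + sum(log_likelihood(d, h) for d in data)
        for h in hypotheses
    }
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalised = {h: math.exp(s - top) for h, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}
```

The human only ever has to supply judgements on single hypotheses and single data points; the sum over all of D is the computationally intensive part that gets handed to the ML system.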