Tuesday, September 21st 2021
Beth Barnes
WHEN CAN MODELS REPORT THEIR ACTIVATIONS?

Related to the call for research on evaluating alignment.

Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):

Finetune a language model to report the activation of a particular neuron in text form. E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.

We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output. However, at lower layers it should be able to do this correctly, with some amount of finetuning.

How many layers do you have to go down before the model succeeds? How does this scale with (a) model size and (b) amount of training?

One subtlety is that finetuning might end up changing that neuron's activation. To avoid this, we could do something like:
- Run the base model on the sentence
- Train the finetuned model to report the activation of the neuron in the base model, given the sentence
- Note whether the activation in the finetuned model is different

Why I think this is interesting: I often round off alignment to 'build a model that tells us everything it "knows"'. It's useful to determine what the pragmatic limits on this are. In particular, it's useful for current alignment research to be able to figure out what our models "know" or don't "know", and this is helpful for that. It gives us more information about when 'we tried finetuning the model to tell us X but it didn't work' means 'the model doesn't know X', versus when the model may have a neuron that fires for X but is unable to report it in text.
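To make the three-step procedure above a bit more concrete, here is a minimal sketch of the data-preparation side only. It assumes the base-model activations for the chosen neuron have already been extracted (for example with a forward hook; not shown), and it assumes the 1-10 target comes from equal-frequency (decile) bucketing, which is one possible choice and not something the proposal specifies. All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def make_bucket_edges(base_activations, n_buckets=10):
    """Cut points splitting the base-model activations into equal-frequency buckets.

    Assumption: decile bucketing; the proposal only asks for "a number from 1-10".
    """
    return quantiles(base_activations, n=n_buckets)  # n_buckets - 1 cut points


def activation_to_label(activation, edges):
    """Map a raw activation to an integer label in 1..len(edges) + 1."""
    return bisect_right(edges, activation) + 1


def build_examples(sentences, base_activations, edges):
    """Pair each sentence with the text target the finetuned model should emit.

    Targets always come from the *base* model's activations, so they stay
    fixed even if finetuning shifts the neuron in the finetuned model.
    """
    return [
        {"prompt": sentence, "target": str(activation_to_label(act, edges))}
        for sentence, act in zip(sentences, base_activations)
    ]


def activation_drift(base_activations, finetuned_activations):
    """Mean absolute difference between base and finetuned activations.

    A simple way to "note whether the activation in the finetuned model
    is different" on the same sentences.
    """
    diffs = [abs(b - f) for b, f in zip(base_activations, finetuned_activations)]
    return sum(diffs) / len(diffs)
```

The resulting prompt/target pairs could then be fed to whatever finetuning setup is in use, and repeating the process for neurons at different layers would give the "how many layers down" curve the post asks about.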

