Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.
I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.
I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.
Some main activities I’d do if I were a black-box LM investigator are:
The skills you’d gain seem like they have a few different applications to alignment:
My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.
Maybe a useful way to get feedback on how good you are at doing this would be trying to make predictions based on your experience with language models:
Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
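A question like the 50-versus-100-token one can be turned into a small measurement harness. The following is only a rough sketch under stated assumptions: `complete` stands for whatever function sends a prompt to your model and returns its completion (it is hypothetical, not part of any real API), distance is counted in whitespace-separated filler words as a crude proxy for tokens, and `toy_model` is a made-up stand-in so the script runs without a real model.

```python
import random
from typing import Callable, Sequence

# Arbitrary filler vocabulary (an assumption; any neutral text would do).
FILLER_WORDS = ["the", "river", "was", "quiet", "and", "slow", "that", "morning"]
NAMES = ["Alice", "Bram", "Chidi"]


def build_prompt(name: str, distance: int, rng: random.Random) -> str:
    """Put `distance` filler words between the name and the point where it must be repeated."""
    filler = " ".join(rng.choice(FILLER_WORDS) for _ in range(distance))
    return f"My friend is called {name}. {filler}. The name of my friend is"


def recall_accuracy(
    complete: Callable[[str], str],
    names: Sequence[str],
    distance: int,
    trials: int,
    seed: int = 0,
) -> float:
    """Fraction of trials in which the completion starts with the correct name."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        name = rng.choice(names)
        completion = complete(build_prompt(name, distance, rng))
        hits += completion.strip().startswith(name)
    return hits / trials


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: 'forgets' the name in long prompts."""
    words = prompt.split()
    name = words[4].rstrip(".")  # the name is the fifth word of the template
    return f" {name}" if len(words) < 90 else " someone"


for distance in (50, 100):
    accuracy = recall_accuracy(toy_model, NAMES, distance, trials=20)
    print(f"distance={distance}: accuracy={accuracy:.2f}")
```

To use it for real, replace `toy_model` with a function that calls the model you are studying, and count distance with that model's tokenizer rather than whitespace words.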
Curated. I've heard a few offhand comments about this type of research work in the past few months, but wasn't quite sure how seriously to take it.
I like this writeup for spelling out details of why black-box investigators might be useful, what skills the work requires, and how you might go about it.
I expect this sort of skillset to have major limitations, but I think I agree with the stated claims that it's a useful skillset to have in conjunction with other techniques.
One general piece of advice is that it seems like it might be useful to have an interface that shows you multiple samples for each prompt. (The OpenAI playground just gives you one sample; if you use temperature > 0, that sample could be either lucky or unlucky.)
Yeah I wrote an interface like this for personal use, maybe I should release it publicly.
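For readers who want to try the multiple-samples idea without a dedicated interface, here is a minimal sketch. It is not the interface mentioned above. It assumes you have some function `sample` that takes a prompt and returns one completion drawn at temperature > 0; that function, along with `sample_many`, `show`, and `toy_sampler`, is a hypothetical name introduced for illustration.

```python
import random
from collections import Counter
from typing import Callable


def sample_many(sample: Callable[[str], str], prompt: str, n: int = 10) -> Counter:
    """Draw n completions for one prompt and tally how often each one appears."""
    return Counter(sample(prompt).strip() for _ in range(n))


def show(prompt: str, counts: Counter) -> None:
    """Print completions from most to least frequent, with their counts."""
    total = sum(counts.values())
    print(f"Prompt: {prompt!r}")
    for completion, count in counts.most_common():
        print(f"  {count}/{total}  {completion!r}")


# Hypothetical stand-in for a real model call, so the sketch runs on its own.
_rng = random.Random(0)


def toy_sampler(prompt: str) -> str:
    return _rng.choice([" Paris", " Paris", " Paris", " Lyon"])


counts = sample_many(toy_sampler, "The capital of France is", n=20)
show("The capital of France is", counts)
```

Seeing the whole tally makes it easier to tell whether a surprising completion is typical of the model or just one unlucky draw.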