The case for becoming a black-box investigator of language models — AI Alignment Forum

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if having a big head start on it gave you a notable advantage.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some main activities I’d do if I were a black-box LM investigator are:

- Spend a lot of time with them. Write a bunch of text with their aid. Try to learn what kinds of quirks they have; what kinds of things they know and don’t know.
- Run specific experiments. Do they correctly complete “Grass is green, egg yolk is”? Do they know the population of San Francisco? (For many of these experiments, it seems worth running them on LMs of a bunch of different sizes.)
- Try to figure out where they’re using proxies to make predictions and where they seem to be making sense of text more broadly, by taking some text and seeing how changing it changes their predictions.
- Try to find adversarial examples where they see some relatively natural-seeming text and then do something really weird.
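To make the "run specific experiments" item concrete, here is a minimal sketch of running the same probes on models of several sizes and tallying which ones get them right. Everything named here (`Probe`, `run_probes`, `pass_rate`, and the idea that a model is just a function from prompt to completion) is an assumption for illustration, not an existing tool. The substring check is a deliberately crude pass criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption for this sketch: a "model" is any function from a prompt to its
# completion text. In practice it would wrap an API call or a local model.
Model = Callable[[str], str]


@dataclass
class Probe:
    prompt: str
    expected: str  # substring we hope to find in the completion


def run_probes(models: Dict[str, Model], probes: List[Probe]) -> List[dict]:
    """Run every probe on every model and record whether it passed."""
    rows = []
    for probe in probes:
        for name, model in models.items():
            completion = model(probe.prompt)
            rows.append({
                "model": name,
                "prompt": probe.prompt,
                "completion": completion,
                # Crude check: case-insensitive substring match.
                "passed": probe.expected.lower() in completion.lower(),
            })
    return rows


def pass_rate(rows: List[dict]) -> Dict[str, float]:
    """Fraction of probes each model passed."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for row in rows:
        totals[row["model"]] = totals.get(row["model"], 0) + 1
        passes[row["model"]] = passes.get(row["model"], 0) + int(row["passed"])
    return {name: passes[name] / totals[name] for name in totals}


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any real model.
    toy_models = {
        "small": lambda prompt: " green",
        "large": lambda prompt: " yellow",
    }
    probes = [Probe("Grass is green, egg yolk is", "yellow")]
    for row in run_probes(toy_models, probes):
        print(row["model"], repr(row["completion"]), row["passed"])
```

Keeping the full completion in each row, not just the pass/fail flag, matters here: the interesting findings are often in how a model gets something wrong.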

The skills you’d gain seem like they have a few different applications to alignment:

- As a language model interpretability researcher, I’d find it very helpful to talk to someone who had spent a long time playing with the models I work with (currently I’m mostly working with gpt2-small, which is a 12-layer model). In particular, it’s much easier to investigate the model when you have good ideas for behaviors you want to explain, and know some things about the model’s algorithm for doing such behaviors; I can imagine an enthusiastic black-box investigator being quite helpful for our research.
- I think that alignment research (as well as the broader world) might have some use for prompt engineers. Prompt engineering is kind of fiddly, and we at Redwood would have loved to consult with an outsider when we were doing some of it in our adversarial training project (see section 4.3.2 here).
- I’m excited for people working on “scary demos”, where we try to set up situations where our models exhibit tendencies which are the baby versions of the scary power-seeking/deceptive behaviors that we’re worried will lead to AI catastrophe. See for example Beth Barnes’s proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering.
- It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive (though I definitely wouldn’t want to rely on it to solve the whole problem). When we’re building actual AIs we probably need the red team to be assisted by various tools (AI-powered as well as non-AI-powered, e.g. interpretability tools). We’re working on building simple versions of these (e.g. our red-teaming software and our interpretability tools). But it seems pretty reasonable for some people to try to just do this work manually in parallel with us automating parts of it. And we’d find it easier to build tools if we had particular users in mind who knew exactly what tools they wanted.
  - Even transformatively intelligent and powerful AIs seem to me to be plausibly partially understandable by humans. E.g. it seems plausible to me that these systems will communicate with each other at least partially in natural language, or that they'll have persistent memory stores or cognitive workspaces which can be inspected.

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.
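As one illustration of the kind of simple tool I mean, the "change the text and see how the predictions change" experiment described earlier could be wrapped up roughly like this. This is a sketch under assumptions: `compare_edit` and its helpers are hypothetical names, and the model is assumed to be available as a function that maps a prompt to a dictionary of next-token probabilities. If that dictionary only holds the top few tokens, the distance below is an approximation.

```python
from typing import Callable, Dict, List, Tuple

# Assumption for this sketch: the model is exposed as a function from a
# prompt to a {token: probability} dictionary for the next token.
NextTokenProbs = Callable[[str], Dict[str, float]]


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two next-token distributions."""
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


def biggest_shifts(
    p: Dict[str, float], q: Dict[str, float], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Tokens whose probability changed most, as (token, q minus p)."""
    tokens = set(p) | set(q)
    deltas = [(t, q.get(t, 0.0) - p.get(t, 0.0)) for t in tokens]
    # Largest absolute change first; ties broken alphabetically.
    deltas.sort(key=lambda item: (-abs(item[1]), item[0]))
    return deltas[:top_k]


def compare_edit(
    model: NextTokenProbs, original: str, edited: str, top_k: int = 5
) -> dict:
    """Summarize how editing a prompt changes the next-token prediction."""
    p, q = model(original), model(edited)
    return {
        "tv_distance": total_variation(p, q),
        "shifts": biggest_shifts(p, q, top_k),
    }
```

A small distance after an edit that should matter (or a large one after an edit that shouldn't) is a hint that the model is leaning on a proxy, and a good place to dig further by hand.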