The "Zoom In" work is aimed at understanding what's going on in neural networks as a scientific question, not directly tackling mesa-optimization. This work is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.

As the original article put it:

And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Or, as I put it in Embedded Curiosities:

One downside of disc


Zoom In: An Introduction to Circuits

by Evan Hubinger, 10th Mar 2020



Chris Olah and the rest of the OpenAI Clarity team just published “Zoom In: An Introduction to Circuits,” a Distill article about some of the transparency research they've been doing, which I think is very much worth a look. I'll go over some of my particular highlights here, but I highly recommend reading the full article.

Specifically, I have previously written about Chris's belief that the field of machine learning should be more like the natural sciences in seeking understanding first and foremost. I think “Zoom In” is a big step towards making something like that a reality, as it provides specific, concrete, testable claims about neural networks upon which you might actually be able to build a field. The three specific claims presented in the article are:

Claim 1: Features

Features are the fundamental unit of neural networks. They correspond to directions [in the space of neuron activations]. These features can be rigorously studied and understood.

Claim 2: Circuits

Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.

Claim 3: Universality

Analogous features and circuits form across models and tasks.
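To make the first claim concrete, here is a minimal sketch of what "a feature is a direction in activation space" could mean in code. It is my own toy illustration, not code from “Zoom In”: the three-neuron layer, the activation values, and the helper name `feature_activation` are all assumptions made up for the example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def feature_activation(activations, direction):
    """Read a feature off as the projection of a layer's activation
    vector onto a (normalized) direction in activation space."""
    return dot(activations, normalize(direction))

# Hypothetical activations of a 3-neuron layer on one input.
activations = [2.0, 0.0, 1.0]

# A feature aligned with a single neuron, and one spread across two neurons.
single_neuron_feature = [1.0, 0.0, 0.0]
mixed_feature = [1.0, 0.0, 1.0]

print(feature_activation(activations, single_neuron_feature))
print(feature_activation(activations, mixed_feature))
```

A feature aligned with one neuron is just that neuron's activation; the point of talking about directions is that a feature may also be a combination of several neurons, as in the second case.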

“Zoom In” provides in-depth justification and examples for each of these claims, which I will mostly leave to the article itself. Some highlights, however:

  • How do convolutional neural networks (CNNs) detect dogs in an orientation-invariant way? It turns out they pretty consistently separately detect leftward-facing and rightward-facing dogs, then union the two together.
  • How do CNNs detect foreground-background boundaries? It turns out they use high-low frequency detectors—which look for high-frequency patterns on one side and low-frequency patterns on the other side—in a bunch of different possible orientations.
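The first highlight above can be sketched as a toy "union" circuit: a downstream unit that has positive weights to two separate upstream features and so fires when either is present. This is only an illustration under assumed values; the weights, the bias, and the name `dog_detector` are hypothetical and are not taken from any real network.

```python
def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)

def dog_detector(left_facing, right_facing, w_left=1.0, w_right=1.0, bias=0.0):
    """Toy union circuit: a downstream unit excited by either of two
    upstream features (left-facing dog, right-facing dog)."""
    return relu(w_left * left_facing + w_right * right_facing + bias)

# Upstream feature activations for three hypothetical images.
print(dog_detector(1.0, 0.0))  # left-facing dog present
print(dog_detector(0.0, 1.0))  # right-facing dog present
print(dog_detector(0.0, 0.0))  # no dog
```

The downstream unit responds to a dog in either orientation even though neither upstream feature is orientation-invariant on its own.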

What's particularly nice about the three claims in “Zoom In,” in my opinion, is that they give other researchers a foundation to build upon. Once it's established that neural networks have meaningful features and circuits in them, discovering new such circuits becomes a legitimate scientific endeavor, especially if, as the third claim suggests, those features and circuits are universal across many different networks. From “Zoom In”:

One particularly challenging aspect of being in a pre-paradigmatic field is that there isn’t a shared sense of how to evaluate work in interpretability. There are two common proposals for dealing with this, drawing on the standards of adjacent fields. Some researchers, especially those with a deep learning background, want an “interpretability benchmark” which can evaluate how effective an interpretability method is. Other researchers with an HCI background may wish to evaluate interpretability methods through user studies.

But interpretability could also borrow from a third paradigm: natural science. In this view, neural networks are an object of empirical investigation, perhaps similar to an organism in biology. Such work would try to make empirical claims about a given network, which could be held to the standard of falsifiability.

Why don’t we see more of this kind of evaluation of work in interpretability and visualization? Especially given that there’s so much adjacent ML work which does adopt this frame! One reason might be that it’s very difficult to make robustly true statements about the behavior of a neural network as a whole. They’re incredibly complicated objects. It’s also hard to formalize what the interesting empirical statements about them would, exactly, be. And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Circuits side steps these challenges by focusing on tiny subgraphs of a neural network for which rigorous empirical investigation is tractable. They’re very much falsifiable: for example, if you understand a circuit, you should be able to predict what will change if you edit the weights. In fact, for small enough circuits, statements about their behavior become questions of mathematical reasoning. Of course, the cost of this rigor is that statements about circuits are much smaller in scope than overall model behavior. But it seems like, with sufficient effort, statements about model behavior could be broken down into statements about circuits. If so, perhaps circuits could act as a kind of epistemic foundation for interpretability.
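The falsifiability point in that last paragraph (if you understand a circuit, you should be able to predict what changes when you edit the weights) can be sketched as a small test. Everything here is a toy assumption of mine: the two-feature circuit, its weights, and the names `unit_output`, `left_dog`, and `right_dog` are hypothetical. In a real network, `observed` would come from actually running the edited model rather than from a two-line function, so the check would not be true by construction as it is here.

```python
def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)

def unit_output(weights, features):
    """Downstream unit: ReLU of the weighted sum of upstream feature activations."""
    return relu(sum(weights[name] * features.get(name, 0.0) for name in weights))

# Hypothetical circuit weights, chosen for illustration only.
weights = {"left_dog": 1.0, "right_dog": 1.0}

# Probe inputs that activate one upstream feature at a time.
probes = {
    "left_only": {"left_dog": 1.0},
    "right_only": {"right_dog": 1.0},
}

# Prediction written down BEFORE the edit, from our claimed understanding:
# cutting the right_dog weight should silence the unit on right-facing dogs
# and leave its response to left-facing dogs unchanged.
predicted = {"left_only": 1.0, "right_only": 0.0}

# Edit the weights, then observe what the circuit actually does.
edited = dict(weights, right_dog=0.0)
observed = {name: unit_output(edited, feats) for name, feats in probes.items()}

# If observed and predicted disagree, the claimed understanding is falsified.
hypothesis_survives = observed == predicted
print(observed, hypothesis_survives)
```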

I, for one, am very excited about circuits as a direction for building up an understanding-focused interpretability field, and I want to congratulate Chris and the rest of OpenAI Clarity for doing the hard foundational work needed to start building a real field around neural network interpretability.
