[AN #147]: An overview of the interpretability landscape

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers (Peter Hase and Owen Shen) (summarized by Rohin): This is basically 3 months' worth of Alignment Newsletters focused solely on interpretability, wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.

The theory and conceptual clarity of the field of interpretability have improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.

There have been lots of proposals for how to evaluate interpretability methods, leading to the problem of too many standards. The authors speculate that this is because both “methods” and “evaluation” papers don’t have sufficient clarity on what research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.

For specific methods for achieving interpretability, at a high level, there has been clear progress. There are cases where we can:

1. identify concepts that certain neurons represent,

2. find feature subsets that account for most of a model's output,

3. find changes to data points that yield requested model predictions,

4. find training data that influences individual test time predictions,

5. generate natural language explanations that are somewhat informative of model reasoning, and

6. create somewhat competitive models that are inherently more interpretable.

There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.

Rohin's opinion: This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

TuringAdvice: A Generative and Dynamic Evaluation of Language Use (Rowan Zellers et al) (summarized by Rohin): There are two main ways in which current NLP models are evaluated: quality of generations (how sensible the generated language looks), and correctness (given some crisp question or task, does the model output the right answer). However, we often care about using models for tasks in which there is no literally correct answer. This paper introduces an evaluation method for this setting: TuringAdvice. Models are presented with a situation in which a human is asking for advice, and the model must provide a helpful response. To score models, the resulting responses are compared against good human responses. The model’s response is successful if its advice is at least as helpful to the advice-seeker as human-written advice.

The authors collect a dataset of situations from Reddit, and for the human-written advice they take the most upvoted top-level comment on the post. A finetuned T5 model achieves a score of 14%, while prompted GPT-3 achieves a score of 4%. In contrast, taking the second-most upvoted top-level comment would give a score of 41%, and a model that gave advice about as good as the typical best advice from a human would get 50%. The paper also presents several qualitative failures in which the models seem to have significant misunderstandings of the situation (though I can’t tell how cherry-picked these are).
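To make the scoring rule concrete, here is a minimal Python sketch. It is not the authors' evaluation code. It assumes that each evaluated situation yields a single boolean judgment from human raters (was the model's advice at least as helpful as the reference human advice?). The function name, the input format, and the example judgments are all hypothetical.

```python
from typing import Sequence


def turing_advice_score(model_at_least_as_helpful: Sequence[bool]) -> float:
    """Fraction of situations where the model's advice was judged at least
    as helpful as the reference human-written advice.

    The one-boolean-per-situation input format is an assumption made for
    this illustration; the paper's actual rating protocol may differ.
    """
    if not model_at_least_as_helpful:
        raise ValueError("need at least one judgment")
    return sum(model_at_least_as_helpful) / len(model_at_least_as_helpful)


# Hypothetical judgments for 8 situations (made-up data, not from the paper).
judgments = [True, False, False, False, True, False, False, False]
score = turing_advice_score(judgments)
```

Under this reading, advice that is about as good as the reference advice would win roughly half of the comparisons, which matches the 50% figure quoted above.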

Rohin's opinion: I really like the fact that we’re studying a fuzzy task directly, and using human evaluation to determine how well the model performs. (Though note that this is not the first benchmark to do so.)

Recursive Classification: Replacing Rewards with Examples in RL (Benjamin Eysenbach et al) (summarized by Rohin): Previous work has suggested learning a reward model from examples of successfully solving the task. This paper suggests that rather than a two-stage process of learning a reward model and then optimizing it using RL, we can instead directly learn a policy from the examples by building an equivalent of Bellman backups that apply directly to examples (rather than having to go through intermediate rewards). Their experiments show that this works well.
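As a rough intuition pump only, the sketch below shows what a Bellman-style backup written in terms of success examples could look like in a tiny tabular setting. This is not the paper's algorithm: the chain environment, the constants, and the update rule are assumptions made for illustration. In a tabular toy the success examples reduce to a known set of states, which sidesteps the hard part the paper addresses (learning from examples with function approximation).

```python
import numpy as np

# Toy chain MDP (hypothetical, for illustration only): states 0..4,
# actions 0 = left, 1 = right, deterministic moves clipped at the ends.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
success_examples = {4}  # states that appear as "success examples"


def step(state: int, action: int) -> int:
    """Deterministic transition of the toy chain."""
    delta = 1 if action == 1 else -1
    return min(max(state + delta, 0), N_STATES - 1)


# C[s, a] approximates a discounted measure of being in a success state
# now or in the future, when acting greedily afterwards.
C = np.zeros((N_STATES, N_ACTIONS))
for _ in range(500):
    new_C = np.empty_like(C)
    for s in range(N_STATES):
        is_success = 1.0 if s in success_examples else 0.0
        for a in range(N_ACTIONS):
            # Backup expressed through membership in the success examples,
            # followed by the best value at the next state.
            new_C[s, a] = (1 - GAMMA) * is_success + GAMMA * C[step(s, a)].max()
    C = new_C

# Greedy policy with respect to the learned table.
greedy_policy = C.argmax(axis=1)
```

In this toy, the greedy policy moves toward the example state from everywhere on the chain, which is the behavior one would hope a policy learned from success examples would show.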

OTHER PROGRESS IN AI

DEEP LEARNING

Branch Specialization (Chelsea Voss et al) (summarized by Rohin): Neural network architectures sometimes have different “branches”, where later features can depend on earlier features within the same branch, but cannot depend on features in parallel branches. This post presents evidence showing that in these architectures, branches often tend to specialize in particular types of features. For example:

1. The first two layers in AlexNet are split into two branches. In one branch, the first layer tends to learn black and white Gabor filters, while in the other branch, the first layer tends to learn low-frequency color detectors. This persists across retraining, or even training on a different dataset of natural images, such as Places (rather than ImageNet).

2. In InceptionV1, all 9 of the black and white vs. color detectors in mixed3a are in mixed3a_5x5 (p < 1e-8). All 30 of the curve-related features in mixed3b are in mixed3b_5x5 (p < 1e-20). There are confounds here, but also good reasons to expect that it is in fact branch specialization.
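As a rough sanity check on the orders of magnitude quoted in the list above, one can ask how likely it would be for all of the relevant features to land in a single branch if they were scattered uniformly at random across the layer. The sketch below does this with a simple combinatorial calculation. The channel counts are assumptions taken from the standard InceptionV1 (GoogLeNet) architecture rather than from the post, and the post may have computed its p-values differently.

```python
from math import comb


def prob_all_in_branch(n_units: int, branch_units: int, n_features: int) -> float:
    """Chance that all n_features units fall in one branch if they were
    placed uniformly at random, without replacement, among n_units."""
    return comb(branch_units, n_features) / comb(n_units, n_features)


# Channel counts below are assumptions taken from the standard
# InceptionV1 (GoogLeNet) architecture, not from the summary above.
p_mixed3a = prob_all_in_branch(n_units=256, branch_units=32, n_features=9)
p_mixed3b = prob_all_in_branch(n_units=480, branch_units=96, n_features=30)
print(f"mixed3a: {p_mixed3a:.1e}, mixed3b: {p_mixed3b:.1e}")
```

Under these assumptions, both probabilities come out below the thresholds quoted in the list.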

Given that branch specialization seems to be robust and consistent even across datasets, a natural hypothesis is that it is reflecting a structure that already exists. Even if you didn’t have branching, it seems likely that the model would still learn very similar neurons, and it seems plausible that e.g. the weights connecting the first-layer black-and-white Gabor filters to the second-layer color detectors are effectively zero. With branching, you learn the same features in such a way that all the weights that previously were effectively zero now don’t exist because they would be crossing branches. This would look like having the Gabor filters in one branch and the color detectors in the other branch.
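The equivalence this hypothesis relies on can be checked numerically: a layer whose cross-branch weights are exactly zero computes the same thing as two separate branches. The sketch below uses a single linear layer with made-up sizes and random weights. It illustrates the argument and is not code from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features and 4 output features per branch.
W_branch_a = rng.normal(size=(4, 4))
W_branch_b = rng.normal(size=(4, 4))

# Unbranched layer whose cross-branch weights are exactly zero.
W_full = np.zeros((8, 8))
W_full[:4, :4] = W_branch_a
W_full[4:, 4:] = W_branch_b

x = rng.normal(size=8)

# Unbranched computation.
y_full = W_full @ x
# Branched computation: each branch only sees its own half of the input.
y_branched = np.concatenate([W_branch_a @ x[:4], W_branch_b @ x[4:]])
```

The two outputs agree, so imposing branches costs nothing if training would have driven the cross-branch weights to (effectively) zero anyway.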

Rohin's opinion: I find the hypothesis the authors propose quite compelling (and this is very similar to the hypothesis that neural networks tend to be modular, which we discuss more below). Partly, this is because it has a common-sense explanation: when designing an organization, you want to put related functions in the same group to minimize the communication across groups. Here, the full network is the organization, the branches are an explicit constraint on communication, and so you want to put related functions in the same branch.

At the end of the article, the authors also suggest that there could be a connection with the way that different regions of the brain are specialized to particular tasks. I’ll go further than the authors in my speculation: it seems plausible to me that this specialization is simply the result of the brain’s learning algorithm reflecting the structure of the world through specialization. (Though it seems likely that the different areas of the brain must at least have different “architectures”, in order for the same tasks to be routed to the same brain regions across humans.) But the case of AlexNet demonstrates that in theory, the only thing you need for specialization to arise is a restriction on the communication between one part of the architecture and the other.

Clusterability in Neural Networks (Daniel Filan, Stephen Casper, Shlomi Hod et al) (summarized by Zach): Neural networks are often construed as lacking internal structure. In this paper, the authors challenge the predominant view and hypothesize that neural networks are more clusterable than is suggested by chance. To investigate the claim, the authors partition the network into groups where most of the edge weight is between neurons in the same group. The authors find that the quality of these groups improves after training, as compared to randomly initialized networks. However, this only holds for certain training setups. Despite this limitation, the authors show it's possible to promote clusterability with little to no effect on accuracy.

In experiments, the authors compare the clusterability of trained networks to randomly initialized networks and trained networks with shuffled weights. They focus on multi-layer perceptrons (MLPs) and convolutional networks with dropout regularization. They also run experiments with pruned networks, that is, networks where 'unimportant' edges are removed. They find that MLP networks have clusterable neurons at rates higher than chance, but have mixed results for convolutional networks.
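To give a feel for what "most of the edge weight is between neurons in the same group" can mean quantitatively, here is a minimal sketch that scores a given partition of an MLP's neurons with a normalized cut. The choice of normalized cut, the helper names, and the toy weights are assumptions for illustration. The summary does not name the metric, and the sketch does not reproduce how the paper finds its partitions.

```python
import numpy as np


def mlp_adjacency(weights: list) -> np.ndarray:
    """Treat an MLP as an undirected weighted graph: one node per neuron,
    edge weight = absolute value of the connecting weight.
    Each weight matrix is assumed to have shape (n_in, n_out)."""
    sizes = [weights[0].shape[0]] + [w.shape[1] for w in weights]
    offsets = np.cumsum([0] + sizes)
    adj = np.zeros((offsets[-1], offsets[-1]))
    for k, w in enumerate(weights):
        rows = slice(offsets[k], offsets[k + 1])
        cols = slice(offsets[k + 1], offsets[k + 2])
        adj[rows, cols] = np.abs(w)
        adj[cols, rows] = np.abs(w).T
    return adj


def normalized_cut(adj: np.ndarray, labels: np.ndarray) -> float:
    """Sum over clusters of (edge weight leaving the cluster) divided by
    (total edge weight touching the cluster). Lower = more clusterable."""
    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        volume = adj[inside].sum()
        cut = adj[np.ix_(inside, ~inside)].sum()
        if volume > 0:
            total += cut / volume
    return total


# Toy 2-2-2 MLP made of two disconnected chains (made-up weights).
adj = mlp_adjacency([np.eye(2), np.eye(2)])
aligned = normalized_cut(adj, np.array([0, 1, 0, 1, 0, 1]))
misaligned = normalized_cut(adj, np.array([0, 0, 1, 1, 0, 1]))
```

A partition that follows the two chains has no edge weight crossing groups, while a partition that mixes them scores worse. Comparing such scores between a trained network and a copy with shuffled weights is the kind of baseline comparison described above.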

The authors hypothesize that clusterability is more likely to arise when different features of the input can be computed in parallel without communication between the features (which is very similar to the hypothesis in the previous paper). To test the hypothesis, they combine examples from the datasets into pairs and then train the neural network to make a double-prediction in a side-by-side setup. Intuitively, the network would need to look at each half of the pair separately, without any need to combine information across the two sides. They find that this setup results in increased modularity.
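A side-by-side dataset of this kind could be built along the following lines. This is a sketch under assumptions: the function name, the random pairing scheme, and the stand-in data are hypothetical, and the paper's exact construction may differ.

```python
import numpy as np


def make_side_by_side(images: np.ndarray, labels: np.ndarray,
                      rng: np.random.Generator):
    """Pair each example with a randomly chosen partner, concatenate the two
    images along the width axis, and keep both labels as the target."""
    partner = rng.permutation(len(images))
    paired_images = np.concatenate([images, images[partner]], axis=2)
    paired_labels = np.stack([labels, labels[partner]], axis=1)
    return paired_images, paired_labels


# Made-up stand-in data: 6 grayscale "images" of shape 28x28, 10 classes.
rng = np.random.default_rng(0)
images = rng.random((6, 28, 28))
labels = rng.integers(0, 10, size=6)
paired_images, paired_labels = make_side_by_side(images, labels, rng)
```

Each paired input is twice as wide as the originals and carries two labels, one per side, so the two predictions can in principle be computed by independent sub-networks.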

Zach's opinion: The experiments in this paper are well-designed. In particular, I found the side-by-side experiment setup to be a clever way to test the ideas presented in the paper. Having said that, the actual results from the experiments are mixed. The paper's strongest results are for the clusterability of pruned networks, while evidence for the clusterability of convolutional networks is quite mixed. However, pruning is not common practice. Additionally, in an intuitive sense, pruning a network seems as though it could be defined in terms of clusterability notions, which limits my enthusiasm for that result.

Rohin's opinion: I feel like there are quite a few interesting next questions for the study of modularity in neural networks:

1. Does modularity become more obvious or pronounced as we get to larger models and more complex and realistic data?

2. How well do networks cluster when you are looking for hundreds or thousands of clusters (rather than the 12 clusters considered in this paper)?

3. To what extent is modularity a result of any training, vs. a result of the specific data being trained on? If you train a network to predict random labels, will it be less modular as a result?

One challenge here is in ensuring that the clustering algorithm used is still accurately measuring modularity in these new settings. Another challenge is whether networks are more modular just because in a bigger model there are more chances to find good cuts. (In other words, what's the default to which we should be comparing?) For example, the paper does present some results with models trained on ImageNet, and ResNet-50 gets one of the highest clusterability numbers in the paper. But the authors do mention that the clustering algorithm was less stable, and it's not clear how exactly to interpret this high clusterability number.

Weight Banding (Michael Petrov et al) (summarized by Rohin): Empirically, when training neural networks on ImageNet, we can commonly observe “weight banding” in the final layer. In other words, the neurons in the final layer pay very strong attention to the vertical position of features, and ignore the horizontal position of features. This holds across InceptionV1, ResNet50, and VGG19, though it doesn’t hold for AlexNet.

If you rotate the training data by 90 degrees, then the phenomenon changes to have vertical striping, that is, the final-layer neurons now pay strong attention to the horizontal position of features. This suggests that this phenomenon is being driven somehow by the ImageNet data.
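One simple way to put a number on "banding" is to ask how much of a kernel's variance is explained by the row index alone. The metric below is a hypothetical illustration, not something taken from the article, and the kernel values are synthetic. Rotating the synthetic kernel stands in for the effect that rotating the training data is reported to have on the learned weights.

```python
import numpy as np


def row_variance_fraction(kernel: np.ndarray) -> float:
    """Share of a 2D kernel's variance explained by the row (vertical)
    index alone. Near 1: horizontal bands. Near 0: no row structure."""
    total = kernel.var()
    if total == 0:
        return float("nan")
    return float(kernel.mean(axis=1).var() / total)


# Synthetic 7x7 kernel with perfect horizontal bands (made-up values).
banded = np.tile(np.array([1.0, -1.0, 2.0, 0.0, -2.0, 1.0, 0.5])[:, None], (1, 7))
# Rotating by 90 degrees turns horizontal bands into vertical stripes.
striped = np.rot90(banded)

banded_score = row_variance_fraction(banded)
striped_score = row_variance_fraction(striped)
```

On the banded kernel the row index explains essentially all of the variance, and on the rotated kernel it explains essentially none.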

The authors hypothesize that this is caused by the neural network needing to recover some spatial information that was reduced by the previous average pooling layer (which is not present in AlexNet). They try removing this layer, which causes the effect to go away in Inception, but not in VGG19. They seem to think that it also goes away in ResNet50, but when I look at the results, it seems like the phenomenon is still there (though not as strongly as before).

They try a bunch of other architectural interventions on a simplified architecture and find that weight banding persists across all of these.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.