Biological neural networks (i.e. brains) and artificial neural networks have sufficient commonalities that it's often reasonable to treat our knowledge about one as a good starting point for reasoning about the other. So one way to predict how the field of neural network interpretability will develop is by looking at how neuroscience interprets the workings of human brains. I think there are several interesting things to be learned from this, but the one I'll focus on in this post is the concept of modularity: the fact that different parts of the brain carry out different functions. Neuroscientists have mapped many different skills (such as language use, memory consolidation, and emotional responses) to specific brain regions. Note that this doesn’t always give us much direct insight into how the skills themselves work - but it does make follow-up research into those skills much easier. I’ll argue that, for the purposes of AGI safety, this type of understanding may also directly enable important safety techniques.
What might it look like to identify modules in a machine learning system? Some machine learning systems are composed of multiple networks trained on different objective functions - which I’ll call architectural modularity. But what I’m more interested in is emergent modularity, where a single network develops modularity after training. Emergent modularity requires that the weights of a network give rise to a modular structure, and that those modules correspond to particular functions. We can think about this both in terms of high-level structure (e.g. a large section of a neural network carrying out a broad role, analogous to the visual system in humans) and in terms of lower-level structure, where a smaller module carries out more specific functions. (Note that this is a weaker definition than the one defended in philosophy by Fodor and others - for instance, the sets of neurons don’t need to contain encapsulated information.)
In theory, the neurons which make up a module might be distributed in a complex way across the entire network with only tenuous links between them. But in practice, we should probably expect that if these modules exist, we will be able to identify them by looking at the structure of connections between artificial neurons, similar to how it’s done for biological neurons. The first criterion (that the weights give rise to a modular structure) is captured in a definition proposed by Filan et al. (2021): a network is modular to the extent that it can be partitioned into sets of neurons where each set is strongly internally connected, but only weakly connected to other sets. They measure this by pruning the networks, then using graph-clustering algorithms, and provide empirical evidence that multi-layer perceptrons are surprisingly modular.
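To make the definition above concrete, here is a minimal sketch of scoring a partition of neurons by how strongly each set is connected internally versus externally. It is not Filan et al.'s implementation: the helper names (`prune_by_magnitude`, `normalized_cut`), the toy weights, and the choice of a normalized-cut score on absolute weights are all assumptions made for illustration, and the partition is supplied by hand rather than found by a clustering algorithm.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # (neuron in one layer, neuron in the next layer)


def prune_by_magnitude(weights: Dict[Edge, float], keep_fraction: float) -> Dict[Edge, float]:
    """Keep only the largest-magnitude fraction of connections (hypothetical helper)."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return dict(ranked[:n_keep])


def normalized_cut(weights: Dict[Edge, float], clusters: List[Set[int]]) -> float:
    """Sum over clusters of (weight leaving the cluster) / (total weight touching it).

    Lower values mean the partition is closer to the modular ideal:
    strong connections inside each set, weak connections between sets.
    """
    total = 0.0
    for cluster in clusters:
        cut = 0.0     # weight on edges with exactly one endpoint in the cluster
        volume = 0.0  # summed weighted degree of the cluster's neurons
        for (u, v), w in weights.items():
            w = abs(w)
            in_u, in_v = u in cluster, v in cluster
            if in_u:
                volume += w
            if in_v:
                volume += w
            if in_u != in_v:
                cut += w
        if volume > 0:
            total += cut / volume
    return total


# Toy two-layer network: neurons 0-3 feed neurons 4-7 (made-up weights).
toy_weights = {
    (0, 4): 1.0, (0, 5): 0.9, (1, 4): 0.8, (1, 5): 1.1,      # strong, group A
    (2, 6): 1.0, (2, 7): 0.9, (3, 6): 0.8, (3, 7): 1.1,      # strong, group B
    (0, 6): 0.05, (1, 7): 0.02, (2, 4): 0.03, (3, 5): 0.04,  # weak cross links
}
pruned = prune_by_magnitude(toy_weights, keep_fraction=0.75)

by_module = [{0, 1, 4, 5}, {2, 3, 6, 7}]  # candidate modules spanning both layers
by_layer = [{0, 1, 2, 3}, {4, 5, 6, 7}]   # a non-modular split, for comparison

print("n-cut, split by module:", normalized_cut(pruned, by_module))
print("n-cut, split by layer: ", normalized_cut(pruned, by_layer))
```

On this toy example the split by module scores far lower than the split by layer, which is the sense in which the first partition counts as "more modular". A real analysis would search for the partition with a graph-clustering algorithm instead of taking it as given.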
The next question is whether those modules correspond to internal functions. Although it’s an appealing and intuitive hypothesis, the evidence for this is currently mixed. On one hand, Olah et al.’s (2020) investigations find circuits which implement human-comprehensible functions. And insofar as we expect artificial neural networks to be similar to biological neural networks, the evidence from brain lesions in humans and other animals is compelling. On the other hand, Olah et al. also find evidence for polysemanticity in artificial neural networks: some neurons fire for multiple reasons, rather than having a single well-defined role.
If it does turn out to be the case that structural modules implement functional modules, though, that has important implications for safety research: if we know what types of cognition we’d like our agents to avoid, then we might be able to identify and remove the regions responsible for them. In particular, we could try to find modules responsible for goal-directed agency, or perhaps even ones which are used for deception. This seems like a much more achievable goal for interpretability research than the goal of “reading off” specific thoughts that the network is having. Indeed, as in humans, very crude techniques for monitoring neural activations may be sufficient to identify many modules. But doing so may be just as useful for safety as precise interpretability, or more so, because it allows us to remove underlying traits that we’re worried about merely by setting the weights in the relevant modules to zero - a technique which I’ll call module pruning.
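As a rough illustration of module pruning as described above, the sketch below zeroes every weight into and out of a chosen set of hidden neurons in a tiny bias-free ReLU network. The representation (nested lists, with `layers[l][i][j]` connecting input `j` to output `i`), the function names `forward` and `prune_module`, and the example weights are all assumptions made for this sketch; identifying which neurons form a module is taken as given.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]


def forward(layers: List[Matrix], x: List[float]) -> List[float]:
    """Bias-free MLP with ReLU on hidden layers and a linear output layer."""
    activations = x
    for depth, W in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) for row in W]
        is_last = depth == len(layers) - 1
        activations = pre if is_last else [max(0.0, p) for p in pre]
    return activations


def prune_module(layers: List[Matrix], module: Set[Tuple[int, int]]) -> List[Matrix]:
    """Return a copy of the network with the module's weights set to zero.

    `module` holds (layer, index) pairs, where layer l means the neurons
    produced by layers[l]. Both incoming and outgoing weights are zeroed.
    """
    pruned = [[row[:] for row in W] for W in layers]
    for layer, idx in module:
        pruned[layer][idx] = [0.0] * len(pruned[layer][idx])  # incoming weights
        if layer + 1 < len(pruned):
            for row in pruned[layer + 1]:
                row[idx] = 0.0                                # outgoing weights
    return pruned


# 2 inputs -> 3 hidden neurons -> 2 outputs (made-up weights).
layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]],
]
module = {(0, 2)}  # pretend hidden neuron 2 is the module we want gone

pruned_layers = prune_module(layers, module)
x = [1.0, 2.0]
print("before pruning:", forward(layers, x))
print("after pruning: ", forward(pruned_layers, x))
```

Only the second output changes here, because only it reads from the pruned neuron. In a real network the interesting question is how much of the behaviour we wanted to keep also depended on the removed module.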
Of course, removing significant chunks of a neural network will affect its performance on the tasks we do want it to achieve. But it’s possible that retraining it from that point will allow it to regain the functionality we’re interested in without fully recreating the modules we’re worried about. This would be particularly valuable in cases where extensive pre-training is doing a lot of work in developing our agents’ capabilities, because that pre-training tends to be hard to control. For instance, it’s difficult to remove all offensive content from a large corpus of internet data, and so language models trained on such a corpus usually learn to reiterate that offensive content. Hypothetically, though, if we were able to observe small clusters of neurons which were most responsible for encoding this content, and zeroed out the corresponding parameters, then we could subsequently continue training on smaller corpora with more trustworthy content. While this particular example is quite speculative, I expect the general principle to be more straightforwardly applicable for agents that are pre-trained in multi-agent environments, in which they may acquire a range of dangerous traits like aggression or deception.
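One way the retraining step above could be set up is sketched below on a toy linear model. Note the added assumption: the pruned weight is held at zero throughout retraining via a mask, which is only one possible variant (the text above just says retraining might not fully recreate the module). The function names, data, learning rate, and step count are all made up for illustration.

```python
from typing import List


def mse(w: List[float], inputs: List[List[float]], targets: List[float]) -> float:
    """Mean squared error of the linear model y = w . x."""
    errors = [sum(wi * xi for wi, xi in zip(w, x)) - y for x, y in zip(inputs, targets)]
    return sum(e * e for e in errors) / len(errors)


def retrain_with_mask(
    weights: List[float],
    mask: List[int],
    inputs: List[List[float]],
    targets: List[float],
    lr: float = 0.1,
    steps: int = 200,
) -> List[float]:
    """Full-batch gradient descent on MSE; weights with mask 0 stay at zero."""
    w = [wi * m for wi, m in zip(weights, mask)]  # apply the pruning first
    n = len(inputs)
    for _ in range(steps):
        grads = [0.0] * len(w)
        for x, y in zip(inputs, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grads[j] += 2.0 * err * xj / n
        # Re-applying the mask after each update keeps pruned weights at zero.
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grads, mask)]
    return w


# Third feature stands in for the pruned pathway; targets depend only on the first two.
inputs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 1.0, 1.0]]
targets = [2.0, -1.0, 1.0, 3.0]  # generated by y = 2*x0 - 1*x1
pretrained = [0.5, 0.5, 1.5]
mask = [1, 1, 0]                 # 0 marks the pruned weight

just_pruned = [wi * m for wi, m in zip(pretrained, mask)]
retrained = retrain_with_mask(pretrained, mask, inputs, targets)

print("loss right after pruning:", mse(just_pruned, inputs, targets))
print("loss after retraining:   ", mse(retrained, inputs, targets))
print("retrained weights:       ", retrained)
```

In this toy case the remaining weights can recover the task entirely without the pruned one. Whether that holds for a large network, where the capabilities we want may be entangled with the module we removed, is exactly the open question.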
Module pruning also raises a counterintuitive possibility: that it may be beneficial to train agents to misbehave in limited ways, so that they develop specific modules responsible for those types of misbehaviour, which we can then remove. Of course, this suggestion is highly speculative. And, more generally, we should be very uncertain about whether advanced AIs will have modules that correspond directly to the types of skills we care about. But thinking about the safety of big neural networks in terms of emergent modules does seem like an interesting direction - both because the example of humans suggests that it’ll be useful, and also because it will push us towards lower-level and more precise descriptions of the types of cognition which our AIs carry out, and the types which we’d like to prevent.
Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.
Lots of reasons. Neural networks are modelled after brains. They both form distributed representations at very large scales, they both learn over time, etc etc. Sure, you've pointed out a few differences, but the similarities are so great that this should be the main anchor for our expectations (rather than, say, thinking that we'll understand NNs the same way we understand support vector machines, or the same way we understand tree search algorithms, or...).
I'm not convinced that these similarities are great enough to merit such anchoring. The fact that NNs have more in common with brains than with SVMs does not imply that we will understand NNs in roughly the same ways that we understand biological brains. We could understand them in ways that differ both from how we understand biological brains and from how we understand SVMs.
Rather than arguing over reference class, it seems like it would make more sense to note the specific ways in which NNs are similar to brains, and what hints those specific similarities provide.
Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post.
"Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.
The statement seems almost tautological – couldn't we somewhat similarly claim that we'll understand NNs in roughly the same ways that we understand houses, except where we have reasons to think otherwise? The "except where we have reasons to think otherwise" bit seems to be doing a lot of work.
Compare: when trying to predict events, you should use their base rate except when you have specific updates to it.
Similarly, I claim, our beliefs about brains should be the main reference for our beliefs about neural networks, which we can then update from.
I agree that the phrasing could be better; any suggestions?
I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger than that opening.
(On the other hand, if you're opening with it because that was your own main prior, then that makes sense. In that case, maybe note that it was a prior for you, but that the evidence from other directions is strong enough that we don't need to rely much on that prior?)
Thanks, that's helpful. I do think there's a weak version of this which is an important background assumption for the post (e.g. without that assumption I'd need to explain the specific ways in which ANNs and BNNs are similar), so I've now edited the opening lines to convey that weak version instead. (I still believe the original version but agree that it's not worth defending here.)
Yeah, I'm not trying to say that the point is invalid, just that phrasing may give the point more appeal than is warranted from being somewhat in the direction of a deepity. Hmm, I'm not sure what better phrasing would be.
Relevant related work: NNs are surprisingly modular
On the topic of pruning neural networks, see the lottery ticket hypothesis
I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular.
The same authors also recently published Detecting Modularity in Deep Neural Networks.
It's true! Although I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.