If the trajectory of the deep learning paradigm continues, it seems plausible to me that in order for applications of low-level interpretability to AI not-kill-everyone-ism to be truly reliable, we will need a much better-developed and more general theoretical and mathematical framework for deep learning than currently exists. And this sort of work seems difficult. Doing mathematics carefully - in particular finding correct, rigorous statements and then finding correct proofs of those statements - is slow. So slow that the rate of change of cutting-edge engineering practices significantly worsens the difficulties involved in building theory at the right level of generality. And, in my opinion, much slower than the rate at which we can generate informal observations that might possibly be worthy of further mathematical investigation. Thus it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena.
My impression, however, is that the best applied mathematics doesn’t tend to work like this. Although the use of mathematics in a given field may initially be reactive and disunited, one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste. One consequence of this (which is a downside and also probably partly due to the inherent limitations of human mathematics) is that mathematics does not tend to act as an objective tool that you can bring to bear on whatever question it is that you want to think about. Instead, the very practice of doing mathematics seeks out the questions that mathematics is best placed to answer. It cannot be used to say something useful about just anything; rather it finds out what it is that it can say something about.
Even after taking into account these limitations and reservations, developing something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI' might still be a fruitful endeavour. In case it is not clear, this is, roughly speaking, because a) many people are putting a lot of hope and resources into low-level interpretability; b) its biggest hurdles will be making it 'work' at large scale, on large models, quickly and reliably; and c) - the sentiment I opened this article with - doing this latter thing might well require much more sophisticated general theory.
In thinking about some of these themes, I started to mull over a couple of illustrative analogies or examples. The first - and more substantive example - is algebraic topology. This area of mathematics concerns itself with certain ways of assigning mathematical (specifically algebraic) information to shapes and spaces. Many of its foundational ideas have beautiful informal intuitions behind them, such as the notion that a shape may have enough space in it to contain a sphere, but not enough space to contain the ball that that sphere might have demarcated. Developing these informal notions into rigorous mathematics was a long and difficult process, and learning this material - even now when it is presented in its best form - is laborious. The mathematical details themselves do not seem beautiful or geometric or intuitive, and working through them is a slow and alienating process. One has to begin by working with very low-level concrete details - such as how to define the boundary of a triangle in a way that respects the ordering of the vertices - details that can sometimes seem far removed from the type of higher-level concepts that one was originally trying to capture and say something about. But once one has done the hard work of carefully building up the rigorous theory and internalizing its details, the pay-off can be extreme. Your vista opens back up and you are rewarded with very flexible and powerful ways of thinking (in this case about potentially very complicated higher-dimensional shapes and spaces). Readers may recognize this general story as a case of Terry Tao's now very well-known "three stages" of mathematical education, just applied specifically to algebraic topology.
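To give a flavour of the kind of low-level detail I mean, here is a minimal sketch of the 'boundary of a triangle' computation. The representation (a chain as a dictionary from ordered vertex tuples to integer coefficients) and the function name `boundary` are my own illustrative choices, not a standard library API; the sign rule is the usual alternating one, in which dropping the i-th vertex contributes a sign of (-1)^i.

```python
from collections import defaultdict


def boundary(chain):
    """Boundary of a formal integer combination of ordered simplices.

    `chain` maps a tuple of vertices (an ordered simplex) to an integer
    coefficient. This encoding is an illustrative assumption.
    """
    result = defaultdict(int)
    for simplex, coeff in chain.items():
        if len(simplex) == 1:
            continue  # the boundary of a single point is zero
        for i in range(len(simplex)):
            # Drop the i-th vertex; the sign alternates with i.
            face = simplex[:i] + simplex[i + 1:]
            result[face] += (-1) ** i * coeff
    # Discard faces whose coefficients cancelled out.
    return {s: c for s, c in result.items() if c != 0}


triangle = {("a", "b", "c"): 1}
edges = boundary(triangle)
# edges is (b,c) - (a,c) + (a,b): the three sides, with orientation signs.

print(boundary(edges))  # prints {}: the boundary of a boundary vanishes
```

The fact that the second boundary is empty is exactly where the vertex-ordering bookkeeping earns its keep: with the signs dropped, the endpoints of the three edges would not cancel, and the algebra would no longer distinguish a loop that bounds something from one that does not.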
I additionally want to point out that within pure mathematics, algebraic topology often has an applicable and computational flavour too, in the sense that there is something like a toolkit of methods from algebraic topology that one can bring to bear on previously unseen spaces and shapes in order to get information about them. So, to try to summarize and map this story onto the interpretability of deep learning-based AI, some of my impressions are that:
The second illustrative example that I have in mind is mathematical physics. This isn't a subject that I know a lot about and so it's perfectly possible that I end up misrepresenting things here, but essentially it is the prototypical example of the kind of thing I am getting at. In very simplified terms, successes of mathematical physics might be said to follow a pattern in which informal and empirically-grounded thinking eventually results in the construction of sophisticated theoretical and mathematical frameworks, which in turn leads to the phase in which the cycle completes and the mathematics of those frameworks provides real-world insights and predictions. Moreover, this latter stage often doesn't look like stating and proving theorems, but rather 'playing around' with the appropriate mathematical objects at just the right level of rigour, often using them over and over again in computations (in the pure math sense of the word) pertaining to specific examples. One can imagine wishing that something like this might play out for 'the mathematics of interpretability'.
Perhaps the most celebrated set of examples of this kind of thing comes from the representation theory of Lie groups. Again, I know little about the physics so will avoid going into detail, but the relevant point here is that the true descriptive, explanatory and predictive relevance of something like the representation theory of Lie groups was not unlocked by physicists alone. The theory only became quite so highly developed because a large community of 'pure' mathematicians pursuing all sorts of related questions to do with smooth manifolds, groups, representation theory in general etc. helped to mature the area.
One (perhaps relatively unimportant) difference between this story and the one we want to tell for AI not-kill-everyone-ism is that the typical mathematician studying, say, representation theory in this story might well have been doing so for mostly 'pure' mathematical reasons (and not because they thought their work might one day be part of a framework that predicts the behaviour of fundamental forces or something), whereas we are suggesting developing mathematical theory while remaining guided by the eventual application to AI. A more important difference - and a more genuine criticism of this analogy - is that mathematical physics is of course applied to the real, natural world. And perhaps there really is something about nature that makes it fundamentally amenable to mathematical description in a way that just won't apply to a large neural network trained by some sort of gradient descent? Indeed one does have the feeling that the endeavour we are focussing on would have to be underpinned by a hope that there is something sufficiently 'natural' about deep learning systems that will ultimately make at least some higher-level aspects of them amenable to mathematical analysis. Right now I cannot say how big of a problem this is.
I will try to sum up:
I'm very interested in comments and thoughts.