There’s a common perception that various non-deep-learning ML paradigms - like logic, probability, causality, etc - are very interpretable, whereas neural nets aren’t. I claim this is wrong.

It’s easy to see where the idea comes from. Look at the sort of models in, say, Judea Pearl’s work. Like this:


It says that either the sprinkler or the rain could cause a wet sidewalk, season is upstream of both of those (e.g. more rain in spring, more sprinkler use in summer), and sidewalk slipperiness is caused by wetness. The Pearl-style framework lets us do all sorts of probabilistic and causal reasoning on this system, and it all lines up quite neatly with our intuitions. It looks very interpretable.

The problem, I claim, is that a whole bunch of work is being done by the labels. “Season”, “sprinkler”, “rain”, etc. The math does not depend on those labels at all. If we code an ML system to use this sort of model, its behavior will also not depend on the labels at all. They’re just suggestively-named LISP tokens. We could use the exact same math/code to model some entirely different system, like my sleep quality being caused by room temperature and exercise, with both of those downstream of season, and my productivity the next day downstream of sleep.


We could just replace all the labels with random strings, and the model would have the same content:


Now it looks a lot less interpretable.
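As a minimal sketch of that point, here is a brute-force inference routine run on the sprinkler model and then on a relabeled copy. Everything specific is an assumption made for illustration: "season" is treated as a binary variable, the conditional probabilities are made-up numbers, and `posterior`, `relabel`, and the random strings in `mapping` are hypothetical names, not part of any library.

```python
from itertools import product

def posterior(parents, cpt, query, evidence):
    """P(query=True | evidence), by enumerating every assignment of binary nodes."""
    nodes = list(parents)
    num = den = 0.0
    for values in product([False, True], repeat=len(nodes)):
        world = dict(zip(nodes, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = 1.0
        for n in nodes:
            p_true = cpt[n][tuple(world[q] for q in parents[n])]
            p *= p_true if world[n] else 1.0 - p_true
        den += p
        if world[query]:
            num += p
    return num / den

# Graph structure: each node maps to a tuple of its parents.
parents = {
    "season": (),
    "sprinkler": ("season",),
    "rain": ("season",),
    "wet": ("sprinkler", "rain"),
    "slippery": ("wet",),
}
# Made-up numbers: P(node=True | parent values).
cpt = {
    "season": {(): 0.5},
    "sprinkler": {(False,): 0.1, (True,): 0.6},
    "rain": {(False,): 0.5, (True,): 0.2},
    "wet": {(False, False): 0.01, (False, True): 0.9,
            (True, False): 0.8, (True, True): 0.95},
    "slippery": {(False,): 0.05, (True,): 0.7},
}

def relabel(parents, cpt, mapping):
    """Rename every node; the structure and the numbers are untouched."""
    new_parents = {mapping[n]: tuple(mapping[p] for p in ps)
                   for n, ps in parents.items()}
    new_cpt = {mapping[n]: table for n, table in cpt.items()}
    return new_parents, new_cpt

mapping = {"season": "xq7", "sprinkler": "b3k", "rain": "zz9",
           "wet": "m0v", "slippery": "q4t"}
parents2, cpt2 = relabel(parents, cpt, mapping)

p_named = posterior(parents, cpt, "rain", {"slippery": True})
p_random = posterior(parents2, cpt2, "zz9", {"q4t": True})
print(p_named, p_random)  # the two values agree
```

The inference code never inspects the strings beyond using them as dictionary keys, so any renaming that keeps the keys distinct gives the same answer.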

Perhaps that seems like an unfair criticism? Like, the causal model is doing some nontrivial work, but connecting the labels to real-world objects just isn’t the problem it solves?

… I think that’s true, actually. But connecting the internal symbols/quantities/data structures of a model to external stuff is (I claim) exactly what interpretability is all about.

Think about interpretability for deep learning systems. A prototypical example of what successful interpretability might look like: we find a neuron which robustly lights up specifically in response to trees. It’s a tree-detector! That’s highly interpretable: we know what that neuron “means”, what it corresponds to in the world. (Of course in practice single neurons are probably not the thing to look at, and also the word “robustly” is doing a lot of subtle work, but those points are not really relevant to this post.)

The corresponding problem for a logic/probability/causality-based model would be: take a variable or node, and figure out what thing in the world it corresponds to, ignoring the not-actually-functionally-relevant label. Take the whole system, remove the labels, and try to rederive their meanings.

… which sounds basically-identical to the corresponding problem for deep learning systems.

We are no more able to solve that problem for logic/probability/causality systems than we are for deep learning systems. We can have a node in our model labeled “tree”, but we are no more (or less) able to check that it actually robustly represents trees than we are for a given neuron in a neural network. Similarly, if we find that it does represent trees and we want to understand how/why the tree-representation works, all those labels are a distraction.

One could argue that we’re lucky deep learning is winning the capabilities race. At least this way it’s obvious that our systems are uninterpretable, that we have no idea what’s going on inside the black box, rather than our brains seeing the decorative natural-language name “sprinkler” on a variable/node and then thinking that we know what the variable/node means. Instead, we just have unlabeled nodes - an accurate representation of our actual knowledge of the node’s “meaning”.



Two potentially relevant distinctions:

  • Is your model produced by optimization for accuracy, or by optimizing local pieces for local accuracy (e.g. for your lane prediction algorithm accurately predicting lane boundaries), or by human engineering (e.g. pieces of the model reflecting facts that humans know about the world)? Most practical systems will do some of each.
  • Is your policy divided into pieces (like a generative model, inference algorithm, planner) which are optimized for local objectives, or is it all optimized end-to-end for performance? If most of your compute goes into inference and planning, then the causal model inside an AI with a given level of performance is likely to be much smaller (and have simpler concepts) than a neural net with the same performance.

Those are kind of similar to each other; they are both talking about how much you optimize pieces to do a job defined by human expectations (with human reasoning about how the pieces fit together) vs optimizing end to end.

I think these are real distinctions that increase how far you can scale before running into catastrophic misalignment. They may come with big performance tradeoffs, since end-to-end optimization is quite powerful. You may be able to claw back some runtime inefficiency by distilling a slow inference or planning process into a faster neural network, but this probably only gets you so far and reintroduces some alignment issues.

As far as I can tell, people who feel like probabilistic models are better than deep learning are mostly optimistic because of this kind of distinction.

Great points. I definitely agree with your argument quantitatively: these distinctions mean that a probabilistic model will be quantitatively more interpretable for the same system, or be able to handle more complex systems for a given interpretability metric (like e.g. "running into catastrophic misalignment").

That said, it does seem like the vast majority of interpretability for both probabilistic and ML systems is in "how does this internal stuff correspond to stuff in the world". So qualitatively, it seems like the central interpretability problem is basically the same for both.

Yeah, I agree that if you learn a probabilistic model then you mostly have a difference in degree rather than difference in kind with respect to interpretability. It's not super clear that the difference in degree is large or important (it seems like it could be, just not clear). And if you aren't willing to learn a probabilistic model, then you are handicapping your system in a way that will probably eventually be a big deal.

It sounds to me like, in the claim "deep learning is uninterpretable", the key word in "deep learning" that makes this claim true is "learning", and you're substituting the similar-sounding but less true claim "deep neural networks are uninterpretable" as something to argue against. You're right that deep neural networks can be interpretable if you hand-pick the semantic meanings of each neuron in advance and carefully design the weights of the network such that these intended semantic meanings are correct, but that's not what deep learning is. The other things you're comparing it to that are often called more interpretable than deep learning are in fact more interpretable than deep learning, not (as you rightly point out) because the underlying structures they work with are inherently more interpretable, but because they aren't machine learning of any kind.

I think that a probabilistic generative model with fewer nodes / elements can match the performance of a deep net with more nodes. But I think "fewer nodes" is still going to be millions or billions or trillions of nodes for an AGI, and I strongly agree with you that such a thing is going to be uninterpretable by default (i.e., absent some revolutionary advance in interpretability techniques). People doing probabilistic programming research seem to disagree with this (i.e. that their research is leading towards uninterpretable-by-default systems), for reasons I can't understand.

Adding some thoughts as someone who works on probabilistic programming, and has colleagues who work on neurosymbolic approaches to program synthesis:

  • I think a lot of Bayes net structure learning / program synthesis approaches (Bayesian or otherwise) have the issue of uninformative variable names, but I do think it's possible to distinguish between structural interpretability and naming interpretability, as others have noted.
  • In practice, most neural or Bayesian program synthesis applications I'm aware of exhibit something like structural interpretability, because the hypothesis space they live in is designed by modelers to have human-interpretable semantic structure. Two good examples of this are the prior over programs that generate handwritten characters in Lake et al (2015), and the PCFG prior over Gaussian Process covariance kernels in Saad et al (2019). See e.g. Figure 6 on how you perform analysis on programs generated by this prior, to determine whether a particular timeseries is likely to be periodic, has a linear trend, has a changepoint, etc.
  • Regarding uninformative variable names, there's ongoing work on using natural language to guide program synthesis, so as to come up with more language-like conceptual abstractions (e.g. Wong et al 2021). I wouldn't be surprised if these approaches could also be extended to come up with informative variable and function names / comments. A related line of work is that people are starting to use LLMs to deobfuscate code (e.g. Lachaux et al 2021), and I expect the same techniques will work for synthesized code.
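A rough sketch of what structural interpretability can look like in code, under loud assumptions: the grammar below is a hypothetical toy (`K -> K + K | K * K | LIN | PER | SE` with uniform rule weights), not the prior from Saad et al (2019), and it only samples from the prior rather than conditioning on a timeseries. The names `sample_kernel`, `uses`, and `feature_frequency` are invented for illustration.

```python
import random

# Hypothetical toy PCFG over kernel expressions, with made-up uniform weights.
RULES = [("+", 0.2), ("*", 0.2), ("LIN", 0.2), ("PER", 0.2), ("SE", 0.2)]
BASE_KERNELS = ["LIN", "PER", "SE"]

def sample_kernel(rng, depth=0, max_depth=3):
    """Sample an expression tree: a base-kernel name or (op, left, right)."""
    if depth >= max_depth:
        return rng.choice(BASE_KERNELS)  # force termination at the depth cap
    symbols, weights = zip(*RULES)
    choice = rng.choices(symbols, weights=weights)[0]
    if choice in ("+", "*"):
        return (choice,
                sample_kernel(rng, depth + 1, max_depth),
                sample_kernel(rng, depth + 1, max_depth))
    return choice

def uses(expr, base):
    """Structural query: does the expression contain the given base kernel?"""
    if isinstance(expr, str):
        return expr == base
    _, left, right = expr
    return uses(left, base) or uses(right, base)

def feature_frequency(base, n=2000, seed=0):
    """Fraction of sampled programs whose structure contains `base`."""
    rng = random.Random(seed)
    return sum(uses(sample_kernel(rng), base) for _ in range(n)) / n

example = ("+", "LIN", ("*", "PER", "SE"))
print(uses(example, "PER"))       # True: the tree contains a periodic component
print(feature_frequency("PER"))   # prior frequency of a periodic component
```

Because each sampled model is a small expression tree, a question like "is there a periodic component?" reduces to a tree traversal. With samples drawn from a posterior instead of the prior, the same count would estimate the probability that the data has that feature.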

For these reasons, I'm more optimistic about the interpretability prospects of learning approaches that generate models or code that look like traditional symbolic programs, relative to end-to-end deep learning approaches. (Note that neural networks are also "symbolic programs", just written with a more restricted set of [differentiable] primitives, and typically staying within a set of widely used program structures [i.e. neural architectures]). 

The more difficult question IMO is whether this interpretability comes at the cost of capabilities. I think this is possibly true in some domains (e.g. learning low-level visual patterns and cues), but not others (e.g. learning the compositional structure of furniture-like objects).

Thanks for your reply!

In practice, most neural or Bayesian program synthesis applications I'm aware of exhibit something like structural interpretability, because the hypothesis space they live in is designed by modelers to have human-interpretable semantic structure. Two good examples of this…

When I squint out towards the horizon, I see future researchers trying to do a Bayesian program synthesis thing that builds a generative model of the whole world—everything from “tires are usually black”, to “it’s gauche to wear white after labor day”, to “in this type of math problem, maybe try applying the Cauchy–Schwarz inequality”, etc. etc. etc.

I’m perfectly happy to believe that Lake et al. can program-synthesis a little toy generative model of handwritten characters such that it has structural interpretability. But I’m concerned that we’ll work our way up to the thing in the previous paragraph, which might be a billion times more complicated, and it will no longer have structural interpretability.

(And likewise I’m concerned that solutions to “uninformative variable names” won’t scale—e.g., how are we going to automatically put English-language labels on the various intuitive models / heuristics that are involved when Ed Witten is thinking about math, or when MLK Jr is writing a speech?)

I'm more optimistic about the interpretability prospects of learning approaches that generate models or code that look like traditional symbolic programs, relative to end-to-end deep learning approaches [emphasis added]

Nominally, I agree with this. But “relative to” is key here.

Your takeaway seems to be “OK, great, let’s do probabilistic generative models, they’re better!”.

By contrast, my perspective is: “If we take the probabilistic generative model approach, we’re in huge trouble with respect to interpretability, oh man this is really really bad, we gotta work on this ASAP!!! (Oh and by the way if we take the deep net approach then it’s even worse.)”.

We could probably use a term or a phrase for this concept since it keeps coming up and is a fundamental problem. How about:

Any model simple enough to be interpretable is too simple to be useful.


Any model which appears both useful and interpretable is uninterpretable.

On the contrary, I think there exist large, complex, symbolic models of the world that are far more interpretable and useful than learned neural models, even if too complex for any single individual to understand, e.g.:

- The Unity game engine (a configurable model of the physical world)
- Pixar's RenderMan renderer (a model of optics and image formation)
- The GLEAMviz epidemic simulator (a model of socio-biological disease spread at the civilizational scale)

Humans are capable of designing and building these models, and learning how to build/write them as they improve their understanding of the world. The difficult part is how we can recapitulate that ability -- program synthesis is still in its infancy here, but IMO contemporary end-to-end deep learning methods seem unlikely to deliver if we want both interpretability and usefulness.

I agree that gwern’s proposal “Any model simple enough to be interpretable is too simple to be useful” is an exaggeration. Even the Lake et al. handwritten-character-recognizer is useful.

I would have instead said “Any model simple enough to be interpretable is too simple to be sufficient for AGI”.

I notice that you are again bringing the discussion back to a comparison between program synthesis world-models versus deep learning world-models, whereas I want to talk about the possibility that neither would be human-interpretable by the time we reach AGI level.

I agree with your point that blobs of bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.

In particular, an AI system based on a big bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is thereby basically unconstrained in the type of work it's allowed to perform.

All else equal, more structure in your AI should mean less room for dangerous computations, and lower the surface area you need to inspect.

I'd crystallize the argument here as something like: suppose we're analyzing a neural net doing inference, and we find that its internal computation is implementing <algorithm> for Bayesian inference on <big Bayes net>. That would be a huge amount of interpretability progress, even though the "big Bayes net" part is still pretty uninterpretable.

When we use Bayes nets directly, we get that kind of step for free.

... I think that's a decent argument, and I at least partially buy it.

A neural net lacks this structure, and is thereby basically unconstrained in the type of work it's allowed to perform.

That said, if we compare a neural net directly to a Bayes net (as opposed to inference-on-a-Bayes-net), they have basically the same structure: both are circuits. Both constrain locality of computation.

Any network big enough to be interesting is big enough that the programmers don't have the time to write decorative labels. If you had some algorithm that magically produced a Bayes net with a billion intermediate nodes that accurately did some task, then it would also be an obvious black box. No one will have come up with a list of a billion decorative labels.