One pretty major problem with today’s interpretability methods (e.g. work by Chris Olah & co) is that we have to redo a bunch of work whenever a new net comes out, and even more work when a new architecture or data modality comes along (e.g. transformers and language models).

This breaks one of the core load-bearing pillars of scientific progress: scientific knowledge is cumulative (approximately, in the long run). Newton saw far by standing on the shoulders of giants, etc. We work out the laws of mechanics/geometry/electricity/chemistry/light/gravity/computation/etc, and then reuse those laws over and over again in millions of different contexts without having to reinvent the core principles every time. Economically, that’s why it makes sense to pump lots of resources into scientific research: it’s a capital investment. The knowledge gained will be used over and over again far into the future.

But for today’s interpretability methods, that model fails. Some principles and tools do stick around, but most of the knowledge gained is reset when the next hot thing comes out. Economically, that means there isn’t much direct profit incentive to invest in interpretability; the upside of the research is mostly the hope that people will figure out a better approach.

One way to frame the Selection Theorems agenda is as a strategy to make interpretability transferable and reusable, even across major shifts in architecture. Basic idea: a Selection Theorem tells us what structure is selected for in some broad class of environments, so operating such theorems “in reverse” should tell us what environmental features produce observed structure in a trained net. Then, we run the theorems “forward” to look for the analogous structure produced by the same environmental features in a different net.

## How would Selection Theorems allow reuse?

An example story for how I expect this sort of thing might work, once all the supporting pieces are in place:

- While probing a neural net, we find a circuit implementing a certain filter.
- We’ve (in this hypothetical) previously shown that a broad class of neural nets learn natural abstractions, so we work backward through those theorems and find that the filter indeed corresponds to a natural abstraction in the dataset/environment.
- We can then work forward through the theorems to identify circuits corresponding to the same natural abstraction in other nets (potentially with different architectures) trained on similar data.

In this example, the hypothetical theorem showing that a broad class of neural nets learn natural abstractions would be a Selection Theorem: it tells us about what sort of structure is selected for in general when training systems in some class of environments.

More generally, the basic use-case of Selection-Theorem-based interpretability would be:

1. Identify some useful interpretable structure in a trained neural net (similar to today’s work).
2. Work backward through some Selection Theorems to find out what features of the environment/architecture selected for that structure.
3. Work forward through the theorems to identify corresponding internal structures in other nets.

Assuming that the interpretable structure shows up robustly, we should expect a fairly broad class of environments/architectures will produce similar structure, so we should be able to transfer structure to a fairly broad class of nets - even other architectures which embed the structure differently. (And if the interpretable structure doesn’t show up robustly, i.e. the structure is just an accident of this particular net, then it probably wasn’t going to be useful for future nets anyway.)

## Other Connections

This picture connects closely to the convergence leg of the Natural Abstraction Hypothesis: “a wide variety of cognitive architectures learn and use approximately-the-same internal summaries”. The Selection Theorem approach to interpretability looks for those convergent summaries, and tries to directly prove that they are convergent. The Selection Theorems approach is also potentially more general than that; for instance, we might characterize the conditions in which an internal optimization process is selected for, or a self-model. But that sort of high-level structure is probably a ways out yet; I expect theorems about simple classes of natural abstractions to come along first.

It also connects closely to the Pointers Problem. Once we can locate convergent “concepts” in a trained system, and know what real-world things they correspond to, we can potentially “point” systems directly at particular real-world things.

## Why Not Just…

One could imagine pursuing a similar strategy without the Selection Theorems. For instance, why not just directly look for isomorphic structures in two neural nets?

… ok, that question basically answers itself. “Look for isomorphic structures in two neural nets” immediately brings to mind superposed images of a grad student sobbing and a neon sign flashing “NP COMPLETE”.

(As with any problem which immediately brings to mind superposed images of a grad student sobbing and a neon sign flashing “NP COMPLETE”, someone will inevitably suggest that we Use Neural Networks to solve it. In response to that suggestion, I will severely misquote James Mickens:

> This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. The point of interpretability is that we don’t trust the magic black boxes without directly checking what’s going on inside of them, so asking a magic black box to interpret another magic black box is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

)
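To give a rough feel for why naive structure-matching is scary, here is a minimal sketch of the brute-force version on a toy problem. Everything in it is made up for illustration: the one-hidden-layer weight format and the helper names `permute_hidden` and `find_matching_permutation` are hypothetical, not from any interpretability library. It is also the friendliest possible case, since net B is an exact relabeling of net A's hidden units; that special case could be solved far more cheaply (e.g. by sorting units), so the sketch only illustrates the size of the naive search space, not a hardness result.

```python
import itertools
import math
import random


def permute_hidden(w_in, w_out, perm):
    """Relabel the hidden units of a toy one-hidden-layer net.

    w_in[h] is the list of input weights of hidden unit h,
    w_out[h] is that unit's single outgoing weight.
    """
    return [w_in[h] for h in perm], [w_out[h] for h in perm]


def find_matching_permutation(net_a, net_b, tol=1e-9):
    """Brute force: try every relabeling of net_a's hidden units."""
    w_in_a, w_out_a = net_a
    w_in_b, w_out_b = net_b
    tried = 0
    for perm in itertools.permutations(range(len(w_out_a))):
        tried += 1
        cand_in, cand_out = permute_hidden(w_in_a, w_out_a, perm)
        same_in = all(
            abs(x - y) <= tol
            for row_a, row_b in zip(cand_in, w_in_b)
            for x, y in zip(row_a, row_b)
        )
        same_out = all(abs(x - y) <= tol for x, y in zip(cand_out, w_out_b))
        if same_in and same_out:
            return perm, tried
    return None, tried


# Toy setup (an assumption for illustration): net B is net A with its
# hidden units secretly shuffled.
rng = random.Random(0)
n_hidden, n_inputs = 6, 3
w_in = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_out = [rng.gauss(0, 1) for _ in range(n_hidden)]
net_a = (w_in, w_out)

secret = list(range(n_hidden))
rng.shuffle(secret)
net_b = permute_hidden(w_in, w_out, secret)

perm, tried = find_matching_permutation(net_a, net_b)
print(f"recovered {perm} after trying {tried} of "
      f"{math.factorial(n_hidden)} candidate relabelings")

# The candidate count grows factorially with layer width.
for width in (6, 12, 24):
    print(width, math.factorial(width))
```

Real nets are worse on every axis: the correspondence is approximate rather than exact, a structure may be spread across many units, and the two nets may not share an architecture at all, so there is not even an obvious set of relabelings to enumerate.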

Point is, to translate structure from one net to another, especially across an architectural shift, we need some way to know what to look for and where to look for it. Selection Theorems will, I expect, give us that.

The other problem with just looking for similar structures directly is that we have no idea when an architectural shift will break our methods. I expect Selection Theorems will be general enough to basically solve that problem.

The Gap

There’s still a wide gap to cross before we can directly use Selection Theorems for interpretability. We need the actual theorems.

The good news is, interpretability has relatively good feedback loops, and we can potentially carry those over to research on Selection Theorems. We can look for structure in a net, and empirically test how sensitive that structure is to features in the environment, in order to provide data and feedback for our theorem-development.
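As a minimal sketch of what such a feedback loop could look like, the toy below varies one feature of the environment and measures how strongly a particular internal structure shows up after training. All the specifics are assumptions chosen for illustration: the "net" is just a logistic regression, the environmental feature is a hypothetical `signal` knob controlling how predictive input 0 is of the label, and the "structure" is simply the magnitude of the learned weight on that input.

```python
import math
import random


def make_dataset(signal, n=400, seed=0):
    """Toy environment: input 0 carries the label with strength `signal`,
    input 1 is pure noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x0 = (2 * y - 1) * signal + rng.gauss(0, 1)
        x1 = rng.gauss(0, 1)
        data.append(((x0, x1), y))
    return data


def train_logistic(data, steps=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in data:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x0 + w[1] * x1 + b)))
            err = p - y
            g0 += err * x0
            g1 += err * x1
            gb += err
        w[0] -= lr * g0 / n
        w[1] -= lr * g1 / n
        b -= lr * gb / n
    return w, b


def structure_strength(w):
    """Stand-in for 'how strongly the structure is present':
    magnitude of the learned weight on input 0."""
    return abs(w[0])


# Sweep the environmental feature and record the resulting structure.
signals = [0.0, 0.5, 1.0, 2.0]
results = []
for s in signals:
    w, _ = train_logistic(make_dataset(s))
    results.append((s, structure_strength(w)))

for s, strength in results:
    print(f"signal={s:.1f}  |w0|={strength:.3f}")
```

A curve like this (structure strength as a function of an environmental feature) is the kind of data a candidate theorem would have to predict; for real nets, both the knob and the structure detector would be much harder to define.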
