Chris Olah wrote the following topic prompt for the Open Phil 2021 request for proposals on the alignment of AI systems. We (Asya Bergal and Nick Beckstead) are running the Open Phil RFP and are posting each section as a sequence on the Alignment Forum. Although Chris wrote this document, we didn’t want to commit him to being responsible for responding to comments on it by posting it.
Summary: We would like to see research building towards the ability to "reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.
Potential safety failures of neural networks might be thought of as falling into two broad categories: known safety problems and unknown safety problems. A known safety problem is one which can be easily anticipated in advance of deploying a model or easily observed in the model's behavior. Such safety failures can be easily caught with testing, and it seems reasonable to hope that they will be fixed with human feedback. But it seems like there’s much less of a clear story for how we’ll resolve unknown safety problems -- the things we didn’t think to test for and wouldn’t obviously give feedback to fix.
Even if we’re able to anticipate certain safety problems, we may not know if we’re sufficiently disincentivizing them. A model might behave well on the training distribution, then unexpectedly exhibit some safety failure in a context it wasn’t trained in. In the extreme, a model might make a “treacherous turn” -- it may use its understanding of the training setup to deliberately behave well only during training, then pursue different goals once it knows it’s outside of the training distribution.
In traditional software engineering, our ability to mitigate unanticipated safety problems largely flows from our ability to understand and carefully reason about code. While testing may catch various kinds of easy to observe or anticipated problems, we rely on code reviews, careful engineering, and even systematic verification to avoid other problems. These approaches are only possible because we can understand code for normal computer programs, something we don’t, by default, get with neural networks.
Neural network parameters can be seen as the assembly instructions of a complex computer program. If it was possible to reverse engineer the parameters of trained neural networks into human-understandable algorithms, it may enable us to catch safety problems the same way we are able to in code.
Recent research has shown that it is possible to reverse engineer modern neural networks into human understandable computer programs, at least on a small scale. The Circuits Thread on Distill contains many examples of reading human understandable algorithms off the weights of neural networks.
There’s also some evidence that this kind of analysis can reveal unanticipated problems and concerns. An analysis of CLIP found that the model had neurons related to race, gender, age, religions, LGBT status, mental health, physical disability, pregnancy, and parental status. We can also mechanistically observe concerning uses of these protected attributes, such as an Asian culture neuron increasing the probability of “I feel pressured” or an LGBT neuron increasing the probability of “I feel accepted.” Although there is significant attention to bias in machine learning, it tends to be focused on a couple categories such as race and gender. Surfacing that a model represents other protected attributes is a proof of concept that mechanistic interpretability can surface unanticipated concerns in state of the art models.
We would like to see more research aimed at mechanistically understanding neural networks, at seriously reverse engineering trained models into understandable programs. While we think it’s most likely this will take the form of work in a similar vein to Circuits, we’re also open to other ideas. Research projects should meet the following desiderata:
At this stage, we’re interested in work that generically makes progress towards mechanistically understanding neural networks, but it’s worth noting that there are specific questions which are of particular interest from a safety perspective. For example:
Work that addresses one of these would be of particular interest.
The following sections will describe some more specific research directions we think are promising.
A useful aspirational goal in this work is to fully reverse engineer any modern neural network, such as an ImageNet classifier or modern language model. There are a number of ways you could potentially operationalize “fully reverse engineer”:
(It’s worth noting that all these can also be applied to parts of a model in addition to the full thing, and in fact have all roughly been achieved for curve circuits.)
At the moment, no one fully understands any non-trivial neural network. Demonstrating that it’s possible to do so is both a natural milestone, and might make society much more willing to invest in this kind of reverse engineering. This goal is aspirational, but a successful project should somehow advance us towards this goal.
We’re open to a range of possibilities as to how it could happen, as long as it’s a genuine milestone towards understanding powerful models. For example, this could be achieved by simply reverse engineering existing models, but a model engineered to be interpretable in some way also seems like fair game, provided it’s equally performant and easy to train. (At the moment, InceptionV1 is by far the closest thing we have to a fully reverse engineered model, about 30% reverse engineered as measured by neuron count. But reverse engineering another model would be equally useful as a milestone.)
Note that even if we are able to fully understand a modern neural network, we are far from our ultimate, even more ambitious goal: a general method for understanding neural networks of arbitrary size and sophistication.
In the Circuits approach, neural networks are composed of features and circuits. Characterizing features and circuits is the most direct way to move us towards fully understanding neural networks. Examples of doing this kind of work can be found in the Circuits thread (see especially Curve Detectors, Curve Circuits, and High-Low Frequency Detectors; see also Visualizing Weights for more detailed discussion of methods for studying circuits).
There are several reasons why it might be useful to study specific features and circuits, especially as a starter project:
Results on new features and circuits are especially interesting if:
The main weakness of the circuits approach is that, by focusing on small scale structure, it may not be able to scale to understanding large scale models. Curve Circuits is the largest example of reverse engineering a circuit to date, at ~50K parameters. This is many orders of magnitude smaller than modern language models. How might we bridge the gap?
One promising direction is to find additional structure which greatly simplifies mechanistically understanding neural networks, or at least simplifies understanding the safety relevant features. The circuits thread describes two such types of structure: “motifs” (recurring patterns in circuits, such as equivariance) and “structural phenomena” (large scale patterns in how neural networks are organized, such as branch specialization).
Equivariance can simplify circuits in early vision by as much as a factor of 50x, so there’s precedence for discovering structure that can give order of magnitude simplifications of neural networks. Similarly, one could imagine a world where something similar to work on branch specialization or research on modularity, can break neural networks into components, where only some of the components are necessary to audit for safety, massively reducing the work needed to ensure models are safe.
Research projects in this category would ideally:
Understanding the natural equivariance to simple transformations like rotation, scale and hue dramatically simplify the study of early vision in conv nets. Are there other kinds of equivariance that might simplify late vision? What about language models?
A fictional example of what a big success for this approach might look like is “It turns out large language models can mostly be understood in terms of a couple large families of neurons. One family stores factual knowledge with neurons parameterized by variables A, B, and C; the other major types of neurons are …”
A number of papers (eg. Filan et al. 2021) have been written about modularity in neural networks. Can we understand how these structures relate to the computation being performed in neural networks (branch specialization takes preliminary steps in this direction for the special case of architecturally enforced branches)? How well do various notions of modularity organize and simplify the features and circuits in neural networks? Are there other notions that better leverage the graph structure of neural networks to organize them? Can this work be used to isolate parts of a neural network relevant to particular safety questions (see “Specific Questions of Interest” above) and target them?
A fictional example of what a big success for this approach might look like is “We found a subgraph of neurons which is responsible for social reasoning and it’s only 0.1% of the model. If the model was deliberately trying to mislead us, it would turn up here.”
In addition to these approaches, we’re excited about other approaches to scaling interpretability to much larger models. These approaches should fit the general desiderata listed in the introduction, and have a clear story for why their success would enable significantly better scaling. They should also consider how feasible the approach is to study right now; for example, using large numbers of humans, or integrating with some alignment scheme that provides automation, might ultimately be crucial for scaling interpretability but challenging to study now.
Another major challenge to circuits is polysemanticity, where some neurons respond to multiple unrelated features. One theory is that polysemanticity occurs because models would ideally represent more features than they have neurons, and exploit the fact that high dimensional spaces can have many “almost orthogonal” directions to store these features in “superposition” across neurons. Polysemanticity seems related to questions about disentangling representations.
Polysemanticity makes circuits much harder to study and audit, and increases the risk that analysis will miss important structures. For example, imagine one is trying to audit a model for potential bias issues (we see bias as a proxy for a much broader class of safety concerns, but can be studied in models that exist today). If one was to find a circuit where a “female” neuron excites a “nurse” neuron but inhibits a “doctor” neuron, auditors would likely conclude that was an example of bias. But if a “wheel/female/tree” neuron excites a “car/nurse/waterfall” neuron, and there are a number of other neurons which seem to have overlapping features, it’s less clear what to make of it.
This list is far from comprehensive. There are many other questions we’d be interested to see proposals related to. To give a few examples:
We expect there are many other promising ideas we haven’t listed or considered. We would be excited to receive proposals for other interpretability projects that could help us catch unanticipated safety problems or guarantee good behavior in unusual contexts, if they meet the desiderata in the introduction. Proposals related to scaling mechanistic interpretability to larger models are of particular interest.
Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models. In particular, Visualizing Weights focuses on demonstrating and explaining some of the fundamental techniques. If one wants to study multimodal neurons, OpenAI has some code that may be helpful.
Nick Bostrom describes this failure mode in Superintelligence, p. 117:
“…one idea for how to ensure superintelligence safety… is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner. The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.” Additional discussions of the possibility of such failure modes can be found in Hubinger et al.’s Risks from Learned Optimization in in Advanced Machine Learning Systems (section 4, “Deceptive Alignment”) and Luke Muelhauser’s post, “Treacherous turns in the wild”. ↩︎
Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.
I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks so I wanted to share more specifically in case it saves anyone else some time.
The text "colab" isn't really turned up in a targeted Google search on Distill and the Circuits thread. Also if you open a post like Visualizing Weights and do a Ctrl+F or Cmd+F search for "colab", you won't turn up any results either.
But if you open that same post and scroll down, you'll see button links to open the colab notebooks that look like this. These will also turn up in in a Ctrl+F or Cmd+F search for "notebook" in case you want to jump around to different colab examples in the Circuits thread.
I left some comments here. My overall takeaway (as someone who hasn't worked in the area but cares about alignment) is that I'm very excited about this kind of interpretability research and especially work that focuses on small parts of models without worrying too much about scalability. It seems like interpretability could provide compelling warnings about future risks, is a big part of many of the existing concrete stories for how we might get aligned AI, and is reasonably likely to be helpful in unpredictable ways.