Big thanks to Leon Lang, Jérémy Scheurer, Adam Gleave, and Shoshannah Tekofsky for their feedback on a draft of this post, to Euan McLean (via FAR AI) for his feedback and a lot of help with editing, and to everyone else who discussed this agenda with me, in particular Johannes Treutlein for frequent research check-ins!


  • My current agenda is to develop a formal framework for thinking about abstractions of computations. These abstractions are ways to partially describe the “algorithm” a neural network or other computation is using, while throwing away irrelevant details.
  • Ideally, this framework would tell us 1) all possible abstractions of a given computation, and 2) which of these are most useful (for a specific purpose, such as detecting deception). “Useful” doesn’t necessarily mean “easily human-understandable”—I see that as an open question.
  • I anticipate the main applications to alignment to be automating interpretability or mechanistic anomaly detection. There are also potential connections to other alignment topics, such as natural abstractions or defining terms like “search process”.
  • This agenda is at an early stage (I have been thinking about it for ~2 months, and about related topics for another ~2 months before that). So feedback now could change my future direction.
  • I also list a few potential projects that seem self-contained. If you’re interested in working on any of those, or collaborating in some other way, please get in touch!


I encourage you to skip around and/or only read parts of the post. Here’s an overview:


This post doesn’t contain any actual theorems or experiments, so if you’re only interested in that, you can stop reading.


Humans can’t just look at the weights of a neural network and tell what it’s doing. There are at least two reasons for this:

  1. Neural network weights aren’t a format we’re great at thinking about.
  2. Neural networks are often huge.

The second point would likely apply to any system that does well on complicated tasks. For example, neural networks are decision trees, but this doesn’t mean we can look at the decision tree corresponding to a network and understand how it works. To reason about these systems, we will likely have to simplify them, i.e. throw away details that are irrelevant for whatever we want to find out. In other words, we are looking for abstractions of computations (such as neural networks).

Abstractions are already how we successfully reason about many other complicated systems. For example, if you want to understand the Linux kernel, you wouldn’t start by reading the entire source code top to bottom. Instead, you’d try to get a high-level understanding—what are the different modules, how do they interact? Similarly, we use pseudocode to more easily communicate how an algorithm works, abstracting away low-level details.

If we could figure out a general way to find useful abstractions of computations, or at least of neural networks, perhaps we could apply this to understand them in a similar way. We could even automate this process and mechanically search for human-understandable abstractions.

Making things easier to understand for humans isn’t the only application of abstractions. For example, abstractions have been used for more efficient theorem proving and in model checking (e.g. abstract interpretation). These abstractions need not ever be used by humans. While I don’t think we will be able to formally verify the properties we are most interested in for neural networks, the underlying ideas might still apply. Another example is mechanistic anomaly detection, where we only care about whether a computation “uses the same mechanism” on a new input as on earlier ones. Again, this doesn’t require that we understand those mechanisms. Because of this, I’m looking for a general theory of abstractions and what makes them useful, not just a theory of human-understandable abstractions.

What are abstractions of computations?


I’m thinking about ways we can throw away some information about a computational process, while keeping other parts. There are (roughly speaking) two kinds of information we can throw away:

  1. Details of the input-output behavior. For example, perhaps we only care about the 5 most significant digits of the output.
  2. Details of the internal implementation. For example, the description of a program in pseudocode might exactly specify the input-output behavior without specifying low-level implementation details (e.g. which sorting algorithm to use when the pseudocode says “sort this list”).

In practice we are often interested in abstractions that remove both types of information. (In fact, you have to throw away some mechanistic information to throw away information about behavior—given the exact implementation, the input-output behavior can be recovered.)

Let’s look at some examples:

  • Modeling a computer as executing machine code is an abstraction of logic gates.
  • As the name suggests, higher-level languages are abstractions of lower-level languages/machine code.
  • Type signatures: we can forget the behavior of parts of a computation and only remember the input and output types.
  • If I simulate Game of Life within Game of Life, then the simulated layer is a good abstraction of the low-level layer. More generally, using one computational model to simulate another one gives good abstractions.
  • Mechanistic interpretations of neural networks tend to be good abstractions. One class of examples is looking at a subcircuit of a network that explains a specific behavior (such as the indirect object identification circuit in GPT2-small). This throws away all information about parts of the network outside that subcircuit, and consequently about behavior of the network on most inputs. Additionally, essentially all mechanistic interpretations throw away some information about low-level details (such as the exact weights), which means they only approximately explain the network output.
  • Modularity/clusterability is a different way of abstracting networks (or other computations). For example, imagine a network that can play chess via learned AlphaZero-style MCTS. You could say “at a very high level, this network runs MCTS over moves—this submodule is the rollout policy, and this one the value function”. This throws away most of the information about both the algorithm of the network and its input-output behavior. But it does retain some interesting information.

Formal framework

I’m uncertain about what the right formal framework looks like. In this subsection and the next, I’ll describe my current guess, and then mention some issues it has in Limitations of this framework.


To formalize abstractions of computations, let's first formalize what we mean by "computations". I'm currently using a pretty simple and general approach: computations have a set M of memory states and a function s : M → M that performs a single computational step. We also need a function t : M → {0, 1} that tells us whether the computation has finished. Finally, we have a set X of possible input values and a set Y of possible output values, with functions e : X → M and d : M → Y that encode inputs as memory states and decode memory states back to outputs respectively. Any computation induces a function f : X → Y by executing s until the computation terminates.[1] This induced function describes the input-output behavior.

For example, in a deterministic finite automaton, the memory state is the current state of the automaton plus the remaining input. In a Turing machine, the memory state is the tape and the state register.
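To make the formalism concrete, here is a minimal Python sketch of a computation in the sense above. The names (`Computation`, `step`, `done`, `encode`, `decode`) and the toy example are my own choices; the post doesn't specify an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

M = TypeVar("M")  # memory states
X = TypeVar("X")  # inputs
Y = TypeVar("Y")  # outputs

@dataclass
class Computation(Generic[M, X, Y]):
    step: Callable[[M], M]      # s : M -> M, one computational step
    done: Callable[[M], bool]   # t : M -> {0, 1}, has the computation finished?
    encode: Callable[[X], M]    # e : X -> M
    decode: Callable[[M], Y]    # d : M -> Y

    def run(self, x: X) -> Y:
        """The induced input-output function f : X -> Y."""
        m = self.encode(x)
        while not self.done(m):
            m = self.step(m)
        return self.decode(m)

# Toy example: sum a list of numbers, one step per element.
# Memory state: (remaining input, accumulator).
summer = Computation(
    step=lambda m: (m[0][1:], m[1] + m[0][0]),
    done=lambda m: len(m[0]) == 0,
    encode=lambda xs: (tuple(xs), 0),
    decode=lambda m: m[1],
)
print(summer.run([1, 2, 3, 4]))  # 10
```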


We can abstract a computation by throwing away information about its memory state. This means an abstraction is a function α : M → M' for some set of abstract states M'. Since the abstract state is a deterministic function of the full state, it can contain at most the same information as the full state, and typically less (if α is non-injective).

An arbitrary function α might not have anything to do with the "structure" of the computation, since it ignores the computational step s. To make α an abstraction of the computation defined by s, we need to be able to "mirror" the computational step within the abstraction.

We can show all of this in a commutative diagram as follows:

  M  ──s──▶  M
  │          │
  α          α
  ▼          ▼
  M' ──s'──▶ M'

s in the top row is the full computational step, and α is the abstraction mapping on memory states. For a good abstraction, there should exist a map s' : M' → M' (the bottom arrow) that makes the diagram commute. Note that we can assume without loss of generality that α is surjective, in which case s' is uniquely determined by s and α.

Commutativity of this diagram means we can either execute a step in the full computation and then abstract the state, or first abstract the state and then execute the computational step in the abstraction. You can think of this as a consistency condition.
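To illustrate the consistency condition with a tiny hand-picked example (my own, not from the post): suppose the full computation decrements an integer, and the abstraction keeps only its parity. Then the abstract step flips the parity, and the diagram commutes.

```python
def step(m: int) -> int:
    """Full computational step s : M -> M (decrement a counter)."""
    return m - 1

def alpha(m: int) -> int:
    """Abstraction: keep only the parity of the state."""
    return m % 2

def abstract_step(a: int) -> int:
    """Abstract step s' : M' -> M' (decrementing flips parity)."""
    return 1 - a

# Commutativity: step-then-abstract equals abstract-then-step.
for m in range(100):
    assert alpha(step(m)) == abstract_step(alpha(m))
print("diagram commutes on all tested states")
```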

This type of commutative diagram is used to describe abstractions in contexts other than computations as well; see Appendix: Related work for several examples, as well as my earlier posts. Appendix: Examples of abstractions of computations describes several examples of abstractions within this framework.

In practice, most abstractions will not achieve exact commutativity: they will be leaky. So a full version of this framework should formalize approximate consistency. This likely means we want a distribution over inputs or memory states, and that the abstract computational step s' will compute a probability distribution over next states instead of just a single one.

What makes some abstractions better than others?

The consistency condition from the previous section is not sufficient to get “good abstractions”. There are two trivial ways to get commutativity with clearly useless abstractions: 

  1. We can use the full computation as the abstraction, i.e. use M' = M and α = id.
  2. On the other extreme, we can use an abstraction that doesn’t track any information, by using a singleton set for M'.

We can rule out these failure modes with two additional desiderata:

  1. Require a minimal abstraction, under some measure of size or complexity.
  2. Require the abstraction to be useful for predicting specific aspects of the output that we care about.

I currently think these two desiderata, along with the consistency condition, might be enough. Admittedly, a lot of the difficulty in formalizing good abstractions is hidden in the vagueness of these goals.

It’s also possible we need additional criteria to actually get “white box” abstractions that really reflect the computational structure. But I’m hoping we can get that “for free” with the right formalization of the desiderata above.

Limitations of this framework

I’m aware of three issues with the current framework:

  1. Approximate consistency: As mentioned, the consistency condition ought to be approximate, and making that modification will introduce some additional changes, such as making the abstract computational step stochastic.
  2. Temporal abstraction: The commutative diagram above requires a one-to-one mapping between computational steps in the full and abstracted computation. But in many cases, several steps in the full computation correspond to just a single step in the abstract one. Sometimes, we can work around this in hacky ways, but even that may fail (see the Turing machines section in the appendix for a potential example). A better framework would incorporate not just abstractions of states, but also temporal abstractions, i.e. combining multiple steps into one.
  3. Encoding: There’s no mention of the “encoding” or “representation” of a computation. Forcing everything into the format of memory states and computational steps already discards some aspects. Put a different way: I think this is a good framework for thinking about problem 2. from the introduction (neural networks are too large to understand), but not as good for thinking about problem 1. (weights are an unintuitive representation for humans).

I’m optimistic that limitations 1 and 2 can be resolved by moderate extensions of the framework, leaving the key ideas intact. This is much less clear for 3, but it’s also not obvious that this should be “fixed” within the framework—maybe it’s better to think of 3 as something to be addressed completely separately.


Isn’t this just mechanistic interpretability?

Yeah, kind of. I think of it as a more theoretical approach that complements empirical interpretability research and has a similar impact story for alignment. If you want to improve circuit-style analysis, empirical work seems more promising. But my guess is that there are other useful ways of abstracting neural networks that we currently don’t make enough use of. The clearest example is looking for submodules, and there could also be less obvious ones that only a more general study of abstractions of computations will reveal.

Compared to typical interpretability research, I’d say this agenda …

  • … takes a more theoretical approach.
  • … aims more strongly at automating interpretability, with a top-down approach: some other efforts look more like automating larger and larger pieces of manual techniques. In contrast, I’m trying to figure out what interpretability could look like if we wanted to automate the entire process.
  • … puts less importance on human-understandability. Interpretability typically assumes the end goal is an explanation that humans can understand. From the perspective of abstractions of computations, it’s not obvious that’s what we should aim for—simplifying a computation could also be useful for automated verification or other applications.
  • … aims for a “breadth-first” overview of what abstractions might be out there, as opposed to diving deep into one type of abstraction that seems to work well.

That being said, mechanistic interpretability, especially some flavors such as causal scrubbing, are very closely related to this agenda. See the Related work section for more discussion.

Isn’t this just NN modularity/clusterability?

It’s meant to be more general than just modularity, though I’m currently unsure how much.

Modularity is a prototypical example of the kinds of abstractions I have in mind. Abstractions of computations often work by dividing big computations into submodules. Functions in programming languages are a clear example. I am definitely interested in applying this type of abstraction to neural networks too. However, I am hoping there might be other useful abstractions, so for now I want to stay on the lookout for those.

Does this have anything to do with natural abstractions?

Probably, but I’m unsure about the details. I discuss this a bit more in the related work section. In brief: a key similarity is the assumption that things “abstract well” and that some abstractions are “better” or “more natural” than others.[2] But in terms of the mathematical formalism, the framework I described above seems closer to certain other work than to John’s definition of natural abstractions.

What about mechanistic anomaly detection?

I haven’t thought about this much yet, but it seems to me that abstractions of computations could be a candidate for defining what it means for “mechanisms” to “explain” an output. Abstractions of programs could also be closely related to abstractions of proofs because of the Curry-Howard isomorphism. And abstractions of proofs might be a type of heuristic argument (though probably not quite the type ARC is studying). So that’s another reason to think there’s a connection here. See a bit more discussion in the related work section.

To me, mechanistic interpretability, ELK, and mechanistic anomaly detection seem to overlap a lot (this isn’t a new take of course, see e.g. Paul on why indistinguishable mechanisms would also be bad news for other agendas). I basically think of abstractions of computations as adding yet another perspective on similar problems. I don’t have clear reasons to think it’s a better perspective, it just seems about as reasonable and less explored. But it wouldn’t shock me if I just ended up converging to e.g. a mechanistic anomaly detection view.

Why abstractions of computations, and not just abstractions of neural networks?

A few responses:

  • Generalizing can sometimes make problems easier: if there is a simple general framework for abstractions of computations, then this may be easier to find than the special case for neural networks. The main reason is that if the framework has to work for more cases, that gives you more examples you can test it on.[3] It could also turn out that abstracting computations in general is too messy to be useful. If this turns out to be the case, I’ll specialize more.
  • I am putting extra weight on NN-like computations: while I often talk about “abstractions of computations” and draw on examples from different computational models, I’m especially interested in fixed computational graphs/circuits. These are still more general than feed-forward NNs, but not by nearly as much.
  • It’s unclear how much generality we need: should we focus on transformers? Feed-forward neural networks? Include networks with some recurrent structure? Even that might not be general enough if the AI as a whole has some scaffolding that connects multiple neural networks (or multiple instances of the same network). AlphaZero is an example. It’s far from obvious to me that focusing on, say, monolithic feedforward networks is sufficient.

So what’s the plan in more detail?

I’m currently thinking about two broad questions:

  1. What’s a general framework that can describe many or all of the examples of abstractions I want to fit in?
  2. What makes some abstractions more useful than others (for a particular purpose we care about)? How can I formalize this within the framework from question 1?

I outlined my current guess for question 1 above and also briefly discussed question 2. But there’s definitely lots of work to be done on both.

If I had answers to these questions, the ideal next step would be to develop efficient algorithms that can find good abstractions of a given network or other computation. There is a spectrum of different outcomes, depending on how hard this turns out to be:

  • Maybe we can fully solve this problem and develop a general and efficient abstraction-finding algorithm.
  • More plausibly, the problem of finding good abstractions turns out to be very challenging in some situations. But we can at least formally specify the problem, and either find special-purpose algorithms for practically important cases, or throw some general optimization method at it that works well enough in practice.
  • Maybe a full specification of “good abstractions” is out of reach, but at least thinking about the problem generates ideas for interpretability techniques outside of what people are currently considering.

The abstractions we find this way could either be human-understandable ones, which would mean automating interpretability, or they might be useful for an approach like mechanistic anomaly detection (as discussed earlier in the Q&A).

This is the most straightforward path to impact, but there are some speculative connections to other problems in alignment:

  • Thinking about abstractions of computations might be a useful lens to think about abstractions in general. In some sense, everything is a computation, and causal graphs provide an example of a formalism that’s used both for describing computations and modeling the real world. So perhaps, thinking more about this could lead to insights about e.g. conditions under which the natural abstraction hypothesis holds.
  • Concepts like a “search process” or “goal-directedness” are clearly helpful abstractions. And arguably, we should think of them as abstractions of computations.[4] So maybe thinking about abstractions of computations could help nail down these elusive concepts. John Wentworth made roughly that point on the AXRP podcast (and maybe elsewhere), though he talks about “mathematical abstractions” and those are probably not quite the same thing. Personally, I think if your main goal is to define these concepts, you should tackle them directly, rather than first trying to find a theory of abstractions and then using that. But it’s a nice potential side payoff, and if it does work, it might give us definitions for many concepts in one swoop.

This sounds like a difficult philosophical problem that’s really hard to make progress on.

I think that’s one of the biggest potential weaknesses of this agenda, but I don’t think it’s a decisive one. A few counterpoints (not meant to completely refute this criticism):

  • Based on a cursory literature search, I’ve found surprisingly little work that really engages with the questions I’m interested in (see Appendix: Related work). Most of the relevant work is just in a similar cluster, or looking at some special case or application. Maybe that’s because this problem is obviously really hard or misguided, but maybe there just hasn’t been that much effort to solve this problem in particular. In that case we don’t know how hard it is. One sidenote: I wouldn’t be surprised if I’ve missed a lot of existing work, and I’d be quite interested to hear about anything I have.
  • Similarly vague ideas have been successfully formalized before. Consider e.g. information theory, causality, descriptive complexity, or the idea of “symmetry”. We arguably don’t have a full understanding of any of these yet, but we have a good enough one that we can use them as tools to solve other problems.
  • Maybe solving alignment requires solving some problems that are hard in this way. I’m personally a bit on the fence about that. I would obviously like to avoid solving hard problems if possible, but there’s also a risk that avoiding anything philosophically tricky just means we end up ignoring the core problems.

What about empirical tests?

No concrete plans yet, but doing so is definitely part of the goal! At the moment, the main way to test a proposed definition of good abstractions is to check whether (a) it captures examples of good abstractions we know of, and (b) it doesn’t produce any nonsensical ones. It’s especially exciting if it produces intuitive abstractions that we didn’t think of in advance.

At some point, experiments are vital because otherwise we can at best test a definition against our intuition of what is “good”, not against what’s actually helpful in practice. My current sense is that I’ll need to work on the theoretical side a bit longer before I even have anything to implement though.

A good testing ground in the future could be Tracr. For example: write a program, compile it into a transformer, then see whether we can automatically recover the abstractions we had on the program side. I’d guess the first experiments should be a bit simpler though.

So in summary, why are you excited about this agenda?

Three main reasons:

  1. If it’s wildly successful, I think this would constitute significant progress in alignment via automating interpretability or maybe mechanistic anomaly detection.
  2. To me, this agenda feels closely connected to several promising topics in alignment, such as mechanistic interpretability, (natural) abstractions, ELK/ontology identification, and heuristic arguments. So if the specific plan I’ve outlined above doesn’t pan out, there’s a decent chance there would still be some ideas applicable to one of these. At least, I expect working on this will help me a lot if I later think about one of these other topics.
  3. Despite point 2, this particular approach seems meaningfully distinct from any of the other ones people are pursuing. (I would guess that several people have thought along the same lines before, but at least based on public writing I’ve seen, it doesn’t seem to be anyone’s main framing right now.)

Point 3 is some evidence that other approaches might be better, but I don’t put too much weight on that since the explanation of “there aren’t enough people to explore every promising avenue” seems sufficient. Still, I would be very interested to hear from people who’ve thought about something close to this agenda and then discarded it.

Points 2 and 3 are in some conflict, and I think one reasonable criticism of this agenda is that it’s in a cluster that’s already a bit crowded relative to alignment overall, and that it would be more important to look for completely different agendas.

And what criticisms are you most worried about?

All of the following seem similarly important to me:

  • <Something not on this list that I’ve just missed completely>
  • This sounds like one of these difficult philosophical problems that’s really hard to make progress on.
  • This approach is too general, there isn’t much useful stuff we can say about abstractions of large classes of computations. Or similarly: there just isn’t a nice theory of abstraction to discover here—the entire concept of “abstractions of computations” turns out to be really messy.
  • Even if you were successful in terms of finding a good framework to define abstractions, and understanding what makes some of them more useful to us, this won’t actually help. For example, the framework just tells you to do the thing interpretability researchers are already doing, or finding good abstractions efficiently isn’t possible in important cases.

Interlude: why mechanistic abstractions?

Ultimately, we care about having a model with good input-output behavior. So why do we want to mirror the internal structure of our network at all? In other words, why do we want abstractions of computations, instead of just abstractions of black-box functions? For example, say I give you a simple Python program and guarantee you that its input-output behavior is a good approximation of the neural network. But there’s a caveat: the way in which the program achieves this behavior is totally different—even at high levels of abstraction, it uses an entirely different algorithm. This doesn’t seem like a huge issue: having this program, along with the guarantee of sufficiently similar behavior, would still be massively useful.

I think this is a pretty important question for this entire agenda (and for mechanistic interpretability). Here’s my current hypothesis:

The most tractable way to get robust approximations of input-output behavior for complicated computations is to approximate their internal structure.

I won’t justify this much here, partly because I don’t have great justifications beyond “seems intuitively plausible” and “seems true in some examples”. Investigating this more is something I'm interested in.

At best, this hypothesis holds when we want a good approximation across many inputs. For example, if our goal is only to find a simpler approximation that’s valid on a narrow distribution, we could just train a simpler model from scratch. But we have no guarantees that this simpler model also generalizes similarly to the more sophisticated one.

Some potential projects

Here’s a (non-comprehensive) list of subproblems that seem mostly self-contained and of reasonable scope. If you are interested in tackling any of these, please let me know! (I can at least make sure you’re not duplicating work that I or someone else has done in the meantime, and I might be able to provide some guidance.)

Characterizing abstractions for specific computational models

The framework above gives a potential definition of abstractions, and in principle this determines all the abstractions of any given computational model. But it’s a very indirect specification, so it may be hard to work with. For example, given a binary circuit, it’s not obvious how we should think about the set of all its abstractions.

It would be nice to find other characterizations of the set of all abstractions. This seems most likely to be fruitful when first specializing to one computational model (such as circuits). If characterizing the entire set in a nice way isn’t possible, it would also be interesting to look for easy-to-think-about classes of abstractions (similar to Appendix: Examples of abstractions of computations).

This is an important direction because it can address one of my key uncertainties: are there any non-obvious useful abstractions that we should be using? Or are modularity and interpreting subcircuits the only relevant ones? Having more examples should also help with other open questions (for example, the next one).

Do behavioral approximations “factor through” mechanistic ones?

As discussed in Why mechanistic abstractions?, my hypothesis for why we care about mechanistic abstractions/interpretability is: The most tractable way to get robust approximations of input-output behavior for complicated computations is to approximate their internal structure.

A conjecture related to this is:

For any two circuits where there is a short proof/explanation of their extensional equality, this proof “factors through” some low-level abstraction, i.e. it essentially works by showing that the circuits are structurally very similar.[5]

The idea is that there is a connection between minimal proof length and “mechanistic similarity”: if two circuits are equal at a low level of (mechanistic) abstraction, it ought to be easier to prove they have the same input-output behavior, and perhaps vice versa.

This is of course very informal, and the most important part of answering this question may be getting to the point where we can formalize the conjecture, rather than actually (dis)proving it. It also wouldn’t surprise me if (any reasonable formalization of) the conjecture is just false (though I would guess it is often true for circuits/programs we encounter in practice, so it might still be true with additional assumptions).

This feels similar in spirit to the conjecture that “everything happens for a reason” (from the Heuristic arguments paper, appendix C). If I give you two boolean circuits that have exactly the same input-output behavior, should you expect that there is some structure in the circuits themselves that compactly explains this? If so, that explanation might be effectively saying “these two circuits are identical under the following abstraction”. I haven’t thought about whether there is a strong formal connection here or not.
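As a toy illustration of extensional equality (my own example, not from the post): here are two structurally different boolean circuits for XOR, checked equal by brute-force enumeration. The conjecture above asks whether short proofs of such equalities must instead go through structural similarity, since enumeration scales exponentially with the number of inputs.

```python
from itertools import product

def circuit_a(x: bool, y: bool) -> bool:
    # XOR built as "or, but not both"
    return (x or y) and not (x and y)

def circuit_b(x: bool, y: bool) -> bool:
    # XOR built as "exactly one of the two"
    return (x and not y) or (y and not x)

# Extensional equality by exhaustive check over all 2^2 inputs.
assert all(
    circuit_a(x, y) == circuit_b(x, y)
    for x, y in product([False, True], repeat=2)
)
print("circuits are extensionally equal")
```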

Test the framework on more examples

This is essentially dual to Characterizing abstractions for specific computational models: look for ways in which we intuitively want to abstract computations, then see if the framework can capture them or not. (If not, potentially adapt the framework.)

Do unique minimal abstractions always exist?

ETA: The answer to this question is "yes", see this shortform. I don't think this is important progress on the agenda (see the shortform), just want to make sure no one wastes time on this subproblem.

We can define a partial ordering on abstractions as follows: let α₁ : M → M₁ and α₂ : M → M₂ be abstractions, where M is the set of memory states and M₁, M₂ are sets of abstract states. We say that α₁ is more fine-grained (i.e. lower-level/more informative) than α₂, written α₁ ≥ α₂, if there is some map g : M₁ → M₂ such that α₂ = g ∘ α₁. In words, α₁ contains all the information we need to compute α₂.

We might want to use this ordering as a definition of “minimal” abstractions, which was one desideratum in What makes some abstractions better than others?. In other words, out of all the abstractions that satisfy the consistency condition, and that contain a certain piece of information about the output of the computation, we might want to pick the minimal one according to this ordering. However, since this is only a partial order, it’s not obvious that a unique minimal abstraction always exists, even if everything is finite. I have some preliminary results that suggest these unique minimal abstractions do in fact exist, but it’s for a slightly different setting and not yet fully worked out.
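For finite state sets, whether one abstraction is more fine-grained than another can be checked directly: a suitable map g exists iff α₁ never merges two states that α₂ distinguishes. Here is a small sketch (the function name `refines` and the toy abstractions are my own):

```python
def refines(alpha1, alpha2, states) -> bool:
    """Return True iff some g exists with alpha2 = g o alpha1,
    i.e. alpha1 is at least as fine-grained as alpha2."""
    g = {}
    for m in states:
        a1, a2 = alpha1(m), alpha2(m)
        if a1 in g and g[a1] != a2:
            return False  # alpha1 merged two states that alpha2 separates
        g[a1] = a2
    return True

states = range(16)
identity = lambda m: m
parity = lambda m: m % 2
mod4 = lambda m: m % 4

assert refines(identity, parity, states)  # full state determines parity
assert refines(mod4, parity, states)      # m mod 4 determines m mod 2
assert not refines(parity, mod4, states)  # parity doesn't determine m mod 4
```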

Even if this was successful, it probably wouldn’t be the definition of “good abstraction” we ultimately want to use. It effectively puts infinite weight on the consistency condition and on representing some information, which in practice might often lead to using the full computation as the “abstraction”. Still, it could be a step in the right direction.

Incorporate temporal abstractions and approximate consistency

As mentioned above, we might want to collapse multiple computational steps in the full model to a single step in the abstraction. The current framework doesn’t allow this. I think it should likely be extended (or, if that’s not possible, the entire framework might need to be re-assessed). Another important step is to incorporate approximate consistency, i.e. only demand that the commutative diagram commutes approximately. This will mean additional extensions of the framework, likely making abstract computational steps stochastic as well as requiring an input distribution.

Work out connections to existing work more precisely

Appendix: Related work mentions many existing notions of abstractions, many of which I suspect should fit into the framework I’ve described (or suitable modifications thereof). Working this out in more detail could be a good test, similar to Test the framework on more examples.

On a more conceptual level, I still feel confused about some aspects of how this agenda and the specific framework relate to other alignment agendas, such as John Wentworth’s work on (natural) abstractions and ARC’s mechanistic anomaly detection and heuristic arguments agenda. I think it should be possible to draw much crisper connections than I do in this post.

Final words

This agenda is in an early stage (I’ve been thinking about it for ~2 months). So if you have any feedback, either on the entire approach or on details, I’d highly appreciate it!

Also, if you’re working on similar things, or might want to start working on similar things, please feel free to reach out! (Just to chat, or to potentially collaborate.) Via LessWrong DM, for example.

Appendix: Related work

In this appendix, I’ll give a brief overview of existing research I’ve looked at. The AI alignment section below is reasonably complete, I think (though with many open questions). You should consider all the other sections a preliminary list of potentially relevant fields, rather than a full literature review.

AI alignment

John Wentworth’s work on (natural) abstractions

John Wentworth has written a lot about abstractions. (And even more.) Going through all his ideas and trying to connect them with this agenda would be a significant project in itself (though one potentially worth doing). For now, I’ll just give my most important thoughts without too much justification:

  • Lots of John’s examples are about the physical world instead of computations. I suspect this isn’t a huge deal (the physical world and computations don’t seem that different).
  • One thing that may be a bigger deal: John’s work makes heavy use of probabilities and mutual information, which are conspicuously absent in this post. Probability distributions will play a role for my framework once I formalize approximate commutativity (since that likely requires a distribution over inputs). But it’s not clear that randomness will play the same role as in John’s work. For example, if you just have fully known deterministic computations, there’s no noise that can wipe out low-level details over a distance. It’s possible though that there is a deeper version of those ideas that does carry over.
  • One connection is that what good abstractions of a computation are might depend on the input distribution to that computation, and thus on good abstractions in the real world (if that’s where inputs are coming from). For example, say a neural network has a tree detection subcircuit. If the input distribution is just white noise, it’s much less clear that thinking of this as a meaningful subcircuit is still right.[6]
  • My guess is that in the end, there’s a pretty general core of what abstractions are that applies to subroutines in programs as well as to flowers or other things in the physical world. I personally like the framing of thinking mainly about abstractions of computations, but I might also draw inspiration from other examples, and I’d hope that some of my work would transfer if I switched to thinking about abstractions in a different context.
  • In terms of the specific mathematical formalization, my framework seems quite different from John’s, at least on the surface. In fact, it’s much closer to other existing work, such as Abstractions of causal models or Bisimulation. It would be exciting if these approaches converged at some point, which seems speculative but not impossible.

Mechanistic interpretability

Mechanistic interpretability broadly construed (e.g. as in Daniel Filan's early post) seems very similar to looking for good abstractions of neural networks, except with the stronger focus of aiming for abstractions that are human-understandable. This is related to one of the limitations of my framework: I don’t think as much about how to encode things in an easy to understand way, whereas this is an important aspect of interpretability (in addition to reducing the size of the thing that needs to be understood).

In current practice, my sense is that a lot of mechanistic interpretability can be seen as finding subcircuits of a neural network that describe some small part of its behavior (see my Subsets and quotients in interpretability post). Focusing on these subcircuits throws away a lot of other information about the computational graph. For lots of subcircuits, this would leave you with basically no predictive power. Subcircuits like the indirect object identification one are interesting because they still let you make some predictions about behavior without knowing about the rest of the network, so they’re good ways of throwing away parts of the computational graph, i.e. good abstractions. Compared to this type of “circuit-style” mechanistic interpretability research, abstractions of computations differ mainly by aiming to use more general types of abstraction.

One other post I want to highlight is Daniel Filan’s Verification and Transparency. He points out how both verification and transparency serve similar purposes and how some things (such as type annotations in programming languages) can be seen as instances of both. My current guess is that abstractions of computations are the more general view that connects the two: they make things more transparent (because they are simpler) and they help with verification (again because they are simpler, as hypothesized in Do extensional approximations “factor through” mechanistic ones?).

Modularity/Clusterability of neural networks

Daniel Filan and others have worked on clusterability of neural networks, i.e. dividing the neurons in a network into natural clusters or submodules using graph cuts. There also seems to be newer work on NN modularity that I’m not familiar with yet. This survey has a brief overview. Additionally, there are several posts by the 2022 AI safety camp modularity team and more posts by 2022 SERI MATS scholars exploring how to define and look for modularity in neural networks. Thane Ruthenis has argued for focusing on interfaces between certain submodules, such as world models and search/planning modules.

As discussed in the Q&A, modularity is a key example of the type of abstraction I’m interested in, but I’m hoping to also study a broader class of abstractions.

Causal scrubbing

In the Mechanistic interpretability section, I talked informally about the “predictive power” of a subcircuit. Causal scrubbing formalizes this by defining whether an interpretation of a subcircuit “explains” a given behavior. So I think of this as formalizing one desideratum for a specific type of “good abstractions” of computation. (Other desiderata are still informal; for example, we might want the proposed interpretation to be as “simple” as possible.) I tentatively think it should be possible to model causal scrubbing as a commutativity condition in my framework but haven’t worked it out yet (doing so is on my agenda).

An important similarity between causal scrubbing and this agenda is that both are aiming at automating interpretability research. (In my case, I’m less firm on the exact end game, but this is at least one major motivation.) They also both seem like candidates for formalizing when mechanisms explain a behavior for mechanistic anomaly detection (or at least I think abstractions seem potentially applicable to that).

ARC’s mechanistic anomaly detection and heuristic arguments agenda

ARC wants to develop a method for mechanistic anomaly detection: say we have a function $f$ and an input distribution $D$ (the training distribution in an ML context). Given some new input to $f$, we would like to say whether the output of $f$ is due to the “same reasons” that explain its behavior on $D$, or due to some other “mechanism” that wasn’t necessary to explain the behavior on $D$. (See their post for an explanation of how this relates to ELK and other alignment problems.)

Developing an algorithm for mechanistic anomaly detection first requires clarifying what we mean by “mechanisms” “explaining” model behavior. ARC’s approach is trying to formalize heuristic arguments and then use those as the definition of “explanations”.

There are two reasons to think that abstractions of computations could be relevant for this agenda (and vice versa):

  • Mechanisms that can explain certain behavior sounds quite close to what I have in mind for (good) abstractions of computations. For example, subcircuits are a typical example of an abstraction of a computation (you throw away all but one part of the circuit), and they can explain specific model behaviors, as mechanistic interpretability research shows. I consider causal scrubbing an important example of what a framework for abstractions should capture, and Paul mentions causal scrubbing as a candidate for heuristic arguments too.
  • Abstractions of computations might in fact be related directly to heuristic arguments: abstractions of programs seem similar to abstractions of proofs because of the Curry-Howard isomorphism. And abstractions of proofs might be a type of heuristic argument: they are not themselves proofs, but approximate them in some sense; they are shorter than full proofs; and they might give a good high-level explanation for “why” something is true. (Even if this is the case, it’s not clear they are the kind of heuristic arguments ARC is currently working on, e.g. they might not be defeasible.)

Abstractions of causal models

Starting with Rubenstein et al. (2017), there have been a few papers on when a structural causal model (SCM) is an abstraction of another SCM. Follow-up papers are Beckers & Halpern (2019a) and Beckers & Halpern (2019b) (the latter introduces approximate abstraction), Rischel & Weichwald (2021), and Otsuka & Saigo (2022) (the last two use a category theoretical framework). Zennaro (2022) reviews these different frameworks. John Wentworth has also written a summary/review of Beckers & Halpern.

I haven’t studied all of these in detail, but as far as I can tell, the approach behind all of them is the same type of commutative diagram that I’m using (and that occurs elsewhere, e.g. in Bisimulation). The idea that’s specific to causal abstraction is that the horizontal arrows are causal interventions. The same type of diagram is also used in a recent paper by Taco Cohen to define causal models of an underlying true causal structure (he refers to this as a “naturality” condition, but doesn’t make explicit any category theoretical interpretation). In that paper, there are two horizontal arrows, one a causal intervention and one some other process that turns states into “outcomes”.

Building on these causality papers, there is also a line of work from Stanford that applies causal abstractions to neural networks. The idea is: we can think of the computational graph of a neural network as a causal model. Given some hypothesis for a higher-level causal model that we think the network might be implementing, and a specification for how the nodes of that higher-level model map into the network, we can test interventional consistency (i.e. check whether the high-level hypothesis is indeed a causal abstraction of the network). This is very reminiscent of Causal scrubbing.

Overall, causal abstractions seem highly relevant for abstractions of computations, given the connection between causal models and computational graphs.

Artificial intelligence

Abstractions in general are relevant all over the place in AI, but I’ll focus on two directions that seem potentially related to abstractions of computations specifically.

Reinforcement learning

Reinforcement learning is interesting because it combines both state abstractions and temporal abstractions (i.e. high-level descriptions of state space and high-level descriptions of actions to take). As mentioned in Limitations of this framework, it seems that my current framework is missing temporal abstraction, and RL could provide inspiration on how to combine this with (memory) state abstraction. See for example David Abel's thesis and its Background section (note that it uses the name action abstractions instead of temporal abstractions).

Logical reasoning and theorem proving

There is a 1992 paper simply called A theory of abstraction, which introduces a framework for abstractions of formal systems. Section 8 has an extensive list of examples from earlier work described in this framework, lots of it on theorem proving but also other areas of AI. One reason why abstractions for theorem proving may be interesting is the connection to heuristic arguments and mechanistic anomaly detection I discussed above.

Computer Science

Abstractions of computations are obviously a big topic in CS. It’s clear there are tons of examples of abstraction to be found; Wikipedia has an overview page. Less clear to me is to what extent people have explored what abstractions of computations are in general.

Below, I’ve collected the subfields that seem most relevant.

Programming languages

As mentioned earlier, programming languages are abstractions of lower-level computational models, such as machine code, but this is just a source of examples rather than a definition of abstractions in general. Programming languages also provide methods for creating computations with obvious abstractions. For example, functions let you decompose a program into submodules, and a very obvious abstraction is to simply specify what some of these functions do instead of how they work. Even this seems mostly just like a source of examples though. Individual programming languages will have less to say on the subject of what kind of thing abstractions are in general.[7] However, the field of programming language design might have things to say about that, since it is essentially in the business of finding good abstractions of computational models.

Formal specification

One important type of abstraction I discussed is to “black box” parts of a computation, i.e. forget how that part is computed and just describe its input-output behavior. One question is then: how should the input-output behavior be represented? We probably don’t want to actually use a giant lookup table (and we shouldn’t have to if the program implements some conceptually simple behavior). Formal specification could hold an answer to that subproblem.

Abstract interpretation (and other uses of abstraction for model checking)

Based on a quick glance, this seems highly relevant, but I haven’t spent much time looking into it yet (in particular, whether the formalism there is promising as a general definition of abstractions).


Bisimulation

Bisimulation is a notion of intensional/mechanistic equivalence between transition systems, i.e. it describes when two transition systems should be considered “the same” in terms of their computational structure. Probably even more important for this agenda is the one-sided version of this: one system simulating another one (but not necessarily vice versa). This seems like a good candidate for abstractions specifically of transition systems.
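To make the one-sided notion concrete, here is a minimal sketch of checking whether a given relation is a simulation between two finite labelled transition systems. The function name and the toy systems are my own illustrative choices.

```python
# Sketch: a relation R is a simulation from system 1 to system 2 iff whenever
# (s, t) in R and s --l--> s', there is some t --l--> t' with (s', t') in R.
# Transition systems are lists of (source, label, destination) triples.

def is_simulation(R, trans1, trans2):
    for s, t in R:
        for src, label, dst in trans1:
            if src != s:
                continue
            # every move of system 1 must be matched by system 2
            if not any(src2 == t and lab2 == label and (dst, dst2) in R
                       for src2, lab2, dst2 in trans2):
                return False
    return True

t1 = [("a", "x", "b")]                   # system 1: a --x--> b
t2 = [("p", "x", "q"), ("q", "x", "q")]  # system 2: p --x--> q, with a loop on q
assert is_simulation({("a", "p"), ("b", "q")}, t1, t2)
assert not is_simulation({("a", "q"), ("b", "p")}, t1, t2)
```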

People have studied generalizations of (bi-)simulation using category theory (“coalgebraic bisimulation”) and this was in fact what led me to think about abstractions using (co)algebras.

Plagiarism detectors

If you want to tell whether a piece of software plagiarizes another one, you probably don’t just want to check whether their source code is literally the same. Instead, you’re more interested in whether they’re equivalent at some higher level of abstraction. So I’d hoped that plagiarism detectors for code might contain some interesting ideas for defining “mechanistic similarity” of programs (h/t to Stuart Russell and Shreyas Kapur for independently making that suggestion). Unfortunately this doesn’t look too promising based on a cursory exploration: it seems plagiarism detectors often focus on efficiently checking whether two programs have subsnippets that match exactly. They do often “normalize” source code by removing whitespace etc., but that happens in an ad-hoc way that differs from method to method.


Philosophy

I assume that philosophers have thought, if not about exactly my question, then at least about related ones. However, this is one of the many things I haven’t gotten around to looking into yet. I expect this might be less helpful than more mathematical approaches like those in causal abstraction, but it still seems worth checking.

Appendix: Examples of abstractions of computations

In this appendix, we’ll look at some examples of abstractions within the framework outlined above. Most of them are in circuits, though we also briefly discuss Turing machines. This part may be a bit harder to follow than the rest of the post since I’ll only give rough descriptions of how these examples work.


Circuits

Circuits are directed acyclic graphs where each node is equipped with …

  • … a domain of possible output values, and
  • … a function that takes outputs from all parent nodes and produces a value in this node’s output domain.

To turn this into a concrete model of how computations are performed step by step, we also choose a topological ordering of the circuit.[8] The state/memory of the computation assigns to each node either an output value or an “undefined” label. Initially, all nodes except inputs are undefined. Then we go over nodes in the chosen topological order and determine one new value in each computational step.
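The model above can be sketched in code. The encoding (a list of `(name, parents, fn)` triples in topological order, with `None` for “undefined”) is my own illustrative choice, not from the post.

```python
# Sketch of the step-by-step computational model: a circuit is a list of
# (name, parents, fn) triples in topological order; a memory state maps each
# node to a value, or to None ("undefined").

def run(circuit, inputs):
    """Yield the memory state after each computational step."""
    state = {name: inputs.get(name) for name, _, _ in circuit}
    yield dict(state)
    for name, parents, fn in circuit:
        if state[name] is None:  # input nodes are already defined
            state[name] = fn(*(state[p] for p in parents))
            yield dict(state)

# x AND y, implemented as NOT(NOT x OR NOT y)
circuit = [
    ("x", [], None), ("y", [], None),
    ("nx", ["x"], lambda a: not a),
    ("ny", ["y"], lambda a: not a),
    ("o", ["nx", "ny"], lambda a, b: a or b),
    ("out", ["o"], lambda a: not a),
]
steps = list(run(circuit, {"x": True, "y": True}))
assert steps[0]["out"] is None   # initially undefined
assert steps[-1]["out"] is True  # True AND True
```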

Black box

Let’s start by considering the trivial “black box” abstraction that exists for any computation (not just circuits). This abstraction takes a memory state and maps it to the final output of the circuit (by running through the entire remaining computation). The abstract computational step (the top arrow in the consistency diagram) is simply the identity function. This makes the consistency diagram commute perfectly.
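This can be checked directly on a toy circuit (the encoding as `(name, parents, fn)` triples in topological order is my own illustrative choice): mapping each intermediate memory state to the final output gives the same answer at every step, so the identity abstract step makes the diagram commute.

```python
# Sketch of the "black box" abstraction: alpha finishes the computation from
# any memory state; since the abstract step is the identity, consistency means
# alpha is constant along a run. Undefined values are None.

def alpha(circuit, state):
    """Map a memory state to the final output by running the rest of the circuit."""
    state = dict(state)
    for name, parents, fn in circuit:
        if state[name] is None:
            state[name] = fn(*(state[p] for p in parents))
    return state[circuit[-1][0]]  # last node in topological order is the output

def memory_states(circuit, inputs):
    """Yield the memory state after each computational step."""
    state = {name: inputs.get(name) for name, _, _ in circuit}
    yield dict(state)
    for name, parents, fn in circuit:
        if state[name] is None:
            state[name] = fn(*(state[p] for p in parents))
            yield dict(state)

circuit = [
    ("x", [], None), ("y", [], None),
    ("a", ["x", "y"], lambda u, v: u and v),
    ("out", ["a", "x"], lambda u, v: u or v),
]
outputs = [alpha(circuit, s) for s in memory_states(circuit, {"x": True, "y": False})]
assert all(o == outputs[0] for o in outputs)  # the diagram commutes exactly
assert outputs[0] is True
```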

Quotienting output space

We can throw away even more information. Namely, instead of mapping memory states to the output, we can map them to some equivalence class of outputs. For a binary circuit, we may be interested e.g. in only the parity of all the output bits. In an arithmetic circuit, perhaps we only want the output up to some precision. Again, this abstraction is applicable to any computational model, not just circuits.[9]


Subcircuits

Next, let’s start keeping some information about the actual circuit, instead of treating it as a black box. One key class of examples are “subcircuit” abstractions. This simply means we keep the values of some of the nodes while forgetting all the others. To perform computational steps within our abstraction, we might sometimes need input values we don’t have (if there are nodes we represent in the abstraction whose parents we discarded). There are different choices we could make here, for example setting these missing inputs to zero (or some other default value that makes sense for the types of values the circuit works with). This is essentially the same as the choice of ablations in mechanistic interpretability. It seems likely that resampling ablations (as used by causal scrubbing) are generally a better idea than e.g. zero ablations. This requires access to some input distribution, but that seems necessary anyway in order to talk about approximate consistency, as discussed in the main post.
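As a sketch (with my own illustrative encoding: nodes as `(name, parents, fn)` triples in topological order), a subcircuit abstraction with zero-ablation keeps values only for a chosen set of nodes and substitutes a default value whenever a kept node has a discarded parent:

```python
# Sketch of a subcircuit abstraction with zero-ablation: compute values only
# for the nodes in `kept`, ablating discarded parents to a default value
# (analogous to zero ablations in mechanistic interpretability).

def ablated_run(circuit, inputs, kept, default=False):
    """Compute values for kept nodes only, ablating discarded parents."""
    state = {}
    for name, parents, fn in circuit:
        if name not in kept:
            continue
        if not parents:
            state[name] = inputs[name]
        else:
            state[name] = fn(*(state.get(p, default) for p in parents))
    return state

# out = (x AND y) OR z; keep the AND-subcircuit and ablate z to False
circuit = [
    ("x", [], None), ("y", [], None), ("z", [], None),
    ("a", ["x", "y"], lambda u, v: u and v),
    ("out", ["a", "z"], lambda u, v: u or v),
]
abstract = ablated_run(circuit, {"x": True, "y": True, "z": True},
                       kept={"x", "y", "a", "out"})
assert abstract["out"] is True  # the kept subcircuit alone predicts the output here
```

With a resampling ablation, `default` would instead be drawn from the node’s values on other inputs from the distribution, which is why an input distribution becomes necessary.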

Quotienting the circuit

Another important type of abstraction is the quotient of a circuit, which corresponds to a division into submodules. I think of this as black-boxing not the entire circuit, but only certain subcircuits. This abstraction is defined by a partition of the set of nodes (as opposed to specifying a single subset of nodes, as in the subcircuit abstraction we just discussed). The intuitive version of what we want to do is the following: we construct a new circuit by “quotienting out” the partition, i.e. turning each equivalence class of nodes into a single node, which implements the function previously implemented by the entire equivalence class. Note that this does put some restrictions on what the partition can be (we need the quotient circuit to still be acyclic).

To formalize this within the current framework, we need to decide how to map memory states of the full circuit to memory states of the quotient circuit. One simple way is to map an equivalence class in which every node is computed to the appropriate output values, and to map any equivalence class with any “undefined” values to “undefined”. To be able to mirror computational steps in the quotient circuit, we also need to add a memory field in the abstraction that tells us whether each equivalence class is about to be “completed” (e.g. just store the current step of the topological sort, which can be increased by 1 in each abstract computational step). That seems a bit inelegant and could be handled more sensibly with a more general framework that allows temporal abstractions.
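The state-mapping half of this construction can be sketched as follows (names are mine; the extra “completion counter” field discussed above is omitted for brevity):

```python
# Sketch of the state map for a quotient abstraction: an equivalence class
# maps to the tuple of its nodes' values once all of them are computed, and
# to None ("undefined") otherwise.

def quotient_state(partition, state):
    out = {}
    for cls, nodes in partition.items():
        vals = tuple(state[n] for n in nodes)
        out[cls] = vals if all(v is not None for v in vals) else None
    return out

# Nodes x, y, a, out, partitioned into an "inputs" module and a "body" module.
partition = {"inputs": ["x", "y"], "body": ["a", "out"]}
state = {"x": True, "y": False, "a": False, "out": None}  # mid-computation
assert quotient_state(partition, state) == {"inputs": (True, False), "body": None}
```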

Associative trees

I’ll finish with a somewhat less central example, for some variety. Say I have a boolean circuit that computes the AND of $n$ inputs. A simple way to do this would be a binary tree of AND gates. But you could also imagine other implementations that compute the same function using only NOT and OR. So a reasonable abstraction is “a tree using only AND gates”, which tells us something about the algorithm but ignores the details of how the tree is structured (e.g. is it a balanced tree with depth $\log_2 n$, or a lopsided one with depth $n - 1$?)

One reason I’m interested in this example is that I initially thought it would be hard to describe in my current framework. But here’s an approach that I think should work: given the memory state of the circuit, i.e. node values for some prefix of the topological order, we can determine the current “frontier” of nodes, i.e. all the nodes whose values we still need in order to compute future values. Our abstraction will be to take the AND of all the values on this frontier. For a tree of AND gates, this is always the same as the output of the circuit, so we can use the identity as the computational step within this abstraction. Also note that the abstraction will be the same for any tree of AND gates. On the other hand, assume there is a NOT gate at some point in the circuit. Then for some inputs, the AND along the “frontier” will flip once we reach that NOT gate. The abstraction is thus different from the pure AND tree (and also no longer satisfies the consistency condition).
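Here is a sketch of the frontier abstraction, again with my own illustrative encoding (nodes as `(name, parents, fn)` triples in topological order, undefined values as `None`):

```python
# Sketch: the abstraction ANDs all defined values that are still needed by
# undefined nodes; for a pure tree of AND gates this equals the final output
# at every step, so the identity abstract step satisfies consistency.

def frontier_and(circuit, state):
    needed = set()
    for name, parents, _ in circuit:
        if state[name] is None:
            needed.update(p for p in parents if state[p] is not None)
    if not needed:  # computation finished: the frontier is the output itself
        return state[circuit[-1][0]]
    return all(state[p] for p in needed)

def memory_states(circuit, inputs):
    """Yield the memory state after each computational step."""
    state = {name: inputs.get(name) for name, _, _ in circuit}
    yield dict(state)
    for name, parents, fn in circuit:
        if state[name] is None:
            state[name] = fn(*(state[p] for p in parents))
            yield dict(state)

AND = lambda u, v: u and v
tree = [  # balanced tree of AND gates over four inputs
    ("x1", [], None), ("x2", [], None), ("x3", [], None), ("x4", [], None),
    ("a", ["x1", "x2"], AND), ("b", ["x3", "x4"], AND), ("out", ["a", "b"], AND),
]
inputs = {"x1": True, "x2": True, "x3": False, "x4": True}
values = [frontier_and(tree, s) for s in memory_states(tree, inputs)]
assert all(v == values[-1] for v in values)  # identity abstract step is consistent
assert values[-1] is False
```

Replacing one of the AND gates by, say, an OR would make `values` change partway through the run, breaking the consistency check, which matches the NOT-gate argument above.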

The same approach can be applied to gates other than AND of course, as long as they’re associative (which ensures that all trees consisting only of that gate type are extensionally equal).

Turing machines

Consider a universal Turing machine $U$, i.e. a Turing machine that takes as input an encoding of some Turing machine $M$ and an input $x$ to $M$, and then outputs what $M$ would output on $x$. If we fix a Turing machine $M$ and “partially apply” $U$ to $M$, we get a Turing machine that’s extensionally equal to $M$ (except that its input/output may be encoded in a different way). This Turing machine essentially simulates $M$ within $U$. Intuitively, $M$ itself is a very natural abstraction of this more complicated computation.

Typical constructions of universal Turing machines encode the state and tape of the simulated machine $M$ in a pretty straightforward way (see e.g. this one). That means it’s easy to map from the state & tape of $U$ to the corresponding state & tape of $M$, which gives us our abstraction. To mirror computational steps, we essentially just want to do computations in $M$ as usual.

There’s one obstacle here, which we already discussed in Limitations of this framework: multiple computational steps in $U$ correspond to a single step in $M$. I’m not sure whether there’s a hack to work around this, like there was for Quotienting the circuit. In any case, this is a strong hint that temporal abstraction should be part of the framework.

One other open question is to what extent this works for arbitrary universal Turing machines. While I would guess that any explicit construction of UTMs encodes the state of the simulated machine, it's not obvious to me that this is necessary. You could for example imagine a universal Turing machine that first transforms the simulated machine into some different, extensionally equal one, and then simulates this transformed machine. Even then, the transformed machine might have to be mechanistically quite similar to the original one, so we would still get a good abstraction at some higher level. I’m not sure if this can ever break down completely. (If it does, that’s not necessarily a counterexample to the framework—it seems quite good for this framework to be able to detect whether a universal Turing machine works the “normal” way or does something crazy.)

  1. ^

    We have to either restrict ourselves to computations that terminate on all inputs, or have a default output for computations that don’t terminate.

  2. ^

    I think there probably isn’t a clean division into “natural” and “not natural”, so maybe John and I aren’t quite on the same page here—I’m not sure what exactly his take on this is.

  3. ^

    For a closely related point, see my Too much structure post.

  4. ^

    This seems likely in the case of search processes. Goal-directedness is less clear, you could go for purely behavioral definitions as well.

  5. ^

    A weaker and easier to formalize version would be: in any such case, there exists a proof of similar length that shows equality up to an abstraction (but which can be different from the original proof).

  6. ^

    Though I do also have a fairly strong intuition that there is something input-independent we can say—the presence of a tree detector is extremely unlikely in a random network, and seeing one is a strong hint that there were trees in the training distribution. So I shouldn’t need to tell you what the input distribution is for you to say “this is an interesting subcomponent of the network”.

  7. ^

    An exception might be if some languages are specifically designed to make it easy to build new abstractions and work with them?

  8. ^

    Alternatively, we could compute values in “maximal layers”, I haven’t checked yet whether that changes the picture much.

  9. ^

    Things seem slightly messier if we want to only represent the circuit as a black box on a subset of inputs, i.e. “forget” the behavior on other inputs. This is possible under the computational model where we keep around old node values. But if we introduce “garbage collection” and throw away inputs once they’re no longer needed, it might be impossible to determine whether a memory state came from one of the inputs we’re interested in or not. I’m not entirely sure how to interpret this: it could mean that this abstraction is in fact somewhat weird from a mechanistic perspective, or that this framework isn’t as clean as one might hope, and the precise computational model matters a lot. Perhaps a better approach is to handle subsets of inputs by restricting the input distribution that we’ll probably want to introduce at some point.
