This work was supported by OAK, a monastic community in the Berkeley hills. It could not have been written without the daily love of living in this beautiful community. The work involved in writing this cannot be separated from the sitting, chanting, cooking, cleaning, crying, correcting, fundraising, listening, laughing, and teaching of the whole community.

This write-up benefited from feedback from David Kristofferson, Andrew Critch, Jason Crawford, Abram Demski, and Ben Pence. Mistakes and omissions are entirely the responsibility of the author.


How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution?

We study search and design as distinct approaches to engineering. We argue that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced. We present a model of design as alternating phases of construction and factorization, resulting in artifacts composed of subsystems that are paired with helpful stories. We connect our ideas to the factored cognition thesis of Stuhlmüller and Christiano. We also review work in machine learning interpretability, including Chris Olah’s recent work on decomposing neural networks, Cynthia Rudin’s work on optimal simple models, and Mike Wu’s work on tree-regularized neural networks. We contrast these approaches with the joint production of artifacts and stories that we see in human design. Finally we ponder whether an AI safety research agenda could be formulated to automate design in a way that would make it competitive with search.

Introduction

Humans have been engineering artifacts for hundreds of thousands of years. Until recently, we seem to have mostly solved engineering problems using a method I will call design: understanding the materials at hand and building things up incrementally. This is the approach we use today when building bridges, web apps, sand castles, pencil sharpeners, air conditioners, and so on.

But a new and very different approach to engineering has recently come online. In this new approach, which I will call search, we specify an objective function, and then set up an optimization process to evaluate many possible artifacts, picking the one that is ranked highest by the objective function. This approach is not the automation of design; its internal workings are actually nothing like those of design, and the artifacts it produces are very unlike the artifacts produced by design.

The design approach to engineering produces artifacts that we can understand through decomposition. A car, for example, is decomposable into subsystems, each of which is further decomposable into parts, and so on down a long hierarchy. This decomposition is not at all simple, and low-quality design processes can produce artifacts that are unnecessarily difficult to decompose, yet understanding how even a poorly designed car works is much easier than understanding how a biological tree works.

When we design an artifact, we seem to factorize it into comprehensible subsystems as we go. These subsystems are themselves artifacts resulting from a design process, and are constructed not just to be effective with respect to their intended purpose, but also to be comprehensible: that is, they are structured so as to permit a simple story to be told about them that helps us to understand how to work with them as building blocks in the construction of larger systems. I will call this an abstraction layer, and it consists of, so far as I can tell, an artifact together with a story about the artifact that is simultaneously simple and accurate.

Not all artifacts permit a story that is both simple and accurate. An unwieldy artifact may not be understandable except by considering the artifact in its entirety. In design, we do not produce artifacts and then come up with stories about them afterwards, but instead we seem to build artifacts and their stories simultaneously, in a way where the existence of a simple and accurate story to describe an artifact becomes a design goal in the construction of the artifact itself.

Search, on the other hand, does not operate on the basis of abstraction layers. When we use a machine learning system to search a hypothesis space for a neural network that correctly differentiates pictures of dogs from pictures of cats, the artifact we get back is not built up from abstraction layers, at least not nearly to the extent that artifacts produced by human design are. And we wouldn’t expect it to be, because the search processes used in machine learning have neither the intent nor the need to use explicit abstraction layers.

The absence of abstraction layers in artifacts produced by search is no impediment to the effectiveness of search in finding a solution to the specified problem. But this absence of abstraction layers is an impediment to our human ability to trust the artifacts produced by search. When I find some computer code on stackoverflow for solving some problem, I may copy it and use it within my own program, but not before understanding how it works. Similarly, when the FAA approves a new aircraft for flight, it does so not just on the basis of empirical tests of the new aircraft, but also on the basis of a description, provided by the manufacturer, of each of the aircraft's subsystems and how they work. For simple artifacts we may be willing to establish trust on the basis of empirical tests alone, but for sophisticated artifacts, such as an aircraft or autonomous car or artificial intelligence, we must be able to understand how they work in order to decide whether to trust them.

Design produces decomposable artifacts, for which trust can be built up by reading the stories attached to each abstraction layer. We can verify these stories by further decomposing the subsystems underneath each abstraction layer. Search does not produce decomposable artifacts and for that reason we have no way to establish trust in the artifacts produced by it, except by black-box testing, which is only appropriate for simple artifacts.

Unfortunately, we have discovered how to automate search but we have not yet discovered how to automate design. We are therefore able to scale up search by bringing to bear enormous amounts of computing power, and in this way we are solving problems using search that we are not able to solve using design. For example, it is not currently known how to use design to build a computer program that differentiates cat pictures from dog pictures, yet we do know how to solve this problem using search. We are therefore rapidly producing artifacts that are effective but not trustworthy, and we are finding ourselves tempted by economic incentives to deploy them without establishing trust in them.

The machine learning sub-field of explainability (and its related but distinct cousin, interpretability) is concerned with establishing trust in artifacts produced by search. There is important work in this area of relevance to the thesis presented in this paper, and we review some of it in a later section. Overall, work in this field seems to be concerned with one of the following:

  • Models that are simple enough that they can be comprehended without any accompanying stories

  • Manual decomposition of trained models

  • Post-hoc explanations that aim to persuade but may not be accurate depictions of what’s really going on.

All three of these are distinct from our view that comprehensibility comes from the pairing of artifacts with helpful stories about them.

As we begin to construct sophisticated AI systems using search, we venture into dangerous territory. We have the tools of machine learning that may soon be capable of producing AI systems that are highly effective, yet for which we have no way to establish trust. It will be a great temptation for us to deploy them without establishing trust in them, since enormous economic prizes are on offer.

To resolve this, we here propose that we should work to automate design to such a point that design can scale with computing power to the same extent as search. Under this approach we would investigate the nature of the design process that humans use to construct artifacts based on abstraction layers, and attempt to automate it. We may find it possible to automate the whole process of design, or we may automate just part of the process, leaving humans involved in other parts.

The remainder of this document is organized as follows. First we work through a simple example to make clear the distinction between search and design. Then we describe a model of design as alternating construction and factorization. Following this we argue that search can be viewed as a construction process lacking any factorization step. We then work through definitions of the terms abstraction layer and comprehensible artifact upon which much of the material in this report is predicated. We draw a connection to the factored cognition thesis, and also to the machine learning fields of explainability and interpretability. We conclude by sketching an AI safety research agenda that would aim to allow comprehensible design to scale with computing power, in order for it to become competitive with search as a process for developing advanced artificial intelligence systems. In an appendix we present notes from an informal inquiry in which we observed the author’s own process of writing software over a few hours and compared what we saw to the model presented in this document.

Example: Sorting integers

Suppose I need an algorithm that sorts lists of integers from smallest to largest, but I'm not aware of any good sorting algorithms. Suppose it's critical that the algorithm correctly sort integers in all cases, but that I've never encountered the computer science field of sorting, so I don't have any of the concepts or proofs found in computer science textbooks. I consider two approaches:

  • Train a neural network to sort integers

  • Write a sorting algorithm from scratch in Python

Let's consider the machine learning approach. This problem has some attractive characteristics:

  • I already understand what it means for a sequence of integers to be sorted, so I can provide a perfectly consistent training signal

  • I can easily generate unlimited training examples, so I immediately have access to an infinitely large dataset (see the sketch after this list)

  • I have a precise understanding of the range of possible inputs, so although I cannot train on every possible sequence of integers, I can at least be confident that I have not missed anything in my own understanding
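To make the second point concrete, here is a minimal sketch in Python of how such a training set might be generated (the length and integer bounds are arbitrary illustrative choices):

import random

def make_example(max_len=10, max_int=1000):
    # A random list of integers; the label is simply the sorted version of it,
    # so the training signal is perfectly consistent by construction.
    xs = [random.randint(-max_int, max_int) for _ in range(random.randint(1, max_len))]
    return xs, sorted(xs)

# As many perfectly labelled examples as we care to generate.
training_data = [make_example() for _ in range(100_000)]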

Now let’s say that I succeed in training a neural network to sort integers, and I test it on many test cases and it works flawlessly. Am I ready to deploy this in a safety-critical scenario where a single incorrect output will lead to the death of many living beings?

Those familiar with practical machine learning will shudder at this point. A neural network is an unwieldy thing. It is composed of millions or perhaps billions of parameters. No matter how many test cases I run and verify, there are infinitely more that I didn’t try, and in fact there is always a maximum length list and a maximum size integer among my test set — can I really trust that the neural network will correctly sort integers when the length of the list or the size of the integers greatly exceeds this maximum tested size? It is difficult to convey just how uncomfortable I would feel in making the leap from "I tested this on this many test cases" to "I’m ready to deploy this and swear on my life that it absolutely will sort integers correctly in all cases".

If I was given the task of determining whether a neural network correctly implemented sorting, what I would actually do is the following. I would examine the operations and coefficients contained in each layer of the network and attempt to extract an understanding of what each piece was doing. I would feed examples into the network and watch how each integer was processed. I would try to watch closely enough to get some insight into how the network was functioning. I would likely end up jotting down little fragments of pseudocode as I unpacked progressively larger subnetworks, then I'd try to understand the network as a whole on the basis of the pseudocode I'd written down. Ultimately, if I did succeed at extracting an understanding of the network, I’d need to decide whether or not the network constituted a correct sorting algorithm, and I might use some formal or informal verification method to do this.
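As a rough sketch of that inspection process, assuming a hypothetical interface in which the trained network's layers are exposed as weight matrices, biases, and activation functions (none of these names come from any particular framework):

import numpy as np

def inspect(layers, example):
    # 'layers' is assumed to be a list of (weights, bias, activation) triples
    # extracted from the trained network.
    x = np.asarray(example, dtype=float)
    for i, (W, b, activation) in enumerate(layers):
        x = activation(W @ x + b)
        # Record the intermediate activations so that we can try to tell a story
        # about what this layer is doing (e.g. "compares adjacent elements").
        print(f"layer {i}: weights {W.shape}, activations {np.round(x, 2)}")
    return x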

The point is that to really trust that this network will work in all cases, I would want to decompose it and understand it piece by piece.

Now in this example we have taken a problem (sorting integers) that has a known direct solution, and we have compared that direct solution to learning a solution from training data. It is no surprise that the direct solution is the better choice. It is not that we should replace all instances of machine learning with direct solutions: the whole point of machine learning is that we can solve problems for which we do not have a direct solution. Instead, the point of this example is to provide an intuition for where our trust in artifacts comes from, and specifically why it is so important that artifacts are structured in a way that supports us in decomposing and understanding them.
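For contrast, the direct solution might look something like the sketch below: a textbook insertion sort whose docstring is exactly the kind of simple, accurate story that lets a reader check the artifact piece by piece.

def insertion_sort(xs):
    """Return a new list containing the integers of xs in ascending order.

    The story: at the start of each outer iteration, result[:i] is already
    sorted; the inner loop slides element i left into its correct position,
    so when the outer loop finishes the whole list is sorted -- for every
    input, of every length.
    """
    result = list(xs)
    for i in range(1, len(result)):
        j = i
        while j > 0 and result[j - 1] > result[j]:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result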

Design as construction and factorization

In design, we build a thing up to meet some requirements, based on an understanding of the materials we have to work with. For example, when we build a shed, we have wood for framing, wood for sheathing, concrete to anchor it into the ground, and so on. We don’t need to know the details of how the two-by-four sections of wood were cut from their source material, or how they were transported, or how they were priced by the store that sold them: there is an abstraction layer upon which we can consider the two-by-fours as basic building blocks with a few known properties: that they are strong enough to support the structure of a shed; that they can be cut to arbitrary lengths; that they can be cut at an angle; that they cannot be made to bend; and so on.

Similarly, when I write some software that uses the Postgres database software to store and retrieve information, I do not need to understand the full internal workings of Postgres. I can consider the database system to be a kind of material that I am working with, and a good database system is designed such that I can understand how to use it without understanding everything about it. I know that when I execute such-and-such an SQL statement, a "row" gets added to a "table", and when I execute such-and-such an SQL statement, I get that same "row" back. But the concepts of "row" and "table" are high-level stories that we use because they are helpful, not because they give a complete description of which bits will be in which of the computer’s memory locations at which time.
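A minimal illustration of that story, sketched in Python with the standard-library sqlite3 module standing in for Postgres (the table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# The story: this statement adds a "row" to a "table"...
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))

# ...and this statement gives me that same row back. Which bits ended up in
# which memory locations along the way is no part of the story.
print(conn.execute("SELECT id, name FROM users").fetchall())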

One part of design consists of using existing materials to build something. We might call this "construction". If I write a Python program containing a single "main" function that pulls in a bunch of libraries and does some computation, then I’m doing construction. But very quickly I will start to factor my program out into functions so that I can more easily test it, grow it, and understand it in a way that allows me to see that it is correct. We might call this "factorization". In factorization, I’m looking for ways that I can take, for example, a relatively complicated gradient descent algorithm and say "This chunk of code finds a local minimum of a convex objective. You must give it an objective function and a starting point for optimization and it will give you back the optimum of the function." In this way I can stop remembering the details of the implementation of the algorithm, which are many and can be very subtle, and instead remember only the story about the implementation. This helpful story about an artifact, together with the artifact itself, is what I call an abstraction layer.
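A sketch of what such an abstraction layer might look like in code, with the docstring playing the role of the story (the optimizer itself is a deliberately crude one-dimensional gradient descent, not a recommendation):

def minimize(objective, x0, step=0.01, eps=1e-6, tol=1e-6, max_iters=100_000):
    """Find a local minimum of a convex objective.

    Give it an objective function and a starting point, and it gives you
    back (approximately) the optimum of the function. The many subtle
    implementation details live below this story.
    """
    x = x0
    for _ in range(max_iters):
        gradient = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        if abs(gradient) < tol:
            break
        x -= step * gradient
    return x

# Usage: the minimum of (x - 3)^2 is at x = 3.
print(minimize(lambda x: (x - 3.0) ** 2, x0=0.0))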

We are familiar with looking for simple explanations of existing things from our theory of epistemics. This is different. Here we are constructing something in a way that makes it possible to tell a simple story about it. We are designing the artifact together with an explanation/concept/story about it all within the same process. This makes it possible to do sophisticated engineering much more quickly than if we built unwieldy artifacts and then tried to come up with post-hoc stories about them, because we design our artifacts with the goal of making them comprehensible. We have been working for centuries to tease apart the mechanisms making up, for example, biological trees (and we are not done yet), whereas we have successfully built search engines comprising billions of lines of code in just a few years.[1]

Construction and factorization proceed in a loop. When I begin a new software project, I often work entirely within a single function for a while. Once I have a few basic pieces in place, I run them in order to test that they work and find out more about the materials I’m working with — for example, I might write code that performs a call to a remote API and then prints the server’s response in order to find out some things about it that are not covered in the documentation. Once this is all working I might factor this code out into smaller functions with succinct documentation strings such as "performs such-and-such an API call, does such-and-such upon error, returns data in such-and-such a format". Then I return to construction and start putting more pieces in place, perhaps using my existing factorized code as building blocks in the construction of larger pieces.
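One turn of that loop might look like the following sketch, with a hypothetical HTTP endpoint standing in for the remote API (the URL, response fields, and error handling are all illustrative, not taken from any real service):

import json
import urllib.request

def explore_items_api():
    # Construction: call the API and print whatever comes back, to learn things
    # about the material that the documentation doesn't cover.
    raw = urllib.request.urlopen("https://api.example.com/v1/items").read()
    print(json.loads(raw))

# Factorization: once the shape of the response is understood, wrap the call
# in a function whose docstring is the only thing I now need to remember.
def fetch_items():
    """Performs the /v1/items API call, raises RuntimeError if the response
    does not have the expected shape, and returns the items as a list of dicts."""
    data = json.loads(urllib.request.urlopen("https://api.example.com/v1/items").read())
    if "items" not in data:
        raise RuntimeError("unexpected response shape")
    return data["items"]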

When there is too much construction and not enough factorization, we end up with unwieldy artifacts. We forget about the details of how the artifacts work and begin to mis-use them, introducing bugs. There is a kind of proliferation, like spaghetti code, or like a house that has had electrical cables strung endlessly from place to place without ever removing older cables. It becomes impossible to make sense of what’s going on. As we start counting the cables and trying to make sense of things, we get lost and forget about the cables we counted at the beginning. What is happening here is that we have a highly complex artifact without handholds, and it is very difficult to design things on top of it because in order to do so we need to somehow fit some story about the artifact into our minds, and human minds cannot work with arbitrarily unwieldy stories. Software engineers experience great pain and frustration in encountering such unwieldy artifacts, and their work tends to slow to a crawl as they give up on having any clear insight into what’s going on and instead proceed by dull trial-and-error[2].

It is also possible for there to be too much factorization and not enough construction. Sometimes in software companies there will be an attempt to pre-factorize a system before anything at all has been constructed. The way this plays out is that we come up with some elaborate set of stories for abstraction layers before we really understand the problem, and begin implementing these abstraction layers. We forget that it is the process of construction that gives us evidence about the materials we are working with, and also about the solution we are seeking. We try to do all the construction in our minds, imagining what the artifact will ultimately look like, and pre-factorize it so that we don’t ever have to backtrack during design. But this over-factorization fails for fundamentally the same reason that over-construction fails: we cannot fit very much complexity into our minds, so our imagined picture of what construction will yield is inaccurate, and therefore our factorization is based on inaccurate stories.

So there is a design loop between construction and factorization in which construction gives us evidence about the nature of the materials and factorization rearranges our construction into a form that permits a compact description of what it is and how it works. In a healthy design loop, construction and factorization are balanced in a way that strictly bounds the amount of complexity that we need to fit into our minds to understand any subset of the system, while maximizing the inflow of evidence about the materials we are working with.

Search as construction without factorization

Search is pure construction, with no factorization. In search, we set up an engineering problem in a way that allows us to perform massive experimentation: trying millions of possible solutions until we find one that meets our requirements. The basic search loop consists of construction and evaluation. In machine learning, for example, evaluation corresponds to computing the objective function and its gradient, while construction corresponds to updating our artifact by moving it a little in the direction suggested by the gradient of the objective function.
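Sketched for a one-parameter model, the whole loop fits in a few lines; note that nothing in it ever asks whether the evolving artifact is comprehensible:

def search(objective, gradient, theta, step=0.1, iters=1000):
    for i in range(iters):
        score = objective(theta)                 # evaluation: how good is the artifact?
        theta = theta - step * gradient(theta)   # construction: nudge it and continue
        if i % 100 == 0:
            print(f"iteration {i}: objective {score:.4f}")
    return theta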

This process — construction without factorization — leads to the production of unwieldy artifacts. The reason is that in the space of possible artifacts, the vast majority are very unwieldy, so any process that doesn’t explicitly optimize for comprehensibility leads by default to artifacts that are unwieldy. Imagine stepping through the set of all text files containing syntactically valid python programs. The vast majority of these contain unwieldy spaghetti code. We would have to step through many, many such text files before we found the first one that we might describe as comprehensible, or that would pass code review at a typical software company. Similarly, among the policies representable within the neural network architectures used in machine learning, there is presumably some subset that are internally well-factored in a way that would support human comprehension. But this subset is a tiny manifold within the overall hypothesis space, and even if our starting point for optimization were on this manifold, each gradient step is with high probability going to take us further away from that manifold unless we are explicitly working to stay on the manifold.

And why should a search process factorize its constructions? It has no need for factorization because it does not operate on the basis of abstraction layers. It operates on the basis of trial and error, and under trial-and-error it doesn’t matter whether an artifact is comprehensible or not. This is neither a feature nor a bug of search; it is just the way things are.

But although the absence of factorization is no barrier to search, it is certainly a barrier to our comprehension of the artifacts produced by search. Without handholds in the form of abstraction layers, we have no way to understand how an artifact works, and without understanding how it works it is very difficult to establish trust in it.

Defining comprehensible design

When we design some artifact, we want it to be both effective for its intended purpose, and comprehensible to ourselves and other humans. Being comprehensible really means that there is a story that can be told about the artifact that is both simple and accurate.

Definition. A helpful story is a story about an artifact that is both simple and accurate.

What does simple mean? In this context, when I say simple, I am referring to a concept whose shape is convenient for a human to understand. This may differ from an abstract notion of algorithmic simplicity such as description length. Humans seem to understand concepts through analogies to concepts they already hold, so one may quite easily understand certain complex concepts that map onto conceptual foundations already in place, such as a collection of interlinked database tables, while struggling to understand concepts without any pre-existing conceptual foundations, such as the notion of a ring in abstract algebra.

What does accurate mean? It means a story such that interacting with the artifact as though it were really as simple as described in the story does not cause harm or surprise. It means a story that is useful without being manipulative. It means a story that reveals, as far as possible, the direction of its own necessary imprecisions. This is different to pure predictive accuracy: what we care about is stories that make possible the use of the artifacts they refer to in the construction of larger systems, while trimming off details not necessary for this purpose. A story that is accurate only in the sense that it tells us how an artifact will behave may not give us any affordances for using that artifact in further construction.

Next we come to abstraction layers. We have already used the concept of an abstraction layer in discussing construction and factorization above. The definition I will use is:

Definition. An abstraction layer is an artifact together with a helpful story about it.

It must be stressed once again that the terrain here is subtly different to that of epistemics, in which we observe some natural artifact and come up with a simple model to explain its behavior. In that domain we end up with a similar pairing between some artifact that is complex and a model or story about it that is simple. But in epistemics we are typically studying some already-existing phenomenon, and engage in a process of hypothesizing about it. In design we get to construct the phenomenon, and we can therefore shape it such that there exists a story about it that is both simple and accurate. We are allowed to ignore regions of the design space that contain effective artifacts, simply because those artifacts are difficult to understand and do not lend themselves to the construction of stories that are simultaneously simple and accurate. Finding helpful stories is a great challenge! Whenever we have a choice between finding a helpful story for an artifact produced by some black-box process, versus designing the artifact from the ground up to be amenable to a helpful story, we should certainly choose the latter!

Finally we come to a recursive definition for comprehensibility:

Definition. A comprehensible artifact is an abstraction layer that is built up from parts that are themselves comprehensible artifacts, using only a limited amount of construction to bridge the gap between the parts and the whole.

A car is a comprehensible artifact. Considered as a single artifact, a car comes with a set of very simple and very accurate stories about how to use it to drive from one place to another. Additionally, because my car was produced by an engineering process in which the design work needed to be distributed across many human engineers, it is internally structured in a way that supports decomposition. I may consider the car’s engine: it, too, is well-encapsulated, meaning it has a shape that permits helpful stories to be told about it, and the owner’s manual provides many such helpful stories about the car’s engine. Similarly, the other parts that make up the car — chassis, wheels, transmission, and so on — each come with helpful stories that make it possible to understand them without considering all of their details. And these parts are in turn decomposable: the engine is itself made from parts, which, due to the engineering process by which the car was designed, are again decomposable.

A tree is not a comprehensible artifact. We have been trying to map out how trees work for centuries, and we have made much progress, but we are not done. The human body is not a comprehensible artifact. Trees and human bodies both contain subsystems, presumably because natural selection is itself driven somewhat towards decomposability by the limited amount of information that can be stored in the genome, but they are not nearly so easy to understand. Of course this is no criticism of the many wonders produced by natural selection; it is just the way things are.

Consider a neural network trained to classify images as containing either dogs or cats. It is actually quite easy to turn this artifact into an abstraction layer: our helpful story is simply that you pass in an image in some appropriate format, and get back a label that tells you, with some probability, whether the image contains a dog or a cat. We can now use the neural network as a function without considering any of its internal details. But this neural network is certainly not a comprehensible artifact. We do not know how to decompose neural networks into crisp subsystems. There is some work in the field of machine learning interpretability that attempts to do this (see survey below), but this work is far from complete, and a full understanding of how image recognition works in neural networks remains elusive to us.
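The entire helpful story for such a network can be written as a docstring around it; a sketch, with classify_image standing in for whatever trained model actually sits underneath:

def contains_dog(image_pixels, classify_image):
    """Pass in an image (as an array of pixels) and get back True if it
    probably contains a dog and False if it probably contains a cat.

    What this story omits -- and what we do not currently know how to supply --
    is any account of how the millions of parameters inside classify_image
    arrive at that answer.
    """
    probabilities = classify_image(image_pixels)  # hypothetical trained model
    return probabilities["dog"] > probabilities["cat"]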

In the definition of comprehensible artifacts we required that only a limited amount of construction be used to bridge the gap between the parts and the whole. Each time we construct some artifact out of parts, there is some construction necessary to "glue" the parts together. In some cases we can factorize the glue itself, but there is always some glue remaining, because the decomposition of parts into sub-parts has to bottom out somewhere with raw machinery that does work, or else our artifacts would be nothing more than empty abstraction layers composed endlessly. In order to use an artifact we generally do not need to understand this glue — that is the whole point of an abstraction layer — but in order to understand how an artifact works, we ultimately need to understand this glue. It is therefore critical to keep an upper bound on the amount of construction per abstraction layer, so that we can recursively decompose our artifacts and verify their stories without ever needing to understand more than a fixed amount of glue.

Connection to factored cognition

Andreas Stuhlmüller and Paul Christiano have proposed the factored cognition hypothesis: that the thinking processes that constitute human intelligence can be broken down into thought episodes of perhaps just a few minutes in length, with limited communication between episodes. If true, this would open the door to scaling human intelligence by scaling these short thinking episodes, such as in their iterated distillation and amplification proposal.

At the surface level, these ideas are quite distinct from those presented here, for we have not taken any stance on the nature of the computational processes in a potential solution to the comprehensible design problem, and we certainly don’t assume that those processes could be factored in any particular way. We have discussed the factoring of artifacts in our model of comprehensible design but this is quite different from the factoring of cognition as Stuhlmüller and Christiano propose (just as the factoring of a car into assemblies of subsystems is distinct from the factoring of a mind that is designing a car).

Yet at a deeper level, it is often said that the structure of software products reflects the structure of the companies that produce them, and more broadly there is a way in which the internal organization of artifacts produced by minds reflects the internal organization of those minds. This may point to a deeper connection between what could be termed a "factored artifact hypothesis" advanced in this document, and the factored cognition thesis of Stuhlmüller and Christiano.

Connection to explainability and interpretability in machine learning

The field of interpretability in machine learning is concerned with techniques for making machine learning models understandable to humans. Out of four literature reviews we read on the topic, all four mentioned trust as one of the central motivators for the field. This aligns closely with our concerns. The field has grown quickly over the past few years and we do not have any intention to cover all of it within this short subsection. We refer readers to the recent and much more comprehensive series of posts here on lesswrong by Robert Kirk and Tomáš Gavenčiak (1, 2, 3), as well as to the literature reviews published by Došilović et al[3], Arrieta et al[4], Adadi et al[5], and Gilpin et al[6]. We also found much value in Christoph Molnar’s textbook Interpretable Machine Learning, which is freely available online.

The field is broadly concerned with tools that allow humans to develop trust in black box machine learning models by giving humans insight into the internal workings of the black-box models. As per Adadi et al[7], approaches can be categorized along the following three axes:

  • Local vs global. Local approaches aim to help humans to understand a single prediction made by a machine learning model, whereas global approaches aim to help humans understand the model itself. For example, a system that can explain a decision to assign a low credit rating to a particular individual is a local approach, whereas a visualization technique that shows that each layer of a convolutional neural network is building up successively larger models of visual parts is a global approach. In this report we are mostly concerned with global approaches since we want to develop sophisticated AI systems that can be trusted in general to take beneficial actions, rather than having a human review each individual action.

  • Intrinsic vs post-hoc. Intrinsic approaches involve modifying the original learning algorithm in some way, whereas post-hoc approaches do work to improve interpretability after learning has already concluded. Imposing a sparsity prior to encourage the generation of simple models is an example of an intrinsic approach, whereas training a shallow neural network to approximate the predictions made by a deeper neural network after that deep network has already been trained is an example of a post-hoc approach. We have argued in this write-up that intrinsic approaches are more attractive for sophisticated AI systems, due to the opportunity to construct models in a way that permits interpretability. We would be excited to discover post-hoc approaches that work for sophisticated and general AI systems, but we expect that route to be much more challenging.

  • Model-agnostic vs model-specific. Model-agnostic approaches can be used with any type of machine learning model (neural networks, decision trees, linear or logistic regressors, kernel machines, etc), whereas model-specific approaches are predicated on some particular type of model. Model-agnostic approaches tend to treat the underlying model as a black box and therefore tend to fall on the post-hoc side of the preceding axis. For advanced AI systems, we believe intrinsic approaches are more promising, so on this axis we expect energy to be correspondingly focused on model-specific approaches.

A classic interpretability method is that of Shapley explanations[8], which, for a model that makes a prediction y based on a set of features x1, … xn, assigns a "contribution" to each feature such that the individual feature contributions, added to a baseline prediction, sum to the original prediction. For linear models, computing such contributions is straightforward: each feature’s contribution is just its coefficient multiplied by the amount by which its value differs from the baseline. But for nonlinear models it is not so clear how best to assign such contributions. The authors show that there is only one way to assign such contributions if one wishes to adhere to certain accuracy and consistency desiderata. This kind of method seems like a useful tool for checking simple predictive models, but it is not going to provide deep insights into models that we suspect implement sophisticated algorithms internally.
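A toy sketch of the additive-contribution bookkeeping for the linear case (an illustration only, not the general Shapley-value algorithm; the baseline is taken to be the feature means, and all inputs are assumed to be 1-D numpy arrays):

import numpy as np

def linear_contributions(coefs, intercept, x, baseline):
    # For a linear model y = intercept + coefs . x, the contribution of feature i
    # reduces to coefs[i] * (x[i] - baseline[i]); together with the baseline
    # prediction, the contributions reconstruct the prediction exactly.
    contributions = coefs * (x - baseline)
    baseline_prediction = intercept + coefs @ baseline
    assert np.isclose(baseline_prediction + contributions.sum(), intercept + coefs @ x)
    return contributions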

In the remainder of this section we review three papers that we examined in detail. We chose these three papers based on citations from the review papers above, and based on recommendations from friends.

Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead (2019)[9]

Cynthia Rudin has argued persuasively that it is a mistake for us to construct black-box models and then construct further explanation-producing models that are optimized merely to persuade humans, rather than to be true accounts of how the original model actually works. We very much agree! Instead, she argues that we should train models that have a structure that allows us to understand them directly. She identifies the former (post-hoc explanation production) with "explainable machine learning" and the latter (intrinsically comprehensible models) with "interpretable machine learning". This differs terminologically from several review papers we have read, most of which take these terms to refer to approximately the same overall body of research, but we find it helpful nonetheless.

Rudin offers the provocative hypothesis that most black-box models can be replaced by equally or near-equally accurate interpretable models, and that there is in fact no trade-off between interpretability and accuracy. This is a bold and important claim, and she formalizes the case for it. The gist of her argument is that, due to the finite size of the dataset on which any model is trained, there will generally be many models within a small performance margin of the globally optimal model, and that among these one is likely to find an interpretable model. Put another way: no finite dataset contains enough information to pick out an infinitely precise region within model space, so the question becomes whether the set of close-to-optimal models is large enough to contain an interpretable model. A series of set-approximation theorems in A study in Rashomon curves and volumes[10] argues that one is indeed likely to find an interpretable model within this set.

Rudin’s lab at Duke University is all about training intrinsically interpretable models. Of particular interest are rule lists, which are sequences of logical conditions of the following form:

IF       age between 18-20 and sex is male         THEN predict arrest (within 2 years)
ELSE IF  age between 21-23 and 2-3 prior offenses  THEN predict arrest
ELSE IF  more than three priors                    THEN predict arrest
ELSE     predict no arrest.

A second model category of interest is scoring systems, in which an integer score is computed by summing integer "points" associated with binary conditions, and the final model output is determined by a lookup table indexed by the score.
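The general shape of such a model is easy to convey with a small, made-up example in the same style as the rule list above (purely illustrative, not reproduced from the paper):

1 point    if more than three prior offenses
1 point    if age between 18-24
1 point    if current charge is a felony
---------------------------------------------
SCORE 0-1  predict no arrest (within 2 years)
SCORE 2-3  predict arrest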

It was news to us that there are algorithms for the global optimization of such models! These algorithms differ substantially from the heuristic methods best known in data science. In our experience it is not usually considered important to construct such logical models optimally, since one is usually working within some boosting or bagging framework in which many models of the same type are combined to produce a final output and any deficiency in one particular model is compensated for in the ensuing gradient steps. But this leads to a proliferation of models, which in turn leads to an uninterpretable overall model. Rudin’s approach is to instead focus on achieving optimality in the construction of a single small model, paying a price in terms of implementation difficulty, but reaping rewards in terms of models that are simultaneously accurate and interpretable.

This highlights the real trade-off in machine learning, which is not between interpretability and accuracy but between interpretability and ease of implementation. On the one hand we can work with hypothesis classes such as neural networks that have optimization algorithms that are easy[11] to understand and implement but produce models that are difficult to interpret. On the other hand we can work with hypothesis classes such as sparse logical models, which are difficult to optimize but easy to interpret. Yet this difficulty of optimization is a one-time cost to be paid in algorithms research and software engineering. Once effective algorithms have been discovered and implemented we can use them as many times as we want. Furthermore, it may take less computing power to find an optimal logical model than to train a neural network, and it will almost certainly require less expertise to use such algorithms since local optimizers are sensitive to all kinds of initialization and stepping details, whereas global optimizers either find the true global optimum or fail to do so in a reasonable amount of time.

It is exciting to consider what it would look like to use optimal simple models in computer vision or reinforcement learning. An enticing array of difficult optimization problems beckon, with the prize for their solution being the ability to construct simultaneously interpretable and accurate models.

Rudin speculates on such advances herself, in particular with prototype networks in computer vision. These are neural networks in which early layers are ordinary convolution feature-extraction layers, and then later layers reason by explicitly finding correspondences between regions in the input image and similar regions in training images. The final prediction is then made on the basis that if there are many correspondences between an input image and training images labelled "bird", then the input image itself probably contains a bird. This makes the model somewhat interpretable because one can visually inspect the correspondences and understand how the prediction was made on that basis. However, the early convolutional layers of the network remain opaque.
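A rough sketch of the "this looks like that" reasoning in those later layers, with all names and shapes invented for illustration (the real architecture learns its prototypes and class weights jointly with the convolutional layers):

import numpy as np

def prototype_scores(patch_features, prototypes, prototype_labels):
    # patch_features: feature vectors for regions of the input image (from the
    # opaque convolutional layers); prototypes: learned feature vectors that
    # correspond to regions of particular training images.
    scores = {}
    for prototype, label in zip(prototypes, prototype_labels):
        # How closely does the best-matching region of the input resemble this
        # prototype? These correspondences are what a human can inspect.
        similarity = max(-np.linalg.norm(prototype - patch) for patch in patch_features)
        scores[label] = scores.get(label, 0.0) + similarity
    # Predict the class whose prototypes found the strongest matches.
    return max(scores, key=scores.get)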

Rudin’s work is exciting and insightful, but we believe she misses the following two points.

First, Rudin’s work focuses on building models that are interpretable by being so simple that we can understand how they work just by looking at them. This is like being handed a tool such as a screwdriver or hammer that is so simple that we understand it immediately without needing to refer to any instruction manual. But other tools — say, a 3D printer — may be difficult to understand just by looking at them, yet easy to understand with the help of an instruction manual. We should be willing to produce complex models if they are shaped in such a way that a simple and accurate instruction manual can be written about them, and if we have methods for producing such instruction manuals. Post-hoc explanations are not good enough here; what we are suggesting is to build models and instruction manuals together in such a way that (1) the instruction manual accurately describes how the model really works, and (2) the instruction manual makes it easy for a human to understand the model. Achieving both of these aims simultaneously will require the model itself to be constrained in its complexity, since not all models (presumably) permit such an instruction manual to be written about them, but this is a weaker constraint than requiring our models to be interpretable on sight, without any instruction manual.

Second, Rudin’s work focuses on simplicity as a proxy for interpretability. But algorithmic simplicity is only a weak proxy for how readily a model can be understood by a human. There are concepts that are quite complex when measured by any abstract measure of complexity (such as description length) that humans will nevertheless reason about quite intuitively, such as complex social situations. On the other hand there are concepts from, for example, abstract algebra, that are simple according to abstract measures of complexity, yet require significant training to understand. For this reason we tend to explain sophisticated concepts by analogy to the concepts most intuitive to us. A better measure of a model’s interpretability to a particular person would be the length of the shortest description of that model in terms of concepts this person has already acquired, where the already-acquired concepts are treated as primitives and do not contribute to the length of a description.

Olah et al., Zoom In: An Introduction to Circuits (2020)[12]

One thread of research within interpretability of great interest and relevance to the present work is Chris Olah’s investigation of circuits in convolutional neural networks trained to perform image classification. This thread of work initially gathered attention with Olah’s 2017 article on visualizing individual neurons as well as whole features (one channel within a layer) by optimizing images and image patches to maximally activate these neurons and features[13]. This produced beautiful dream-like images that gave some insight into what the network was "looking" for within different layers and channels.

Olah now proposes to go beyond mere visualization and initiate what he foresees as a "natural science of interpretability" — studying the structure of trained neural networks in the way we might study the inner workings of plants or animals in biology. In Olah’s words:

Most work on interpretability aims to give simple explanations of an entire neural network’s behavior. But what if we instead take an approach inspired by neuroscience or cellular biology — an approach of zooming in? What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing to spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?

In contrast to the typical picture of neural networks as a black box, we’ve been surprised how approachable the network is on this scale. Not only do neurons seem understandable (even ones that initially seemed inscrutable), but the "circuits" of connections between them seem to be meaningful algorithms corresponding to facts about the world. You can watch a circle detector be assembled from curves. You can see a dog head be assembled from eyes, snout, fur and tongue. You can observe how a car is composed from wheels and windows. You can even find circuits implementing simple logic: cases where the network implements AND, OR or XOR over high-level visual features.

While we find this work to decompose trained neural networks very exciting, it is worth noting just how immense this research program is. After years of work it has become possible to begin to identify neural network substructures that implement elementary logical operations. Decomposing entire networks is still a long way off. And even at that point we are still in the realm of feed-forward networks with no memory, trained to solve conceptually straightforward supervised learning tasks in which the input is a single image and the output is a single label. If we build sophisticated artificial intelligence systems by training neural networks, we can expect that they will contain logic many levels deeper than this.

We very much hope this work proceeds quickly and successfully, but its apparent difficulty does lend credence to the thesis advanced in this document: that post-hoc interpretation of artifacts not optimized for comprehensibility is difficult, and that we should find ways to design artifacts on the basis of abstraction layers, not because they would perform better, but because we would be able to understand and therefore trust (or distrust) them.

Wu et al., Beyond Sparsity: Tree Regularization of Deep Models for Interpretability[14]

In this paper, Wu et al. train neural networks on medical diagnosis tasks, in such a way that the neural networks are well-approximated by decision trees. The idea is that humans can interpret the decision trees, thereby gaining some insight into what the neural network is doing. The contribution of the paper is a regularizer that can be used to train neural networks that have this property.

At a high level this idea was exciting to us. We hoped for a demonstration that neural networks can be trained with regularizers that cause them to take on a form that is well-approximated by a "simple story", if one is willing to take the decision tree as a story.

But on closer inspection the paper is disappointing. The regularization does not cause the network to have an internal structure that approximates a decision tree; it merely causes the outputs of the network to be well-approximated by a decision tree. The decision tree therefore gives no real insight into how the network works, which is the kind of understanding we should be demanding before trusting a statistical model. Furthermore, the authors report that on a binary prediction task the neural network "has predictions that agree with its corresponding decision tree proxy in about 85-90% of test examples". That means that in 10-15% of test examples, and presumably at least 10-15% of the time in the real world, the neural network produces the opposite answer from the decision tree. One might wish to understand why these cases differ in this way, but these are precisely the cases where the decision tree offers no help.
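To make the complaint concrete, here is a hedged sketch (using scikit-learn, with a hypothetical network_predict function and made-up data shapes) of what "well-approximated by a decision tree" amounts to: the tree is fit to the network's outputs, so the agreement being measured is a fact about input-output behaviour, not about the network's internal structure.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fidelity(network_predict, X, max_depth=4):
    # Fit a small decision tree to imitate the trained network's predictions...
    y_net = network_predict(X)   # hypothetical trained network
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y_net)
    # ...and measure how often the two agree. An 85-90% figure here means the
    # tree is silent about exactly the cases we would most like explained.
    return np.mean(tree.predict(X) == y_net)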

Conclusion: AI safety via automated comprehensible design?

As we consider how to navigate towards a safe and beneficial future of artificial intelligence, we face the following dilemma. On one hand we have search, which is moving forward quickly but produces untrustworthy artifacts. On the other hand we have design, which under good conditions can produce trustworthy artifacts, but is moving forward very slowly in the domain of artificial intelligence. Search is automated and is therefore accelerating as more and more compute power becomes available to it; design remains stuck at the fixed pace of human cognition.

In the AI safety community there are two basic views on how to resolve this dilemma. On one hand there are those who seek to rescue search: to find ways to harness the power of modern machine learning in a way that produces trustworthy artifacts. Paul Christiano’s work on iterated distillation and amplification seeks to resolve the dilemma by using imitation learning to gradually amplify human capabilities. CHAI’s work on assistance games seeks to resolve the dilemma by relaxing the requirement that an objective be specified before optimization begins. Geoffrey Irving’s work on debate seeks to resolve the dilemma by having powerful AI systems scrutinize one another’s claims in front of a human judge. We applaud these efforts and wish them success.

On the other hand there are those who say that our best bet is to throw as much human thought as possible at the design approach; that despite the slow progress of manual design to-date it is critical that we acquire a foundational theory of intelligent agency and that we should therefore pursue such a theory with whatever resources we have. This is the perspective of researchers at the Machine Intelligence Research Institute (as we understand it). We applaud these efforts, too, and very much hope for research breakthroughs on this front.

But there is a third option: we could automate design, making it competitive with search in terms of its effectiveness at producing powerful artificial intelligence systems, yet retaining its ability to produce comprehensible artifacts in which we can establish trust based on theories and abstraction layers.

We do not currently know how to automate design. We do not really know what design is, although we hope the ideas presented in this essay are helpful. This is therefore a call for research into the nature of comprehensible design and its possible automation.

To automate design, our fundamental task is to build computer systems capable of producing artifacts together with helpful stories about them. To do this, we will need to understand what makes a story helpful to humans (we have proposed simplicity and accuracy, but this surely just scratches the surface), and we will need to discover how computer systems can produce such stories.

We will need to discover how to deeply integrate story production with artifact production. If we try for post-hoc story production — constructing the artifact first and then fitting a story to it after-the-fact — then we will be solving a much more difficult problem than we need to. We should shape our artifacts so as to make story-writing as easy as possible.

Between construction and factorization, it is factorization that seems at present most mysterious. How might we automate the "carving at the joints" of a messy bit of construction? How might we do this such that an elegant abstraction layer is produced?

To do this, we could start with existing search algorithms (i.e. machine learning) and modify them so that they produce stories together with artifacts. Perhaps such an approach would fit within the existing field of interpretability in machine learning.

Or we could start with existing design processes, carefully examining humans engaged in engineering and attempting to automate some or all of their labor.

Or we could start somewhere else entirely, perhaps with some stroke of genius that points the way towards a different and more trustworthy approach. Let us hope that such a stroke of genius is forthcoming soon.

Appendix: An informal practical investigation

I undertook a small investigation into the nature of engineering by working on a personal software project myself, while noting how I was navigating and problem-solving. I set a timer at 10-minute intervals; each time it went off I would stop and write down what I was working on, how I knew to work on that, and how I was going about solving the problem.

The project I worked on was a small script in the Go language to manage the creation of Google Cloud Platform (henceforth "GCP") projects, as well as to enable and disable APIs. In this section I will refer to the specifics of the various technologies I was using to solve this problem because I find it helpful to be very concrete when performing investigations such as this. However, if you are unfamiliar with these particular technologies then please know that their specifics are not really central to this section.

The investigation took place over the course of about 5 hours total over 3 days. I made no attempt to formally test any particular hypothesis, although I was interested in whether my experience matched the framework presented in this essay.

The script I wrote was intended to allow one to configure a GCP project by writing a single configuration file specifying the project’s name, ID, billing account, and a list of APIs to be enabled on the project. The script would then make the necessary API calls to GCP to create or update the project as necessary. This was something I had wanted to build for a while, because every time I set up a new GCP project I find that I’ve forgotten these finicky project-creation steps and have to read through the documentation and rediscover how to do them. I was therefore excited to build a small tool to automate this.

I began by defining the structure of the configuration file and parsing it in YAML format, then I wrote API calls to create the project if it didn’t already exist, then I wrote API calls to enable and disable APIs based on the configuration file, then I wrote API calls to link a billing account to the project, then finally I wrote a helper command to invoke the standard "gcloud" tool with the relevant project automatically selected.

My findings were as follows.

I was astounded by the wealth and depth of the concepts needed to build this simple tool. In the first few minutes of working I wrote some code to parse some simple command-line arguments to the script. Already here I was operating on the basis of powerful stories about how a computer program is invoked from the command line, what the command line is, and how one typically passes in options on the command line. I was using a particular command-line processing library that is based on defining a struct in Go and tagging the various fields with information about how they are to be mapped to strings passed on the command line. I was already familiar with this library so I could work through this part very quickly. I could just "see" the obvious "right way" to solve the sub-problem of passing in command line arguments.

It did not feel as though my cognition consisted of any kind of "search" over a hypothesis space (although of course I didn’t have full access to all the things happening beneath the subconscious level). It felt more like I was rolling out a recipe that I was already familiar with.

Later, as I was constructing the API calls to create projects and enable and disable APIs, there was an even larger number of concepts in play: concepts about what a project is, what a REST API is, how errors are typically returned from REST APIs, how the GCP client libraries typically wrap these APIs, what a context is in Go, to say nothing of the concepts involved in formulating the Go code itself — which involves understanding functions, variables, structures, if statements, for statements, and so on. And these are really only the low-level concepts that one is concerned with during implementation. There are also the high-level concepts of declarative infrastructure and software tooling in general that were very much guiding my approach to solving the problem.

When we write such software, we are building on top of a huge mountain of sophisticated materials (languages, remote APIs, cloud services, and so on). The only way we can make sense of these enormously complex materials is via abstraction layers and the concepts and stories they are predicated on. Navigating this landscape without concepts would be utterly impossible.

At one point I encountered an API call that was returning a 403 Forbidden error code. I initially believed that this was due to an authentication problem, since there is a general concept in REST APIs about which error codes should be used to indicate which kinds of problems, and I expected that the GCP APIs would follow this standard concept. I spent some time trying to debug this by changing the way I was doing authentication, but then I later realized that this error code is returned when one attempts to get information about a project that doesn’t exist yet. This is an interesting example of what happens when one uses a concept that does not match the reality of the materials being described. It’s also interesting that I was the one who created this pairing between material (the API call itself) and concept (the standard conventions for HTTP error codes) in my head. It was not that I read the GCP documentation and it turned out to be incorrect. In fact I did not read the GCP documentation on this particular matter; I simply assumed that the API would conform to standard conventions. I therefore created a kind of abstraction layer of my own on top of the lower-level abstraction layer provided by GCP. The abstraction layer I created did not involve writing any code; it consisted of taking a concept I was familiar with (conventions for HTTP error codes) and pairing it with the artifact of these particular GCP APIs.

The way I both discovered and resolved this mismatch between concept and artifact was through experimentation. After I had written the code to call this API, I put in place an "end cap" that allowed me to run my script even though it was far from finished. I did this by writing code to print the data returned from the API call and then exiting. In this way I could look directly at the materials I was working with by inspecting the actual data returned from the server, and see any ways in which the raw materials differed from the concepts I was using to formulate my expectations. I discovered the problem this way, and later resolved it the same way: by trying IDs of projects that did and did not exist, and noticing that the 403 Forbidden error code was returned for the non-existent ones. This is a good example of the way that construction yields "evidence" about the nature of the materials we are working with. While engaging in design it is critical to establish a channel for regularly receiving this evidence.
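For concreteness, here is a minimal sketch of the kind of "end cap" I mean, assuming the google.golang.org/api/cloudresourcemanager/v1 client library and its googleapi error type. The project ID is a placeholder and this is not the actual code from my script; the point is simply to call the API, print whatever comes back (including the raw error code), and exit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"

	"google.golang.org/api/cloudresourcemanager/v1"
	"google.golang.org/api/googleapi"
)

func main() {
	ctx := context.Background()

	// Build the Cloud Resource Manager client using default credentials.
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to create client:", err)
		os.Exit(1)
	}

	// "End cap": call the API, dump whatever comes back, and stop.
	project, err := svc.Projects.Get("my-example-project-id").Do()
	if err != nil {
		var apiErr *googleapi.Error
		if errors.As(err, &apiErr) {
			// This is where the surprising 403-for-nonexistent-project
			// behavior shows up directly in the raw material.
			fmt.Printf("API error: code=%d message=%q\n", apiErr.Code, apiErr.Message)
		} else {
			fmt.Println("non-API error:", err)
		}
		os.Exit(1)
	}
	fmt.Printf("project: %+v\n", project)
}
```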

Most of my work on this project consisted of construction. The APIs I was working with came with client libraries that made the code to perform the API call fairly succinct, and I found no reason to add further abstraction layers. Furthermore I perceived some risk of running into various show-stoppers that would force me to abandon the project entirely (such as there not being any published API to do the things I needed to do in order to have the script operate the way I wanted it to), so I was eager to see the whole thing work end-to-end before engaging in a lot of factorization.

There were places where I did engage in factorization, though. One was in writing a poll loop to check on the long-running operation of creating a project. This involved a loop that would repeatedly call the API that checks the status of a long-running operation on GCP, and return either success or an error message in the case of failure. This piece needed logic to give up after a timeout expired, to prevent the script from looping indefinitely. The code to perform this poll loop was sufficiently complicated, and there was a sufficiently elegant abstraction layer that could be wrapped around it, that I quickly factored it out into its own function.
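The following sketch shows the shape of that function rather than its actual contents. It assumes the cloudresourcemanager v1 client’s operations.get call, and the timeout handling and poll interval are arbitrary choices made for the illustration.

```go
package gcpproject

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/api/cloudresourcemanager/v1"
)

// waitForOperation polls a long-running GCP operation until it completes,
// returning nil on success, the operation's error message on failure, and
// a timeout error if the deadline passes first.
func waitForOperation(ctx context.Context, svc *cloudresourcemanager.Service, opName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		op, err := svc.Operations.Get(opName).Context(ctx).Do()
		if err != nil {
			return fmt.Errorf("checking operation %s: %w", opName, err)
		}
		if op.Done {
			if op.Error != nil {
				return fmt.Errorf("operation %s failed: %s", opName, op.Error.Message)
			}
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for operation %s", opName)
		}
		time.Sleep(2 * time.Second)
	}
}
```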

Overall, I was struck by how little this process resembled any kind of optimization process, at any level. Perhaps there was a sophisticated and general optimization process happening at a subconscious level in my brain, but it didn’t appear that way. It appeared that I was rolling out a set of known recipes; that I basically knew what to do at most points and my time was occupied with looking up documentation and trying API calls to discover the specifics of how to do each piece.

Perhaps this was because I was working on a from-scratch project, starting with a blank slate, for which I already had a clear goal. Perhaps the optimization took place during the planning stage, which happened before the period during which I was recording my activities at 10-minute intervals. I did not actually do any formal planning, but I had encountered the need for this script several times over a period of a year or so, and each time this happened I further crystallized a rough plan for the script I was implementing here.

Or perhaps it was because the project involved lots of API calls with very little algorithmic complexity. It would be interesting to repeat the experiment while solving some algorithmic programming puzzles.

I was also struck by my ability to work towards a goal that was only vaguely specified. Although I did, as mentioned above, have a sense of what the finished product should look like, this sense was very rough. During construction phases I regularly felt refinements to this rough overall picture snap into place, as I grappled directly with the materials at hand. For example, I decided on the exact structure of the configuration file while writing the code to parse the configuration file, and I made decisions about what would happen if no billing account was specified while writing the API calls to link billing accounts. In this way it felt like my overall plan was itself a kind of high-level concept, and I incrementally refined it to lower-level conceptual clarity as I built up each piece of the puzzle.

In this way, the construction phases brought in evidence not just about the nature of the problem (the nature of the materials I was working with, etc), but also about the nature of the goal. It seemed that I was collecting evidence about the objective as I proceeded. This is very different from a pure search approach in which the objective is specified upfront.


  1. I do not mean to make any claim about the relative complexity of trees versus modern operating systems. I suspect trees are profoundly more sophisticated. ↩︎

  2. We might say that when software engineering becomes search rather than design, it becomes dull and painful. ↩︎

  3. Došilović, F.K., Brčić, M. and Hlupić, N., 2018, May. Explainable artificial intelligence: A survey. In 2018 41st International convention on information and communication technology, electronics and microelectronics (MIPRO) (pp. 0210-0215). IEEE. ↩︎

  4. Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R. and Chatila, R., 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, pp.82-115. ↩︎

  5. Adadi, A. and Berrada, M., 2018. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6, pp.52138-52160. ↩︎

  6. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M. and Kagal, L., 2018, October. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA) (pp. 80-89). IEEE. ↩︎

  7. Adadi, A. and Berrada, M., 2018. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6, pp.52138-52160. ↩︎

  8. Lundberg, S.M. and Lee, S.I., 2017. A unified approach to interpreting model predictions. In Advances in neural information processing systems (pp. 4765-4774). ↩︎

  9. Rudin, C., 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), pp.206-215. ↩︎

  10. Semenova, L. and Rudin, C., 2019. A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755. ↩︎

  11. In the sense that it is conceptually easy to understand and implement gradient descent, not that it is computationally easy, or easy to find a global optimum. ↩︎

  12. Olah, et al., "Zoom In: An Introduction to Circuits", Distill, 2020. ↩︎

  13. Olah, et al., "Feature Visualization", Distill, 2017. ↩︎

  14. Wu, M., Hughes, M.C., Parbhoo, S., Zazzi, M., Roth, V. and Doshi-Velez, F., 2018, April. Beyond sparsity: Tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence. ↩︎

Comments (30)

I liked this post.

I'm not sure that design will end up being as simple as this picture makes it look, no matter how well we understand it---it seems like factorization is one kind of activity in design, but it feels like overall "design" is being used as a kind of catch-all that is probably very complicated.

An important distinction for me is: does the artifact work because of the story (as in "design"), or does the artifact work because of the evaluation (as in search)?

This isn't so clean, since:

  • Most artifacts work for a combination of the two reasons---I design a thing then test it and need a few iterations---there is some quantitative story where both factors almost always play a role for practical artifacts.
  • There seem to be many other reasons things work (e.g. "it's similar to other things that worked" seems to play a super important role in both design and search).
  • A story seems like it's the same kind of thing as an artifact, and we could also talk about where *it* comes from. A story that plays a role in a design itself comes from some combination of search and design.
  • During design it seems likely that humans rely very extensively on searching against mental models, which may not be introspectively available to us as a search but seems like it has similar properties.

Despite those and more complexities, it feels to me like if there is a clean abstraction it's somewhere in that general space, about the different reasons why a thing can work.

Post-hoc stories are clearly *not* the "reason why things work" (at least at this level of explanation). But also if you do jointly search for a model+helpful story about it, the story still isn't the reason why the model works, and from a safety perspective it might be similarly bad.

Hey thank you for your thoughts on this post, friend

overall "design" is being used as a kind of catch-all that is probably very complicated

Yes it may be that "automating design" is really just a rephrasing of the whole AI problem. But I'm hopeful that it's not. Keep in mind that we only have to be competitive with machine learning, which means we only have to be able to automate the design of artifacts that can also be produced by black box search. This seems to me to be a lower bar than automating all human capacity for design, or automating design in general.

In fact you might think the machine learning problem statement of "search a hypothesis space for a policy that performs well empirically" is itself a restatement of the whole AI problem, and perhaps a full solution to this problem would in fact be a general solution to AI. But in practice we've been able to make incremental progress in machine learning without needing to solve any AI-complete parts of the problem space (yet).

does the artifact work because of the story (as in "design"), or does the artifact work because of the evaluation (as in search)?

Interesting. I wasn't thinking of the story as playing any causal role in making the artifact work. (Though it's very important that the story convey how the artifact actually works, rather than being optimized for merely being convincing.)

Can you say more about what it would mean for an artifact to work because of the story?

This isn't so clean, since [...] Most artifacts work for a combination of the two reasons---I design a thing then test it and need a few iterations

Yup very much agreed. But the trial-and-error part of design seems to me very different from the evaluate-and-gradient-step part of search. When I write some code and test it and it doesn't work, I almost always get some insight into some general failure mode that I failed to anticipate before, and I update both my story and my code. This might be realizing that some API sometimes returns null instead of a string for a certain field, or noticing that negative numbers on the command line look like flags to the argument parser. Perhaps we could view these as gradient steps in story space. Or perhaps that's too much of a stretch.

There seem to be many other reasons things work (e.g. "it's similar to other things that worked" seems to play a super important role in both design and search).

Yes when we look at the evolution of, say, a particular product made by a particular company over a timespan of years, it seems that there is a local search happening. Same thing if we look at, say, manufacturing processes for semiconductors. I'm keen to spend more time thinking about this.

A story seems like it's the same kind of thing as an artifact, and we could also talk about where it comes from. A story that plays a role in a design itself comes from some combination of search and design.

Yeah this seems important. If we view a story as a particular kind of artifact then it would be good to get clear on what exactly identifies an artifact as a story. There is a section in the Genesis manifesto (audacious GOFAI manifesto from some cog sci folks at MIT) that talks about internal and external stories, where an internal story is a story represented in mental wetware, and an external story is any object that spontaneously gives rise to an internal story upon examination.

During design it seems likely that humans rely very extensively on searching against mental models, which may not be introspectively available to us as a search but seems like it has similar properties.

In my very limited experiment I was continuously surprised about how non-search-like the practical experience of design was to me. But yes there is a lot happening beneath conscious awareness so I shouldn't update too heavily on this.

if you do jointly search for a model+helpful story about it, the story still isn't the reason why the model works, and from a safety perspective it might be similarly bad

Well when I buy an air conditioner that comes with an instruction manual that includes some details on how the air conditioner is constructed, it's not literally the case that the air conditioner works because of the physical instruction manual. The physical instruction manual is superfluous to the correct functioning of the physical air conditioner. And it's also quite possible that the physical instruction manual was constructed after the physical air conditioner was already constructed, and this doesn't rule out the helpfulness of the instruction manual.

What's important is that the instruction manual conveys a true account of how the air conditioner really works.

Now if we really nailed down what it means for a story to give a true account of how an artifact really works then perhaps we could search over (artifact, story) pairs. But this seems like deeply inaccessible information to me, so I agree this kind of solution can't work. If I'm a search algorithm searching over (artifact, story) pairs then I have no understanding whatsoever about how the artifacts I'm constructing work. I'm just searching. On what basis could I possibly discern whether a certain story faithfully captures how it is that a certain artifact really works?

What we actually need is a process for constructing artifacts that builds them up piece by piece, so that a story can be constructed piece by piece in parallel. Or something else entirely. But it just doesn't seem that search is enough here. I might try to formalize this argument at some point.

I thought this was a great post. One thing which I think this post misses, however, is the extent to which “post-hoc” approaches can be turned into “intrinsic” approaches by using the post-hoc approach as a training objective—in the language of my transparency trichotomy, that lets you turn inspection transparency into training transparency. Relaxed adversarial training is an example of this in which you directly train the model on an overseer's evaluation of some transparency condition given access to inspection transparency tools.

"Search versus design" explores the basic way we build and trust systems in the world. A few notes: 

  • My favorite part is the definition of an abstraction layer as an artifact combined with a helpful story about it. It helps me see the world as a series of abstraction layers. We're not actually close to true reality; we are very much living within abstraction layers — the simple stories we are able to tell about the artefacts we build. A world built by AIs will be far less comprehensible than the world we live in today. (Much more like biology is, except made by something that is much smarter and faster than us instead of stupider and slower.)
  • The post puts in the time to bring into the conversation a lot of other work that attempts to help build simple stories about the AI artefacts that we are building, which I appreciate.
  • The post is pretty simply written, for me, and I understand all the examples and arguments.
  • It also attempts to (briefly) describe a novel direction of future work for solving the problem of building untrustworthy systems with selection, and that's exciting.

For looking at the alignment problem clearly and with a subtly different frame than other discussions, one that resonates for me, and that points to new frames for a solution, I am voting this post +9.

Now let’s say that I succeed in training a neural network to sort integers, and I test it on many test cases and it works flawlessly. Am I ready to deploy this in a safety-critical scenario where a single incorrect output will lead to the death of many living beings?

Obviously a regular sorting algorithm would be better, but if the choice were between the neural net and a human, and you knew there wasn't going to be any distributional shift, I would pick the neural net.

(I'm assuming "many test cases" = several million test cases, drawn from the test distribution.)

Better than any of these solutions is to not have a system where a single incorrect output is catastrophic.

Obviously a regular sorting algorithm would be better, but if the choice were between the neural net and a human, and you knew there wasn't going to be any distributional shift, I would pick the neural net.

Well, sure, but this is a pretty low bar, no? Humans are terrible at repetitive tasks like sorting numbers.

Better than any of these solutions is to not have a system where a single incorrect output is catastrophic.

Yes very much agreed. It is actually incredibly challenging to build systems that are robust to any particular algorithm failing, especially at the granularity of a sorting algorithm. Can I trust the function that appends items to arrays to always work? Can I trust that the command line arguments I receive are accurate to what the user typed? Can I trust the max function? Can I trust that arithmetic is correctly implemented? Do you know of any work that attempts to understand/achieve robustness at this level? I'd be fascinated to read more about this.

Well, sure, but this is a pretty low bar, no? Humans are terrible at repetitive tasks like sorting numbers.

It may be a low bar, but it seems like the right bar if you're thinking on the margin? It's what we use for nuclear reactors, biosecurity, factory safety, etc.

(See also Engineering a Safer World.)

I think my real complaint here is that your story is getting its emotional oomph from an artificial constraint (every output must be 100% correct or many beings die) that doesn't usually hold, not even for AI alignment. If you told the exact same story but replaced the neural net with a human, the correct response would be "why on earth does your system rely on a human to perfectly sort; go design a new system". I think we should react basically the same way when you tell this story with neural nets.

The broader heuristic I'm using: you should not be relying on stories that seem ridiculous if you replaced AIs with humans, unless you specifically identify a relevant difference between AIs and humans that matters for that story.

(Plausibly you could believe that even if we never built AI systems, humans would still cause an existential catastrophe, and so we need to hold AI systems to a higher standard than humans. If so, it would be good to make this assumption clear, as to my knowledge it isn't standard.)

I think my real complaint here is that your story is getting its emotional oomph from an artificial constraint (every output must be 100% correct or many beings die) that doesn't usually hold, not even for AI alignment

Well OK I agree that "every output must be 100% correct or many beings die" is unrealistic. My apologies for a bad choice of toy problem that suggested that I thought such a stringent requirement was realistic.

But would you agree that there are some invariants that we want advanced AI systems to have, and that we really want to be very confident that our AI systems satisfy these invariants before we deploy them, and that these invariants really must hold at every time step?

To take an example from ARCHES, perhaps it should be the case that, for every action output at every time step, the action does not cause the Earth's atmospheric temperature to move outside some survivable interval. Or perhaps you say that this invariant is not a good safety invariant -- ok, but surely you agree that there is some correct formulation of some safety invariants that we really want to hold in an absolute way at every time step? Perhaps we can never guarantee that all actions will have acceptable consequences because we can never completely rule out some confluence of unlucky conditions, so then perhaps we formulate some intent alignment invariant that is an invariant on the internal mechanism by which actions are generated. Or perhaps intent alignment is misguided and we get our invariants from some other theory of AI safety. But there are going to be invariants that we want our systems to satisfy in an absolute way, no?

And if we want to check whether our system satisfies some invariant in an absolute way then I claim that we need to be able to look inside the system and see how it works, and convince ourselves based on an understanding of how the thing is assembled that, yes, this python code really will sort integers correctly in all cases; that, yes, this system really is structured such that this intent alignment invariant will always hold; that yes, this learning algorithm is going to produce acceptable outputs in all cases for an appropriate definition of acceptability.

When we build sophisticated systems and we want them to satisfy sophisticated invariants, it's very hard to use end-to-end testing alone. And we are forced to use end-to-end testing alone whenever we are dealing with systems that we do not understand the internals of. Search produces systems that are very difficult to understand the internals of. Therefore we need something beyond search. This is the claim that my integer sorting example was trying to be an intuition pump for. (This discussion is helping to clarify my thinking on this a lot.)

I agree that there are some invariants that we really would like to hold, but I don't think it should necessarily be thought of in the same way as in the sorting example.

Like, it really would be nice to have a 100% guarantee on intent alignment. But it's not obvious to me that you should think of it as "this neural network output has to satisfy a really specific and tight constraint for every decision it ever makes". It's not like for every possible low-level action a neural net is going to take, it's going to completely rethink its motivations / goals and forward-chain all the way to what action it should take. The risk seems quite a bit more nebulous: maybe the specific motivation the agent has changes in some particular weird scenario, or would predictably drift away from what humans want as the world becomes very different from the training setup.

(All of these apply to humans too! If I had a human assistant who was intent aligned with me, I might worry that if they were deprived of food for a long time, they might stop being intent aligned with me; or if I got myself uploaded, then they may see the uploaded-me as a different person and so no longer be intent aligned with me. Nonetheless, I'd be pretty stoked to have an intent aligned human assistant.)

There is a relevant difference between humans and AI systems here, which is that we expect that we'll be ceding more and more decision-making influence to AI systems over time, and so errors in AI systems are more consequential than errors in humans. I do think this raises the bar for what properties we want out of AI systems, but I don't think it gets to the point of "every single output must be correct", at least depending on what you mean by "correct".

Re: the ARCHES point: I feel like an AI system would only drastically modify the temperature "intentionally". Like, I don't worry about humans "unintentionally" jumping into a volcano. The AI system could still do such a thing, even if intent aligned (e.g. if its user was fighting a war and that was a good move, or if the user wanted to cause human extinction). My impression is that this is the sort of scenario ARCHES is worried about: if we don't solve the problem of humans competing with each other, then humans will fight with more and more impactful AI-enabled "weapons", and eventually this will cause an existential catastrophe. This isn't the sort of thing you can solve by designing an AI system that doesn't produce "weapons", unless you get widespread international coordination to ensure that no one designs an AI system that can produce "weapons".

(Weapons in quotes because I want it to also include things like effective propaganda.)

Planned summary for the Alignment Newsletter:

Deep learning can be thought of as an instance of _search_, in which we design an artifact (machine) simply by looking for an artifact that scores well on some evaluation metric. This is unlike typical engineering, which we might call _design_, in which we build the artifact in such a way that we can also understand it. This is the process that underlies the vast majority of artifacts in the world. This post seeks to understand design better, such that we could design powerful AI systems rather than having to find them using search.
The post argues that design functions by constructing an artifact along with a _story_ for why the artifact works, that abstracts away irrelevant details. For example, when working with a database, we talk of adding a “row” to a “table”: the abstraction of rows and tables forms a story that allows us to easily understand and use the database.
A typical design process for complex artifacts iterates between _construction_ of the artifact and _factorization_ which creates a story for the artifact. The goal is to end up with a useful artifact along with a simple and accurate story for it. A story is simple if it can be easily understood by humans, and accurate if humans using the story to reason about the artifact do not get surprised or harmed by the artifact.
You might think that we can get this for search-based artifacts using interpretability. However, most interpretability methods are either producing the story after the artifact is constructed (meaning that the construction does not optimize for simple and accurate stories), or are producing artifacts simple enough that they do not need a story. This is insufficient for powerful, complex artifacts.
As a result, we would like to use design for our artifacts rather than search. One alternative approach is to have humans design intelligent systems (the approach taken by MIRI). The post suggests another: automating the process of design, so that we automate both construction and factorization, rather than just construction (as done in search).

Planned opinion:

I liked the more detailed description of what is meant by “design”, and the broad story given for design seems roughly right, though obscuring details. I somewhat felt like the proposed solution of automating design seems pretty similar to existing proposals for human-in-the-loop AI systems: typically in such systems we are using the human to provide information about what we want and to verify that things are going as we expect, and it seems like a pretty natural way that this would happen would be via the AI system producing a story that the human can verify.

I think this is a very good summary

Thanks :)

This write-up benefited from feedback from ...Ben Pence.

Did I give you feedback on this writeup? Or do I have a dark arch-nemesis out there that someday I will need to fight?

Ah this is a different Ben.

Then I will prepare for combat.

And thus the wheel of the Dharma was set in motion once again, for one more great turning of time

If there was a vote for the best comment thread of 2020, that would probably be it for me.

Honestly, Pace and Pence should team up to make a super team. Nominative similarity ought to be a Schelling feature for coordination.

He can be vice president.


I mostly focused on the interpretability section as that's what I'm most familiar with, and I think your criticisms are very valid. I also wrote up some thoughts recently on where post-hoc interpretability fails, and Daniel Filan has some good responses in the comments below.

Also, re: disappointment on tree regularization, something that does seem more promising is Daniel Filan and others at CHAI working on investigating modularity in neural nets. You can probably ask him more, but last time we chatted, he also had some thoughts (unpublished) on how to enforce modularization as a regularizer, which seems to be what you wished the tree reg paper would have done.

Overall, this is great stuff, and I'll need to spend more time thinking about the design vs search distinction (which makes sense to me at first glance).

Nice write-up. The note about adversarial examples for LIME and SHAP was not something I've come across before - very cool.

Thanks for the pointer to Daniel Filan's work - that is indeed relevant and I hadn't read the paper before now.

Nice post, very much the type of work I'd like to see more of. :) A few small comments:

Why should a search process factorize its constructions? It has no need for factorization because it does not operate on the basis of abstraction layers.

I think this is incorrect - for example, "biological systems are highly modular, at multiple different scales". And I expect deep learning to construct minds which are also fairly modular. That also allows search to be more useful, because it can make changes which are comparatively isolated.

This thread of work initially gained notoriety with Olah’s 2017 article

I'm not sure I'd describe this work as "notorious", even if some have reservations about it.

But there is a third option: we could automate design, making it competitive with search in terms of its effectiveness at producing powerful artificial intelligence systems, yet retaining its ability to produce comprehensible artifacts in which we can establish trust based on theories and abstraction layers.

In light of my claim that search can also produce modularity and abstraction, I suspect that this might look quite similar to what you describe as rescuing search - because search will still be doing the "construction" part of design, and then we just need a way to use the AIs we've constructed to analyse those constructions. So then I guess the key distinction is, as Paul identifies, whether the artifact works *because* of the story or not.

Nice post, very much the type of work I'd like to see more of.

Thank you!

I'm not sure I'd describe this work as "notorious", even if some have reservations about it.

Oops, terrible word choice on my part. I edited the article to say "gained attention" rather than "gained notoriety".

I think this is incorrect - for example, "biological systems are highly modular, at multiple different scales". And I expect deep learning to construct minds which are also fairly modular. That also allows search to be more useful, because it can make changes which are comparatively isolated.

Yes I agree with this, but modularity is only a part of what is needed for comprehensibility. Chris Olah's work on circuits in convnets suggests that convnets trained on image recognition tasks are somewhat modular, but it's still very very difficult to tease them apart and understand them. Biological trees are modular in many ways, but we're still working on understanding how trees work after many centuries of investigation.

You might say that comprehensibility = modularity + stories. You need artifacts that decompose into subsystems, and you need stories about that decomposition and what the pieces do so that you're not left figuring it out from scratch.

I am not sure that designed artefacts are automatically easily interpretable.

If an engineer is looking over the design of the latest smartphone, then the artefact is similar to previous artefacts they have experience with. This will include a lot of design details about chip architecture and instruction set. The engineer will also have the advantage of human written spec sheets.

If we sent a pile of smartphones to Isaac Newton, he wouldn't have either of these advantages. He wouldn't be able to figure out much about how they worked.

There are three factors here: the existence of written documentation, similarity to previous designs, and being composed of separate subsystems. All help understandability. If an AI is designing radically new tech, we lose the similarity to understood designs.

I am not sure that designed artefacts are automatically easily interpretable.

It is certainly not the case that designed artifacts are easily interpretable. An unwieldy and poorly documented codebase is a design artifact that is not easily interpretable.

Design at its best can produce interpretable artifacts, whereas the same is not true for machine learning.

The interpretability of artifacts is not a feature of the artifact itself but of the pair (artifact, story), or you might say (artifact, documentation). We design artifacts in such a way that it is possible to write documentation such that the (artifact, documentation) pair facilitates interpretation by humans.

If we sent a pile of smartphones to Isaac Newton, he wouldn't have either of these advantages. He wouldn't be able to figure out much about how they worked.

Hmm well if it's possible for anyone at all in the modern world to understand how a smartphone works by, say, age 30, then that means it takes no more than 30 years of training to learn from scratch everything you need to understand how a smartphone works. Presumably Newton would be quite capable of learning that information in less than 30 years. Now here I'm assuming that we send Newton the pair (artifact, documentation), where "documentation" is whatever corpus of human knowledge is needed as background material to understand a smartphone. This may include substantially more than just that which we would normally think of as "documentation for a smartphone". But it is possible to digest all this information because humans are born today without any greater pre-existing understanding of smartphones than Newton had, and yet some of them do go on to understand smartphones.

There are three factors here: the existence of written documentation, similarity to previous designs, and being composed of separate subsystems.

Yeah good point re similarity to previous designs. I'll think more on that.

There's also the "Plato's man" type of problem where undesirable things fall under the advanced definition of interpretability. For example, ordinary neural nets are "interpretable," because they are merely made out of interpretable components (simple matrices with a non-linearity) glued together.

In the sorting problem, suppose you applied your advanced interpretability techniques, and got a design with documentation.

You also apply a different technique, and get code with formal proof that it sorts.

In the latter case, you can be sure that the code works, even if you can't understand it.

The algorithm+formal proof approach works whenever you have a formal success criteria.

It is less clear how well the design approach works on a problem where you can't write formal success criteria so easily.

Here is a task that neural nets have been made to do: convert pictures of horses into similar pictures of zebras. https://youtu.be/D4C1dB9UheQ?t=72. I am unsure if a designed solution to this problem exists.

Imagine that you give a bunch of smart programmers a lecture on how to solve this problem, and then they have to implement a solution without access to any source of horse or zebra pictures. I suspect they would fail. I would suspect that solving this problem well fundamentally requires a significant amount of information about horses and zebras. I suspect that the amount of info required is more than a human can understand and conceptualize at once. The human will be able to understand each small part of the system, but logic gates are understandable, so that must hold for any system. The human can understand why it works in the abstract, the way we understand gradient decent over neural nets.

I am not sure that this problem has a core insight that is possessable, but not possessed by us.

Design is also closely related to the other-izer problem because if you think of "designing" strategies or actions, this can have different Goodhart's law implications than searching for them - if you break down the problem according to "common sense" rather than according to best achieving the objective, at least.

One key to this whole thing seems to be that "helpfulness" is not something that we can write an objective for. But I think the reason that we can't write an objective for it is better captured by inaccessible information than by Goodhart's law.

By "other-izer problem", do you mean the satisficer and related ideas? I'd be interested in pointers to more "other-izers" in this cluster.

But isn't it the case that these approaches are still doing something akin to search in the sense that they look for any element of a hypothesis space meeting some conditions (perhaps not a local optimum, but still some condition)? If so then I think these ideas are quite different from what humans do when we design things. I don't think we're primarily evaluating whole elements of some hypothesis space looking for one that meets certain conditions, but are instead building things up piece-by-piece.

Well, any process that picks actions ends up equivalent to some criterion, even if only "actions likely to be picked by this process." The deal with agents and agent-like things is that they pick actions based on their modeled consequences. Basically anything that picks actions in a different way (or, more technically, a way that's complicated to explain in terms of planning) is an other-izer to some degree. Though maybe this is drift from the original usage, which wanted nice properties like reflective stability etc.

The example of the day is language models. GPT doesn't pick its next sentence by modeling the world and predicting the consequences. Bam, other-izer. Neither design nor search.

Anyhow, back on topic, I agree that "helpfulness to humans" is a very complicated thing. But maybe there's some simpler notion of "helpful to the AI" that results in design-like other-izing that loses some of the helpfulness-to-humans properties, but retains some of the things that make design seem safer than search even if you never looked at the "stories."