Chris Olah’s views on AGI safety

Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations.

In thinking about AGI safety (and really any complex topic on which many smart people disagree), I’ve often found it very useful to build a collection of viewpoints from people I respect, each of which I understand well enough to think from that person’s perspective. For example, I will often compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I’ve gained a new hat that I’ve found extremely valuable and that I don’t think many other people in this community have: my Chris Olah hat. The goal of this post is to give that hat to more people.

If you’re not familiar with him, Chris Olah leads the Clarity team at OpenAI and previously worked at Google Brain. Chris has been a part of many of the most exciting ML interpretability results in the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.”

He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven’t really been recorded previously. So: here’s my take on Chris’s AGI safety worldview.

The benefits of transparency and interpretability

Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding with AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I’ll go over in his order of importance.

Catching problems with auditing

First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check and see what it actually learned and make sure it learned the right thing. If it didn’t—if you find that it’s learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you’re in a domain where your AI isn’t actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of a mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn’t necessarily mean waiting until you’re on the verge of deployment to look for flaws. Ideally you’d be able to discover problems early on via an ongoing auditing process as you build more and more capable systems.

One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools without looking at error cases. Chris’s hope is that if we can reliably catch problems in an adversarial context like the auditing game, it’ll translate into more reliably being able to catch alignment issues in the future.
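To make the shape of the auditing game concrete, here is a toy sketch under heavy assumptions. The "network" is a single linear layer from named features to class logits, and the auditor's judgment is stood in for by a hand-written table of which features each class should rely on. All names, weights, and the `EXPECTED_INPUTS` table are hypothetical and are not the Clarity team's actual setup.

```python
# Toy auditing game. The "network" is one linear layer from named features
# to class logits; every name and number here is hypothetical.

def make_clean_network():
    return {
        "dog":   {"fur": 1.0, "snout": 1.2, "whiskers": 0.1, "barrel": 0.0},
        "cat":   {"fur": 1.0, "snout": 0.2, "whiskers": 1.3, "barrel": 0.0},
        "rifle": {"fur": 0.0, "snout": 0.0, "whiskers": 0.0, "barrel": 1.5},
    }

def red_team_modify(network):
    """Plant a flaw: inputs with both dog and cat features score as rifle."""
    modified = {cls: dict(weights) for cls, weights in network.items()}
    modified["rifle"]["snout"] = 2.0
    modified["rifle"]["whiskers"] = 2.0
    return modified

def classify(network, features):
    """Return the class with the highest logit for a list of active features."""
    logits = {cls: sum(w[f] for f in features) for cls, w in network.items()}
    return max(logits, key=logits.get)

# Stand-in for the auditor's judgment about which features each class
# should legitimately depend on (an assumption of this sketch).
EXPECTED_INPUTS = {
    "dog": {"fur", "snout"},
    "cat": {"fur", "whiskers"},
    "rifle": {"barrel"},
}

def audit(network, threshold=0.5):
    """Inspect weights only (no inputs are run) and flag strong
    feature-to-class connections the auditor did not expect."""
    findings = {}
    for cls, weights in network.items():
        unexpected = sorted(
            f for f, w in weights.items()
            if abs(w) >= threshold and f not in EXPECTED_INPUTS[cls]
        )
        if unexpected:
            findings[cls] = unexpected
    return findings

clean = make_clean_network()
modified = red_team_modify(clean)
findings = audit(modified)
print(findings)
```

The auditor never calls `classify`, which mirrors the rule that the diagnosis has to come from inspecting the model rather than from looking at error cases. A real network has no named features or tidy weight table, which is exactly the gap that interpretability tools would have to close.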

Deliberate design

Second, Chris argues, advances in transparency and interpretability could allow us to significantly change the way we design ML systems. Instead of a sort of trial-and-error process where we just throw lots of different techniques at the various benchmarks and see what sticks, if we had significantly better transparency tools we might be able to design our systems deliberately by understanding why our models work and how to improve them. In this world, because we would be building systems with an understanding of why they work, we might be able to get a much better understanding of their failure cases as well and how to avoid them.

In addition to these direct benefits, Chris expects some large but harder-to-see benefits from such a shift as well. Right now, not knowing anything about how your model works internally is completely normal. If even partly understanding one’s model became normal, however, then the amount we don’t know might become glaring and concerning. Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.

Giving feedback on process

Third, access to good transparency and interpretability tools lets you give feedback to a model—in the form of a loss penalty, reward function, etc.—not just on its output, but also on the process it used to get to that output. Chris and his coauthors lay this argument out in “Building Blocks of Interpretability:”

One very promising approach to training models for these subtle objectives is learning from human feedback. However, even with human feedback, it may still be hard to train models to behave the way we want if the problematic aspect of the model doesn’t surface strongly in the training regime where humans are giving feedback. Human feedback on the model’s decision-making process, facilitated by interpretability interfaces, could be a powerful solution to these problems. It might allow us to train models not just to make the right decisions, but to make them for the right reasons. (There is however a danger here: we are optimizing our model to look the way we want in our interface — if we aren’t careful, this may lead to the model fooling us!)

The basic idea here is that rather than just using interpretability as a mulligan at the end, you could also use it as part of your objective during training, incentivizing the model to be as transparent as possible. Chris notes that this sort of thing is quite similar to the way in which we actually judge human students by asking them to show their work. Of course, this has risks—it could increase the probability that your model only looks transparent but isn’t actually—but it also has the huge benefit of helping your training process steer clear of bad uninterpretable models. In particular, I see this as potentially being a big boon for informed oversight, as it allows you to incorporate into your objective an incentive to be more transparent to an amplified overseer.
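One way to picture feedback on process as a loss penalty is the rough sketch below. It assumes a linear model, uses |w_i * x_i| as a crude attribution score, and assumes an overseer has marked which input features are acceptable to rely on. The function names, the `allowed` set, and the weighting `lam` are all hypothetical choices made for illustration.

```python
def attributions(weights, x):
    """Per-feature contribution |w_i * x_i| to a linear model's output."""
    return [abs(w * xi) for w, xi in zip(weights, x)]

def process_penalty(weights, x, allowed):
    """Fraction of attribution mass on features the overseer did not approve."""
    attr = attributions(weights, x)
    total = sum(attr)
    if total == 0:
        return 0.0
    return sum(a for i, a in enumerate(attr) if i not in allowed) / total

def total_loss(weights, x, target, allowed, lam=1.0):
    """Task loss on the output plus a penalty on how the output was produced."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    task_loss = (prediction - target) ** 2
    return task_loss + lam * process_penalty(weights, x, allowed)

# Two hypothetical models that give the same answer for different reasons.
x = [1.0, 1.0]
target = 1.0
allowed = {0}  # the overseer only approves of relying on feature 0

right_reasons = [1.0, 0.0]
wrong_reasons = [0.0, 1.0]

print(total_loss(right_reasons, x, target, allowed))
print(total_loss(wrong_reasons, x, target, allowed))
```

Both models have zero task loss, so output-only feedback cannot tell them apart, while the process term separates them. The risk noted above applies directly: a model optimized against this penalty might learn to make its attributions look acceptable without changing what it actually relies on.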

One way in particular that the Clarity team’s work could be relevant here is a research direction they’re working on called model diffing. The idea of model diffing is to have a way of systematically comparing different models and determining what’s different from the point of view of high-level concepts and abstractions. In the context of informed oversight—or specifically relaxed adversarial training—you could use model diffing to track exactly how your model is evolving over the course of training in a way which is inspectable to the overseer.[1]
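As a very crude stand-in for what tracking a model over training might look like, the sketch below compares two checkpoints layer by layer and ranks layers by relative weight change. This is an assumption-laden simplification: model diffing as described above would compare high-level concepts and abstractions, not raw weights, and the checkpoint format and layer names here are hypothetical.

```python
import math

def layer_change(before, after):
    """Relative L2 change between two flat weight lists of equal length."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    norm = math.sqrt(sum(b ** 2 for b in before))
    return diff / norm if norm > 0 else diff

def diff_models(ckpt_a, ckpt_b):
    """Rank layers by how much they changed between two checkpoints.
    Assumes both checkpoints contain the same layer names."""
    changes = [(name, layer_change(ckpt_a[name], ckpt_b[name])) for name in ckpt_a]
    return sorted(changes, key=lambda item: item[1], reverse=True)

# Hypothetical checkpoints from two points in training.
earlier = {"conv1": [1.0, 0.0], "conv2": [3.0, 4.0], "head": [1.0, 1.0]}
later = {"conv1": [1.0, 0.0], "conv2": [0.0, 4.0], "head": [1.0, 1.1]}

report = diff_models(earlier, later)
for name, change in report:
    print(f"{name}: {change:.3f}")
```

A report like this only says where a model changed, not what the change means. The hope behind model diffing is to answer the second question in terms an overseer can inspect.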

Building microscopes not agents

One point that Chris likes to talk about is that—despite talking a lot about how we want to avoid race-to-the-bottom dynamics—the AI safety community seems to have just accepted that we have to build agents, despite the dangers of agentic AIs.[2] Of course, there’s a reason for this: agents seem to be more competitive. Chris cites Gwern’s “Why Tool AIs Want to Be Agent AIs” here, and notes that he mostly agrees with it—it does seem like agents will be more competitive, at least by default.

But that still doesn’t mean we have to build agents—there’s no universal law compelling us to do so. Rather, agents only seem to be on the default path because a lot of the people who currently think about AGI see them as the shortest path.[3] But potentially, if transparency tools could be made significantly better, or if a major realignment of the ML community could be achieved—which Chris thinks might be possible, as I’ll talk about later—then there might be another path.

Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it. That is, rather than training an RL agent, you could train a predictive model on a bunch of data and use interpretability tools to inspect it and figure out what it learned, then use those insights to inform—either with a human in the loop or in some automated way—whatever actions you actually want to take in the world.

Chris calls this alternative vision of what an advanced AI system might look like a microscope AI since the AI is being used sort of like a microscope to learn about and build models of the world. In contrast with something like a tool or oracle AI that is designed to output useful information, the utility of a microscope AI wouldn’t come from its output but rather our ability to look inside of it and access all of the implicit knowledge it learned. Chris likes to explain this distinction by contrasting Google Translate—the oracle/tool AI in this analogy—to an interface that could give you access to all the linguistic knowledge implicitly present in Google Translate—the microscope AI.

Chris talks about this vision in his post “Visualizing Representations: Deep Learning and Human Beings:”

The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.

(If the telescope is doing a good job, it fades from the consciousness of the person looking through it. But if there’s a scratch on one of the telescope’s lenses, the scratch is highly visible. If one has an example of a better telescope, the flaws in the worse one will suddenly stand out. Similarly, most of what we learn about neural networks from representations is in unexpected behavior, or by comparing representations.)

Understanding data and understanding models that work on that data are intimately linked. In fact, I think that understanding your model has to imply understanding the data it works on.

While the idea that we should try to visualize neural networks has existed in our community for a while, this converse idea—that we can use neural networks for visualization—seems equally important [and] is almost entirely unexplored.

Shan Carter and Michael Nielsen have also discussed similar ideas in their Artificial Intelligence Augmentation article in Distill.

Of course, the obvious question with all of this is whether it could ever be anything but hopelessly uncompetitive. It is important to note that Chris generally agrees that microscopes are unlikely to be competitive—which is why he’s mostly betting on the other routes to impact above. He just hasn’t entirely given up hope that a realignment of the ML community away from agents towards things like deliberate design and microscopes might still be possible.

Furthermore, even in a world where the ML community still looks very similar to how it does today, if we have really good interpretability tools and the largest AI coalition has a strong lead over the next largest, then it might be possible to stick with microscopes for quite some time, perhaps long enough to either figure out how to align agents or otherwise get some sort of decisive strategic advantage.

What if interpretability breaks down as AI gets more powerful?

Chris notes that one of the biggest differences between him and many of the other people in the AI safety community is his belief that very strong interpretability is possible at all. The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic (e.g. the entire Linux kernel) but that it would be faithful to the model and understandable by humans. Chris’s basic intuition here is that neural networks really do seem to learn meaningful features and that if you’re willing to put in a lot of energy to understand them all (e.g. by actually inspecting every single neuron), then you can make it happen. Chris notes that this is in contrast to a lot of other neural network interpretability work, which is aimed more at approximating what neural networks do in particular cases.

Of course, this is still heavily dependent on exactly what the scaling laws are like for how hard interpretability will be as our models get stronger and more sophisticated. Chris likes to use the following graph to describe how he sees transparency and interpretability tools scaling up:

This graph has a couple of different components to it. First, simple models tend to be pretty interpretable: think, for example, of linear regression, which gives you super easy-to-understand coefficients. Second, as you scale up past simple stuff like linear regression, things get a lot messier. But Chris has a theory here: the reason these models aren’t very interpretable is that they don’t have the capacity to express the full concepts that they need, so they rely on confused concepts that don’t quite track the real thing. In particular, Chris notes that he has found that better, more advanced, more powerful models tend to have crisper, clearer, more interpretable concepts (e.g. InceptionV1 is more interpretable than AlexNet). Chris believes that this sort of scaling up of interpretability will continue for a while until you get to around human-level performance, at which point Chris hypothesizes that the trend will stop as models start moving away from crisp human-level concepts to still crisp but now quite alien concepts.
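To illustrate the simple end of that curve, here is a minimal sketch of why linear regression counts as interpretable: the fitted parameters can be read off directly as a statement about the data. The data is hypothetical and generated from a known line, and `fit_line` is just the textbook closed-form least-squares fit for one input.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours of study vs. exam score, generated from y = 5 + 2x.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [7.0, 9.0, 11.0, 13.0]

slope, intercept = fit_line(hours, scores)
print(f"each extra hour adds {slope} points; the baseline is {intercept}")
```

The whole model is two numbers, each with an obvious meaning. Nothing comparable can be read off the raw weights of a deep network, which is where the messier middle of the curve begins.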

If you buy this graph, or something like it, then interpretability should be pretty useful all the way up to and including AGI, though perhaps not for very far past AGI. But if you buy a continuous-takeoff worldview, then that’s still pretty useful. Furthermore, I think that the dropping off of interpretability at the end of this graph is just an artifact of using a human overseer. If you instead substituted in an amplified overseer, then I think it’s plausible that interpretability could just keep going up, or at least level off at some high level.

Improving the field of machine learning

One thing that Chris thinks could really make a big difference in achieving a lot of the above goals would be some sort of realignment of the machine learning community. Currently, the thing that the ML community primarily cares about is chasing state-of-the-art results on its various benchmarks without regard for understanding what the ML tools they’re using are actually doing. But that’s not what the machine learning discipline has to look like, and in fact, it’s not what most scientific disciplines do look like.

Here’s Chris’s vision for what an alternative field of machine learning might look like. Currently, machine learning researchers primarily make progress on benchmarks via trial and error. Instead, Chris wants to see a field which focuses on deliberate design where understanding models is prioritized and the way that people make progress is through deeply understanding their systems. In this world, ML researchers primarily make better models by using interpretability tools to understand why their models are doing what they’re doing instead of just throwing lots of things at the wall and seeing what sticks. Furthermore, a large portion of the field in this world is just devoted to gathering information on what models do—cataloging all the different types of circuits that appear across different neural networks, for example[4]—rather than on trying to build new models.[5]

If you want to change the field in this way, there are essentially two basic paths to making something like that happen—you can either:

  1. get current ML researchers to switch over to interpretability/deliberate design/microscope use or
  2. produce new ML researchers working on those things.

Chris has thoughts on how to do both of these, but I’ll start with the first one. Chris thinks that several factors could make a high-quality interpretability field appealing for researchers. First, interpretability could be a way for researchers without access to large amounts of compute to stay relevant in a world where relatively few labs can train the largest machine learning models. Second, Chris thinks there’s lots of low-hanging fruit in interpretability, such that it should be fairly easy to produce impressive research results in the space over the next few years. Third, Chris’s vision of interpretability is very aligned with traditional scientific virtues (which can be quite motivating for many people), even if it isn’t very aligned with the present paradigm of machine learning.

However, if you want researchers to switch to a new research agenda and/or style of research, it needs to be possible for them to support careers based on it. Unfortunately, the unit of academic credit in machine learning tends to be traditional papers, published in conferences, evaluated on whether they set a new state of the art on a benchmark (or, more rarely, on whether they prove theoretical results). This is what decides who gets hired, promoted, and tenured in machine learning.

To address this, Chris founded Distill, an academic machine learning journal that aims to promote a different style of machine learning research. Distill aims to be a sort of “adapter” between the traditional method of evaluating research and the new style of research—based around things like deliberate design and microscope use—that Chris wants to see the field move to. Specifically, Distill does this by being different in a few key ways:

  1. Distill explicitly publishes papers visualizing machine learning systems, or even just explanations improving clarity of thought in machine learning (Distill’s expository articles have become widely used references).
  2. Distill has all of the necessary trappings to make it recognized as a legitimate academic journal such that Distill publications will be taken seriously and cited.
  3. Distill has support for all the sorts of nice interactive diagrams that are often necessary for presenting interpretability research.

The second option is to produce new ML researchers pursuing deliberate design rather than converting old ones. Here, Chris has a pretty interesting take on how this can be done: convert neuroscientists and systems biologists.

Here’s Chris’s pitch. There are whole fields of neuroscience dedicated to understanding all the different connections, circuits, pathways, etc. in all manner of animal brains. Similarly, for the systems biologists, there are significant communities of researchers studying individual proteins, their interactions and pathways, etc. While neural networks are different from these lines of research at a detailed level, a lot of high-level research expertise (e.g. epistemic standards for studying circuits, recurring motifs, research intuition) may be just as helpful for this type of research as machine learning expertise. Chris thinks neuroscientists or systems biologists willing to make this transition would be able to get funding to do their research, have a much easier time running experiments, and find lots of low-hanging fruit in terms of new publishable results that nobody has found yet.

Doesn’t this speed up capabilities?

Yes, it probably does—and Chris agrees that there’s a negative component to that—but he’s willing to bet that the positives outweigh the negatives.

Specifically, Chris thinks the main question is whether principled and deliberate model design based on interpretability can beat automated model design approaches like neural architecture search. If it can, we get capabilities acceleration, but also a paradigm shift towards deliberate model design, which Chris expects to significantly aid alignment. If it can’t, interpretability loses one of its upsides (other advantages like auditing still exist in this world) but also doesn’t have the downside of acceleration. The upside and downside go hand in hand, and Chris expects the upside to outweigh the downside.


Update: If you're interested in understanding Chris's current transparency and interpretability work, a good starting point is the Circuits Thread on Distill.


  1. In particular, this could be a way of getting traction on addressing gradient hacking. ↩︎

  2. As an example of the potential dangers of agents, more agentic AI setups seem much more prone to mesa-optimization. ↩︎

  3. A notable exception to this, however, is Eric Drexler’s “Reframing Superintelligence: Comprehensive AI Services as General Intelligence.” ↩︎

  4. An example of the sort of common circuit that appears in lots of different models that the Clarity team has found is the way in which convolutional neural networks stay reflection-invariant: to detect a dog, they separately detect leftwards-facing and rightwards-facing dogs and then union them together. ↩︎

  5. This results in a large portion of the field being focused on what is effectively microscope use, which could also be quite relevant for making microscope AIs more tenable. ↩︎

32 comments

Evan, thank you for writing this up! I think this is a pretty accurate description of my present views, and I really appreciate you taking the time to capture and distill them. :)

I’ve signed up for AF and will check comments on this post occasionally. I think some other members of Clarity are planning to do so as well. So everyone should feel invited to ask us questions.

One thing I wanted to emphasize is that, to the extent these views seem intellectually novel to members of the alignment community, I think it’s more accurate to attribute the novelty to a separate intellectual community loosely clustered around Distill than to me specifically. My views are deeply informed by the thinking of other members of the Clarity team and our friends at other institutions. To give just one example, the idea presented here as a “microscope AI” is deeply influenced by Shan Carter and Michael Nielsen’s thinking, and the actual term was coined by Nick Cammarata.

To be clear, not everyone in this community would agree with my views, especially as they relate to safety and strategic considerations! So I shouldn’t be taken as speaking on behalf of this cluster, but rather as articulating a single point of view within it.

Btw, we just pushed some basic subscriptions options. In the triple-dot menu for this post (at the top of the post), there's an option to 'subscribe to comments' and you'll get notified of new comments any time you go to the Alignment Forum, rather than having to check this page in particular.

Edit: There was actually a bug in the notifications system on the AI Alignment Forum when I wrote this comment. It's fixed now.

Subscribed! Thanks for the handy feature.

Sort of a side point, but something that's been helpful to me in this post and others in the past year is reconceptualizing the Fast/Slow takeoff into "Continuous" vs "Hard" takeoff, which suggest different strategic considerations. This particular post helped flesh out some of my models of what considerations are at play.

Is it a correct summary of the final point: "either this doesn't really impact the field, so it doesn't increase capabilities; or, it successfully moves the ML field from 'everything is opaque and terrifying' to 'people are at least trying to build models of what their software is doing', which is net positive for getting good practices for alignment into the mainstream"?

That's an interesting and clever point (although it triggers some sort of "clever argument" safeguard that makes me cautious of it). The main counterpoint that comes to mind is a possible world where "opaque AIs" just can't ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to "general intelligence/agency" without being reliable enough to be aligned.

Yep, I think that's a correct summary of the final point.

The main counterpoint that comes to mind is a possible world where "opaque AIs" just can't ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to "general intelligence/agency" without being reliable enough to be aligned.

Well, we know it's possible to achieve general intelligence via dumb black box search—evolution did it—and we've got lots of evidence for current black box approaches being quite powerful. So it seems unlikely to me that we "just can't ever achieve general intelligence" with black box approaches, though it could be that doing so is much more difficult than if you have more of an understanding.

Also, ease of aligning a particular AI design is a relative property, not an absolute one. When you say transparent approaches might not be "reliable enough to be aligned" you could mean that they'll be just as likely as black box approaches to be aligned, less likely, or that they won't be able to meet some benchmark threshold probability of safety. I would guess that transparency will increase the probability of alignment relative to not having it, though I would say that it's unclear currently by how much.

The way I generally like to think about this is that there are many possible roads we can take to get to AGI, with some being more alignable and some being less alignable and some being shorter and some being longer. Then, the argument here is that transparency research opens up additional avenues which are more alignable, but which may be shorter or longer. Even if they're shorter, however, they're still more alignable: if we end up taking the fastest available path without regard to safety, then making that fastest path a safer one is a win.

One thing I'd add, in addition to Evan's comments, is that the present ML paradigm and Neural Architecture Search are formidable competitors. It feels like there’s a big gap in effectiveness, where we’d need to make lots of progress for “principled model design” to be competitive with them in a serious way. The gap causes me to believe that we’ll have (and already have had) significant returns on interpretability before we see capabilities acceleration. If it felt like interpretability was accelerating capabilities on the present margin, I’d be a bit more cautious about this type of argumentation.

(To date, I think the best candidate for a capabilities success case from this approach is Deconvolution and Checkerboard Artifacts. I think it’s striking that the success was less about improving a traditional benchmark, and more about getting models to do what we intend.)

That's an interesting and clever point (although it triggers some sort of "clever argument" safeguard that makes me cautious of it).

I think it shouldn't be in the "clever argument" category, and the only reason it feels like that is because you're using the capabilities-alignment framework.

Consider instead this worldview:

The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won't work, or will have unintended side effects.

(This worldview can apply to far more than AI; e.g. it seems right in basically every STEM field. You might argue that putting things together randomly seems to work surprisingly well in AI, to which I say that it really doesn't, you just don't see all of the effort where you put things together randomly and it simply flat-out fails.)

The argument "it's good for people to understand AI techniques better even if it accelerates AGI" is a very straightforward non-clever consequence of this worldview.

Somewhat more broadly, I recommend being able to inhabit this other worldview. I expect it to be more useful / accurate than the capabilities / alignment worldview.

(Disclaimer: I believed this point before this post -- in fact I had several conversations with people about it back in May, when I was considering a project with potential effects along these lines.)

I'm not sure I understand the difference between this worldview and my own. (The phrase-in-italics in your comment seemed fairly integral to how I was thinking about alignment/capabilities in the first place).

This recent comment of yours seems more relevant as far as worldview differences go, i.e. 'if you expect discontinuous takeoff, then transparency is unlikely to do what you want'. (Some slightly more vague "what counts as a clever argument" disagreement might be relevant too, although I'm not sure I can state my worry crisply, nor am I really confident my worry is cogent.)

I don't have a strong position on the continuous/hard-takeoff debate and have updated a bit over the past year both on continuous-takeoff's plausibility as well as the value in shifting the AI field towards having clearer models of what they're building, generally. But insofar as I'm suspicious of this, it's mostly because I still put moderate probability on "some understanding here may be more dangerous than no understanding, precisely because it's enough to accomplish some things without accomplishing everything that you needed to."

some understanding here may be more dangerous than no understanding, precisely because it's enough to accomplish some things without accomplishing everything that you needed to.

Fwiw, under the worldview I'm outlining, this sounds like a "clever argument" to me, that I would expect on priors to be less likely to be true, regardless of my position on takeoff. (Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.)

I often think of this as penalizing nth-order effects in proportion to some quickly-growing function of n. (Warning: I'm using the phrase "nth-order effects" in a non-standard, non-technical way.)

Under the worldview I mentioned, the first-order effect of better understanding of AI systems is that you are more likely to build AI systems that are useful and do what you want.

The second-order effect is "maybe there's a regime where you can build capable-but-not-safe things; if we're currently below that, it's bad to go up into that regime". This requires a more complicated model of the world (given this worldview) and more assumptions of where we are.

(Also, now that I've written this out, the model also predicts there's no chance of solving alignment, because we'll first reach the capable-but-not-safe things, and die. Probably the best thing to do on this model is to race ahead on understanding as fast as possible, and hope we leapfrog directly to the capable-and-safe regime? Or you work on understanding AI in secret, and only release once you know how to do capable-and-safe, so that no one has the chance to work on capable-but-not-safe? You can see why this argument feels a bit off under the worldview I outlined.)

I was going to write a comment here, but it got a bit long so I made a post instead.

Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.

Minor question: could you clarify what you mean by "imputing the worldview" here? Do you mean something like, "operating within the worldview"? (I ask because this doesn't seem to be a use of "impute" that I'm familiar with.)

Do you mean something like, "operating within the worldview"?

Basically yes. Longer version: "Suppose we were in scenario X. Normally, in such a scenario, I would discard this worldview, or put low weight on it, because reason Y. But suppose by fiat that I continue to use the worldview, with no other changes made to scenario X. Then ..."

It's meant to be analogous to imputing a value in a causal Bayes net, where you simply "suppose" that some event happened, and don't update on anything causally upstream, but only reason forward about things that are causally downstream. (I seem to recall Scott Garrabrant writing a good post on this, but I can't find it now. ETA: Found it, it's here, but it doesn't use the term "impute" at all. I'm now worried that I literally made up the term, and it doesn't actually have any existing technical meaning.)
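A minimal sketch of the distinction being drawn here, assuming a hypothetical three-node chain U -> X -> Y with made-up probabilities (none of these numbers come from the discussion). What the comment calls "imputing" is more commonly called an intervention (Pearl's do-operator): observing X updates beliefs about the upstream node U, while setting X by fiat leaves U at its prior, and both give the same forward prediction for Y.

```python
from itertools import product

# Hypothetical binary chain U -> X -> Y; all numbers are illustrative assumptions.
P_U = {1: 0.2, 0: 0.8}                                     # prior on the upstream cause
P_X_GIVEN_U = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.1, 0: 0.9}}   # P_X_GIVEN_U[u][x]
P_Y_GIVEN_X = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.05, 0: 0.95}} # P_Y_GIVEN_X[x][y]


def joint(u, x, y):
    """Joint probability of one full assignment under the chain factorisation."""
    return P_U[u] * P_X_GIVEN_U[u][x] * P_Y_GIVEN_X[x][y]


def condition_on_x(x_val):
    """Observe X = x_val: beliefs about both U (upstream) and Y (downstream) update."""
    evidence = sum(joint(u, x_val, y) for u, y in product((0, 1), repeat=2))
    p_u1 = sum(joint(1, x_val, y) for y in (0, 1)) / evidence
    p_y1 = sum(joint(u, x_val, 1) for u in (0, 1)) / evidence
    return p_u1, p_y1


def intervene_on_x(x_val):
    """Set X = x_val by fiat: the U -> X edge is cut, so U keeps its prior."""
    p_u1 = P_U[1]
    p_y1 = P_Y_GIVEN_X[x_val][1]
    return p_u1, p_y1


observed = condition_on_x(1)
supposed = intervene_on_x(1)
print("observe X=1:  P(U=1)=%.3f  P(Y=1)=%.3f" % observed)
print("suppose X=1:  P(U=1)=%.3f  P(Y=1)=%.3f" % supposed)
```

In this toy chain the two operations agree about Y and disagree only about U, which is the "don't update on anything causally upstream" behaviour described above.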

It's meant to be analogous to imputing a value in a causal Bayes net

Aha! I thought it might be borrowing language from some technical term I wasn't familiar with. Thanks!

I expect it to be more useful / accurate than the capabilities / alignment worldview.

To note, I sort of interpreted the capabilities/alignment tradeoff as more related to things that enhance capabilities while providing essentially no greater understanding. Increasing compute is the primary example I can think of.

The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic—e.g. the entire Linux kernel—but that it would be faithful to the model and understandable by humans.

Does "faithful" mean "100% identical in terms of I/O", or more like "captures all of the important elements of"? My understanding is that neural networks are continuous whereas human-understandable code like the Linux kernel is discrete, so it seemingly just can't work in the former case, and I'm not sure how it can work in the latter case either.

Do you or Chris think that a test of this might be to take a toy model (say a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what's the earliest meaningful test that can be done?

I'm also concerned that combining ML, reverse compilation, and "giving feedback on process" essentially equals programming by nudging which just seems like a really inefficient way of programming. ETA: Is there an explanation of why this kind of ML would be better (in any sense of that word) than someone starting with a random piece of code and trying to end up with an AI by modifying it a little bit at a time?

ETA2: I wonder if Chris is assuming some future ML technique that learns a lot faster (i.e., is much more sample efficient) than what we have today, so that humans wouldn't have to give a lot of feedback on process, and "programming by nudging" wouldn't be a good analogy anymore.

Note: I work in Clarity at OpenAI. Chris and I have discussed this response (though I cannot claim to represent him).

Does "faithful" mean "100% identical in terms of I/O", or more like "captures all of the important elements of"?

I'd say faithfulness lies on a spectrum. Full IO determinism on a neural network is nearly impossible (given the vagaries of floating point arithmetic), but what is really of interest to us is “effectively identical IO”. A working definition could be: an interpretable network is one that can act as a drop-in replacement for the original network with no impact on final accuracy.

This allows some wiggle room in the weights: they can be rounded up or down, and weights that do not have any impact on final accuracy can be ablated. We are not, however, interested in creating interpretable approximations of the original network.
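As a rough illustration of that working definition (not the Clarity team's actual procedure), here is a toy sketch assuming a hypothetical three-weight linear classifier and a five-point dataset, both invented for the example. It rounds the weights, then zeroes any weight whose removal leaves accuracy unchanged, and accepts the result only if it still matches the original accuracy. Checking accuracy on a finite dataset is a much weaker test than true drop-in equivalence, so this should be read as a cartoon of the idea.

```python
# Hypothetical toy model and data; every number here is an illustrative assumption.
WEIGHTS = [2.013, -1.487, 0.06]
BIAS = -0.25
DATA = [
    ([1.0, 0.0, 1.0], 1),
    ([0.0, 1.0, 1.0], 0),
    ([1.0, 1.0, 0.0], 1),
    ([0.0, 0.0, 1.0], 0),
    ([1.0, 2.0, 0.0], 0),
]


def predict(weights, bias, x):
    """Binary linear classifier: 1 if the weighted sum plus bias is positive."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def accuracy(weights, bias, data):
    """Fraction of examples the classifier labels correctly."""
    return sum(predict(weights, bias, x) == y for x, y in data) / len(data)


def simplify(weights, bias, data, decimals=1):
    """Round weights, then ablate (zero) each weight whose removal keeps accuracy unchanged."""
    baseline = accuracy(weights, bias, data)
    candidate = [round(w, decimals) for w in weights]
    if accuracy(candidate, bias, data) != baseline:
        candidate = list(weights)  # rounding changed behaviour, so keep the originals
    for i in range(len(candidate)):
        trial = candidate[:i] + [0.0] + candidate[i + 1:]
        if accuracy(trial, bias, data) == baseline:
            candidate = trial
    return candidate


SIMPLIFIED = simplify(WEIGHTS, BIAS, DATA)
print(SIMPLIFIED)
print(accuracy(WEIGHTS, BIAS, DATA), accuracy(SIMPLIFIED, BIAS, DATA))
```

Here the third weight contributes nothing to the decisions on this dataset, so it is ablated, while the first two survive in rounded form.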

My understanding is that neural networks are continuous whereas human-understandable code like the Linux kernel is discrete, so it seemingly just can't work in the former case, and I'm not sure how it can work in the latter case either.

We are reasoning explicitly about numerical code, but I would argue this isn't that alien to human comprehension! Discrete code may be more intuitive (sometimes), but human cognition is certainly capable of understanding numerical algorithms (think of say, the SIFT algorithm)!

Do you or Chris think that a test of this might be to take a toy model (say a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what's the earliest meaningful test that can be done?

We are indeed working through this on a fairly sophisticated vision model. We're making progress!

Thanks for writing this!

I'm curious what's Chris's best guess (or anyone else's) about where to place AlphaGo Zero on that diagram. Presumably its place is somewhere after "Human Performance", but is it close to the "Crisp Abstractions" peak, or perhaps way further - somewhere in the realm of "Increasingly Alien Abstractions"?

Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it.

Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?

(Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it's safe?)

I'm curious what's Chris's best guess (or anyone else's) about where to place AlphaGo Zero on that diagram

Without the ability to poke around at AlphaGo -- and a lot of time to invest in doing so -- I can only engage in wild speculation. It seems like it must have abstractions that human Go players don’t have or anticipate. This is true of even vanilla vision models before you invest lots of time in understanding them (I've learned more than I ever needed to about useful features for distinguishing dog species from ImageNet models).

But I’d hope the abstractions are in a regime where, with effort, humans can understand them. This is what I expect the slope downwards as we move towards “alien abstractions” to look like: we’ll see abstractions that are extremely useful if you can internalize them, but take more and more effort to understand.

Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?

Yes, I believe that RL agents have a much wider range of accident concerns than supervised / unsupervised models.

Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it's safe?

Gurkenglas provided a very eloquent description that matches why I believe this. I’ll continue discussion of this in that thread. :)

To me, the important safety feature of "microscope AI" is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit). This feature is totally incompatible with agents (you can't vacuum the floor without modeling the consequences of your motor control settings), and optional for oracles [I'm using oracles in a broad sense of systems that you use to help answer your questions, leaving aside what their exact user interface is, so microscope AI is part of that]. For example, when Eliezer thinks about oracles he is not thinking this way; instead, he's thinking of a system that deliberately chooses an output to "increase the correspondence between the user's belief about relevant consequences and reality". But there's no reason in principle that we couldn't build a system that will not apply its intelligent world-model to analyze the downstream consequences of its outputs.

I think the only way to do that is to have its user interface not be created automatically as part of the training objective, but rather to build it in ourselves, separately. Then the two key questions are: What's the safe training procedure that results in an intelligent world-model, and what's the separate input-output interface that we're going to build? Both of these are open questions AFAIK. I wrote Self-Supervised Learning and AGI Safety laying out this big picture as I see it.

For the latter question, what is the user interface, "Use interpretability tools & visualizations on the world-model" seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision. I hope that they don't stop at feature extraction, but also pull out the relationships (causal, compositional, etc.) that we need to do counterfactual reasoning, planning etc., and even a "search through causal pathways to get desired consequences" interface. Incidentally, the people who think that brain-computer interfaces will help with AGI safety (cf waitbutwhy) seem to be banking on something vaguely like "microscope AI", but I haven't yet found any detailed discussion along those lines.

For the first question, what is the safe training procedure that incidentally creates a world-model, contra Gurkenglas's comment here, I think it's an open question whether a safe training procedure exists. For example, unsupervised (a.k.a. "self-supervised") learning as ofer suggests seems awfully safe but is it really? See Self-Supervised Learning and Manipulative Predictions; I half-joked there about burying the computer in an underground bunker, running self-supervised learning under homomorphic encryption, until training was complete; then cutting power, digging it out, and inspecting the world model. But even then, an ambitious misaligned system could potentially leave manipulative booby-traps on its hard drive. Gurkenglas's suggestion of telling it nothing about the universe (e.g. have it play Nomic) would make it possibly safer but dramatically less useful (it won't understand the cause of Alzheimer's etc.) And it can probably still learn quite a bit about the world by observing its own algorithm... I'm not sure. I'm still generally optimistic that a solution exists, and I hope that Gurkenglas and I and everyone else keep thinking about it. :-)

To me, the important safety feature of "microscope AI" is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit).

As I mentioned in this comment, not modeling the consequences of its output is actually exactly what I want to get out of myopia.

For the latter question, what is the user interface, "Use interpretability tools & visualizations on the world-model" seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision.

Yep; me too!

I hope that they don't stop at feature extraction, but also pull out the relationships (causal, compositional, etc.) that we need to do counterfactual reasoning, planning etc., and even a "search through causal pathways to get desired consequences" interface.

Chris (and the rest of Clarity) are definitely working on stuff like this!

unsupervised (a.k.a. "self-supervised") learning as ofer suggests seems awfully safe but is it really?

I generally agree that unsupervised learning seems much safer than other approaches (e.g. RL), though I also agree that there are still concerns. See for example Abram's recent "The Parable of Predict-O-Matic" and the rest of his Partial Agency sequence.

As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.

Oracles must somehow be incentivized to give useful answers

A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it's "incentivized to give useful answers" (e.g. an answer to the question: "what is the next word in the text?").

My understanding is that what distinguishes a microscope model is the way it is being used after it's already trained (namely, allowing researchers to look at its internals for the purpose of gaining insights etcetera, rather than making inferences for the sake of using its valuable output). If this is correct, it seems that we should only use safe training procedures for the purpose of training useful microscopes, rather than training arbitrarily capable models.

Our usual objective is "Make it safe, and if we aligned it correctly make it useful." A microscope is useful even if it's not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to learn about AI design from the inner optimizers it produces.

It could deduce that someone is trying to learn about AI design from its inner optimizers, and maybe it could deduce our laws of physics because they are the simplest ones that would try such, but quantum experiments show it cannot deduce its Everett branch.

Ideally, the tldrbot we set to interpret the results would use a random perspective onto the microscope so the attack also cannot be specialized on the perspective.

This post is fascinating, thank you very much Evan for writing it.

It seems like everyone has very different takes on how to figure out whether to keep working on something.

My sense reading this post is that Chris feels that making progress on the understandability of ML systems is something he's found a lot of traction on, and doesn't see a principled argument for why he won't continue to find traction until we reach full understandability.

My sense reading this post by Jessica Taylor is that she feels that making progress on the understandability of ML systems is something she failed to find a lot of traction on, and doesn't see a principled argument for why she would be able to reach full understandability.

And Paul says here

And I think there’s some basic research intuition about how much a problem– suppose you poke at a problem a few times, and you’re like ‘Agh, seems hard to make progress’. How much do you infer that the problem’s really hard? And I’m like, not much. As a person who’s poked at a bunch of problems, let me tell you, that often doesn’t work and then you solve in like 10 years of effort.

I think that’s a fair characterization of my optimism.

I think the classic response to me is “Sure, you’re making progress on understanding vision models, but models with X are different and your approach won’t work!” Some common values of X are not having visual features, recurrence, RL, planning, really large size, and language-based. I think that this is a pretty reasonable concern (more so for some Xs than others). Certainly, one can imagine worlds where this line of work hits a wall and ends up not helping with more powerful systems. However, I would offer a small consideration in the other direction: In 2013 I think no one thought we’d make this much progress on understanding vision models, and in fact many people thought really understanding them was impossible. So I feel like there’s some risk of distorting our evaluation of tractability by moving the goal post in these conversations.

I’m not surprised by other people feeling like they have less traction. I feel like the first three or so years I spent trying to understand the internals of neural networks involved a lot of false starts with approaches that ended up being dead ends (e.g. visualizing really small networks, or focusing on dimensionality reduction). DeepDream was very exciting, but in retrospect I feel like it took me another two or so years to really digest what it meant and how one could really use it as a scientific tool. And this is with the benefit of amazing collaborators and multiple very supportive environments.

One final thing I’d add is that, if I’m honest, I’m probably more motivated by aesthetics than optimism. I’ve spent almost seven years obsessed with the question of what goes on inside neural networks and I find the crazy partial answers we learn every year tantalizingly beautiful. I think this is pretty normal for early research directions; Kuhn talks about it a fair amount in The Structure of Scientific Revolutions.

We should be careful to separate two levels of understanding: (1) We can understand the weights and activations of a particular trained model, versus (2) We can understand why a particular choice of architecture, learning algorithm, and hyperparameters is a good (effective) choice for a given ML application.

I think that (1) is great for AGI safety, (2) does a lot for capabilities and not much for safety.

So bringing up Neural Architecture Search is not necessarily the most relevant thing, since NAS is about (2), not (1).

For my part, I'm expecting that the community will "by default" make progress on (2), such that researchers using Neural Architecture Search will naturally be outcompeted by researchers who understand why to use a certain architecture and hyperparameters. Whereas I feel like (1) is the very important thing that won't necessarily happen automatically, unless people like Chris Olah keep doing the hard work to make it a community priority.

Thanks for making that distinction, Steve. I think the reason things might sound muddled is that many people expect that (1) will drive (2).

Why might one expect (1) to cause (2)? One way to think about it is that, right now, most ML experiments optimistically give 1-2 bits of feedback to the researcher, in the form of whether their loss went up or down from a baseline. If we understand the resulting model, however, that could produce orders of magnitude more meaningful feedback about each experiment. As a concrete example, in InceptionV1, there is a cluster of neurons responsible for detecting 3D curvature and geometry that all form together in one very specific place. It's pretty suggestive that, if you wanted your model to have a better understanding of 3D curvature, you could add neurons there. So that's an example where richer feedback could, hypothetically, guide you.

Of course, it's not actually clear how helpful it is! We spent a bunch of time thinking about the model and concluded "maybe it would be especially useful on a particular dimension to add neurons here." Meanwhile, someone else just went ahead and randomly added a bunch of new layers and tried a dozen other architectural tweaks, producing much better results. This is what I mean about it actually being really hard to outcompete the present ML approach.

There's another important link between (1) and (2). Last year, I interviewed a number of ML researchers I respect at leading groups about what would make them care about interpretability. Almost uniformly, the answer was that they wanted interpretability to give them actionable steps for improving their model. This has led me to believe that interpretability will accelerate a lot if it can help with (2), but that's also the point at which it helps capabilities.

I really like this post, it's really rich in new ideas, from using transparency tools to deliberately design ML systems, to how interpretability might scale, to trying to reorient the field of ML to more safe and alignable designs, and a bunch more detail.

I also think that trying to get someone else's worldview and explain it is a really valuable practice, and it certainly seems like Evan is learning to put on a number of really interesting and unique hats, which is great. Chris in particular has affected how I think a bunch about making scientific progress, with his writing about distillation and work at Distill.pub.

So I've curated this post (i.e. it moves to the top of the frontpage, and gets emailed to all users who've signed up for curation emails).

 

  • Olah’s comment indicates that this is indeed a good summary of his views.
  • I think the first three listed benefits are indeed good reasons to work on transparency/interpretability. I am intrigued but less convinced by the prospect of ‘microscope AI’.
    • The ‘catching problems with auditing’ section describes an ‘auditing game’, and says that progress in this game might illustrate progress in using interpretability for alignment. It would be good to learn how much success the auditors have had in this game since the post was published.
    • One test of ‘microscope AI’: the go community has had a couple of years of the computer era, in which time open-source go programs stronger than AlphaGo have been released. This has indeed changed the way that humans think about go: seeing the corner variations that AIs tend to play has changed our views on which variations are good for which player, and seeing AI win probabilities conditioned on various moves, as well as the AI-recommended continuations, has made it easier to review games. Yet sadly, there has been to my knowledge no new go knowledge generated from looking at the internals of these systems, despite some visualization research being done (https://arxiv.org/pdf/1901.02184.pdf, https://link.springer.com/chapter/10.1007/978-3-319-97304-3_20). As far as I’m aware, we do not even know if these systems understand the combinatorial game theory of the late endgame, the one part of go that has been satisfactorily mathematized (and therefore unusually amenable to checking whether some program implements it). It’s not clear to me whether this is for a lack of trying, but this does seem like a setting where microscope AI would be useful if it were promising.
  • The post mostly focuses on the benefits of transparency/interpretability for AI alignment. However, as far as I’m aware, since before this post was published, the strongest argument against work in this direction has been the problem of tractability - can we actually succeed at understanding powerful AI systems?
    • The one argument that is given in the post is a claim that moderately-smarter-than-human AIs will have relatively crisp and human-like abstractions that humans will be able to understand. The evidence for this claim is that more powerful image classifiers seem to have closer-to-human concepts, but this doesn’t seem strong to me - these classifiers appear to use concepts that are close-to-human, but without mechanistic understanding of how these ‘concepts’ are actually constructed in the network, it’s difficult to reason about them with confidence. Furthermore, we know that network visualization techniques that involve finding images that maximize the activation of an internal neuron produce bizarre images that don’t look like anything given a wide enough search space, suggesting that we in fact don’t understand what these neurons are representing.
    • This relates to a problem with the argument: just because an ML system uses easy-to-understand concepts does not mean that those concepts will be easy to recover at the level of analysis that humans can access. For example, humans have concepts that are easy for humans to understand, but it is difficult for neuroscientists to recover the details of all of these concepts purely from neural wirings and chemical levels.
    • Since this post was published, Olah’s team has published work that does say something about the internals of AI systems (https://distill.pub/2020/circuits/), but this has involved manually inspecting all the weights and neurons, an approach that I do not expect to scale.
    • All in all, I do think there are promising research directions here, but I don’t think that there is enough publicly-available evidence for a reader of the post to be convinced of that without relying on deference to the researchers who are relatively optimistic about this work.
  • The post has some interesting thoughts about how to re-orient the field (or at least a sub-section of the field) around interpreting ML models.
    • I’m most intrigued by the argument that as state-of-the-art models get larger, this work will be comparatively more accessible to smaller groups.
    • I think Distill is an interesting step in the direction of making ML interpretation more academically rewarding, but at present, it seems to me that the effort-to-reward ratio is still unfavourably high relative to other work enterprising ML researchers can pursue.
    • At any rate, I’m interested in this re-orientation, and am unsure about whether there are promising ways to make it materialize. I hope that more work is done to make it happen, and that we will be able to know the results in five or so years from now.
  • Follow-up work that I’d be interested to see:
    • More academic research on extracting knowledge from the internals of ML systems.
    • LW posts arguing for the tractability of transparency and interpretability research as AI systems scale.
    • A more detailed post and model of what would succeed at re-orienting the field around interpreting powerful AI systems. Bonus points if the interventions only work if such a reorientation is actually a good idea.

I think that this post is a good description of a way of thinking about the usefulness of transparency and interpretability for AI alignment that I think is underrated by the LW-y AI safety community.

Bouncing off a comment upthread, I wrote a post here with some general thoughts on security of powerful systems. I also gave an argument that marginal transparency doesn't translate into marginal security. I've mostly not figured out what its relation is to the OP, and below I'll just say that with a few more words. Note that I'm a layman when it comes to the details of ML, so am speaking like a 101 undergraduate or something like that.

It sounds to me like the transparency work being published at Distill, and some of its longer term visions (as outlined in the OP), if successful, are going to substantially increase how well we understand ML systems, and how useful they are.

From the perspective of security that I discuss in the linked post, I feel like I don’t personally have a good enough understanding of what Chris and others think their work is, to be able to tell whether the transparency work being published in Distill will (as it progresses) make systems secure in the way I described in the link, and it’s plausibly a much further effort on top. Evan mentions the idea of turning an ML system into an understandable codebase the size of the Linux Kernel, which sounds breathtakingly ambitious and incredibly useful. Though, for example, it's typically very hard to make a codebase secure when you've been handed a system that didn't have security built in.

Relatedly, I don't feel I have a good understanding of the folks at Clarity's model of why (or whether) AI will not go well by default (to pick a concrete possibility: if we reach AGI by largely continuing the current trajectory of work that the field of ML is doing), where any big risks come from (to pick some clear possibilities: whether it’s broadly similar to Paul’s first model, second model, or neither), and what sort of technical work is required to prevent the bad outcomes. I'd be interested if Chris or anyone else working alongside him on Clarity feels like they can offer a crisp claim about how optimisation enters the system and what features of a transparent system mean that it's findable and removable with an appropriate amount of resources (though I don’t think that’s an easy ask or anything).

The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird intellectual ghostwriting. I can't think of a way around that, though.