This is the first post in a sequence exploring the argument that interpretability is a high-leverage research activity for solving the AI alignment problem.
This post contains important background context for the rest of the sequence. I'll give an overview of one of Holden Karnofsky's (2022) Important, actionable research questions for the most important century, which is the central question we'll be engaging with in this sequence. I'll also define some terms and compare this sequence to existing works.
If you're already very familiar with Karnofsky (2022) and interpretability, then you can probably skip to the second post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios.
This sequence is being written as a direct response to the following question from Karnofsky (2022):
“What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?” (full question details)
I'll refer to this throughout the sequence as the Alignment Research Activities Question.
In the details linked above for the Alignment Research Activities Question, Holden first discusses two categories of alignment research which are lacking in one way or another. He then presents a third category with some particularly desirable properties:
“Activity that is likely to be relevant for the hardest and most important parts of the problem, while also being the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”
He refers to this as "category (3)", but I'll use the term High-leverage Alignment Research since it's more descriptive and we'll be referring back to this concept often throughout the sequence.
We want to know more about which alignment research is in this category. Why? Further excerpts from Karnofsky (2022) help clarify:
“I think anything we can clearly identify as category (3) [that is, High-leverage Alignment Research] is immensely valuable, because it unlocks the potential to pour money and talent toward a relatively straightforward (but valuable) goal. [...] I think there are a lot of people who want to work on valuable-by-longtermist-lights AI alignment research, and have the skills to contribute to a relatively well-scoped research agenda, but don’t have much sense of how to distinguish category (3) from the others.
There’s also a lot of demand from funders to support AI alignment research. If there were some well-scoped and highly relevant line of research, appropriate for academia, we could create fellowships, conferences, grant programs, prizes and more to help it become one of the better-funded and more prestigious areas to work in.
I also believe the major AI labs would love to have more well-scoped research they can hire people to do.”
I won't be thoroughly examining research directions other than interpretability, except in cases where a hypothetical interpretability breakthrough would help another research direction move toward a potential solution to the alignment problem. So I don't expect this sequence to produce a complete comparative answer to the Alignment Research Activities Question.
But by investigating whether interpretability research is High-leverage Alignment Research, I hope to put together a fairly comprehensive analysis of interpretability research that could be useful to people considering investing their money or time into it. I also hope that someone trying to answer the larger Alignment Research Activities Question could use my work on interpretability in this sequence as part of a more complete, comparative analysis across different alignment research activities.
So in the next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, I'll be exploring whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."
Then, in a later post of this sequence, I'll explore whether interpretability has property #2 of High-leverage Alignment Research. That is, whether interpretability is "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"
First of all, what is interpretability?
I’ll borrow a definition (actually two) from Christoph Molnar’s Interpretable Machine Learning (the superscript numbers here are Molnar's footnotes, not mine - you can find what they refer to by following the link):
“A (non-mathematical) definition of interpretability that I like by Miller (2017)³ is: Interpretability is the degree to which a human can understand the cause of a decision. Another one is: Interpretability is the degree to which a human can consistently predict the model’s result⁴. The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.”
I also occasionally use the word “transparency” instead of “interpretability”, but I mean these to be synonymous.
This is the first post I’m aware of attempting to answer the Alignment Research Activities Question since Karnofsky (2022) put it forth.
However, there are several previous posts which explore interpretability at a high level and its possible impact on alignment. Many of the ideas in this post therefore aren’t original: they either draw from these earlier works or were arrived at independently.
Here are some of the relevant posts, and my comments on how they compare to the present sequence:
The next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, explores whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."
Many thanks to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy and Justis Mills (LessWrong editing service) for helpful discussions and feedback on earlier drafts of this post.
Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this, and for which I started this sequence as my capstone project.
Read the next post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
Karnofsky, Holden (2022): Important, actionable research questions for the most important century
Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to the post on the Effective Altruism Forum. Other times I'm referring to text that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.
The "most important century" part of the present sequence's name also draws its inspiration from Karnofsky (2022) and an earlier blog post series by the same author.
3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:
“...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.”
Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2, which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks."