
Interpretability Research for the Most Important Century

Apr 25, 2022 by Evan R. Murphy

This series of posts attempts to answer the following question from Holden Karnofsky's Important, actionable research questions for the most important century (which also inspired the name of this sequence):

“What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”

As one answer to Holden's question, I explore the argument that interpretability research is one of these high-leverage activities in AI alignment research.

 

The featured image was created with DALL·E.

Posts in this sequence:

1. Introduction to the sequence: Interpretability Research for the Most Important Century — Evan R. Murphy
2. Interpretability's Alignment-Solving Potential: Analysis of 7 Scenarios — Evan R. Murphy