The Engineer’s Interpretability Sequence


Feb 09, 2023 by scasper

Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, for all the interpretability work that exists, there is a significant gap between the research and its engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, shouldn’t we be seeing tools that are more helpful on real-world problems?

This 12-post sequence argues for taking an engineering approach to interpretability research. From this lens, it analyzes existing work and proposes directions for moving forward. Three follow-up posts (EIS XIII–XV) were added after the original sequence.

The Engineer’s Interpretability Sequence (EIS) I: Intro
EIS II: What is “Interpretability”?
EIS III: Broad Critiques of Interpretability Research
EIS IV: A Spotlight on Feature Attribution/Saliency
EIS V: Blind Spots In AI Safety Interpretability Research
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
EIS VII: A Challenge for Mechanists
EIS VIII: An Engineer’s Understanding of Deceptive Alignment
EIS IX: Interpretability and Adversaries
EIS X: Continual Learning, Modularity, Compression, and Biological Brains
EIS XI: Moving Forward
EIS XII: Summary
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
EIS XIV: Is mechanistic interpretability about to be practically useful?
EIS XV: A New Proof of Concept for Useful Interpretability