Against Almost Every Theory of Impact of Interpretability
Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong, preferring that readers disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important, because safety ideas are boring and anti-memetic. So let's go!

Many thanks to @scasper, @Sid Black, @Neel Nanda, @Fabien Roger, @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.

When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability by Neel Nanda, but I later expanded the scope of my critique. Some of the ideas presented here are not endorsed by anyone; to explain the difficulties, I still need to 1. explain them and 2. criticize them. This gives the post an adversarial vibe. I'm sorry about that, and I think that doing interpretability research, even if it's no longer what I consider a priority, is still commendable.

How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?", which can mostly be skipped at first. I expect this document to also be useful for people who are not doing interpretability research. The different sections are mostly independent, and I've added a lot of bookmarks to help modularize this post.

If you have very little time, just read the following (this is also the part where I'm most confident):

* Auditing deception with Interp is out of reach (4 min)
* Enumerative safety critique (2 min)
* Technical Agendas with better Theories of Impact (1 min)

Here is the list of claims that I will defend (bolded sections are the most important ones):

* The overall Theory of Impact is quite poor
* Interp is not a good predictor of future systems
* Auditing deception with Interp is out of reach