The previous two posts highlighted some problematic scenarios for mech-interp, which we use as our example of a more general problem in AI safety. In this post we zoom out to that more general problem before proposing our solution. We can characterize the more general problem, inherent in the causal–mechanistic paradigm,...
The previous post highlighted some salient problems for the causal–mechanistic paradigm we sketched out. Here, we'll expand on this with some plausible future scenarios that further weaken the paradigm's reliability in safety applications. We first briefly refine our critique and outline the scenario progression.

Outline

We contend that the causal–mechanistic...
Mechanistic Interpretability

Many of you will be familiar with the material in this section; if so, feel free to skip ahead to the next. The field of mechanistic interpretability (MI) is not a single, monolithic research program but rather a rapidly evolving collection of methods, tools, and research agendas. These are united by the shared ambition...
Tunings/Preamble

Notes on the programme

What follows is an overview of the upcoming sequence. It is an edited version of the research I have been developing over the past year or so. Most recent version here. The initial position paper (co-authors Chris Pang, Sahil K) was presented at the Tokyo...