A Pragmatic Vision for Interpretability
Executive Summary

* The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  * Trying to directly solve problems on the critical path to AGI going well [[1]]
  * Carefully choosing problems according to our comparative advantage
  * Measuring progress with empirical feedback on proxy tasks
* We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
  * Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
  * Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
  * See our companion piece for more on which research areas and theories of change we think are promising
* Why pivot now? We think that times have changed.
  * Models are far more capable, bringing new questions within empirical reach
  * We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others [[2]]
  * Most existing interpretability techniques struggle on today’s important behaviours, e.g. they involve large models, complex environments, agentic behaviour and long chains of thought
* Problem: It is easy to do research that doesn't make real progress.
  * Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
  * “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will st
Ah, I think I can stymie $M$ with 2 nonconstant advisors. Namely, let $A_1(n)=\frac{1}{2}-\frac{1}{n+3}$ and $A_2(n)=\frac{1}{2}+\frac{1}{n+3}$. We (setting up an adversarial $E$) precommit to setting $E(n)=0$ if $p(n)\ge A_2(n)$ and $E(n)=1$ if $p(n)\le A_1(n)$; now we can assume that $M$ always chooses $p(n)\in[A_1(n),A_2(n)]$, since this is better for $M$.
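As a quick sanity check (my own sketch, not part of the comment), the setup above is easy to simulate. The helper names (`A1`, `A2`, `E`, `b_prime`) and the tie-breaking rule for predictions strictly inside $[A_1(n),A_2(n)]$ are my assumptions:

```python
# Illustrative sketch of the adversarial environment E against a
# predictor M. Here M plays the midpoint p(n) = 1/2, which lies in
# [A1(n), A2(n)] for every n.

def A1(n):
    # Lower nonconstant advisor: 1/2 - 1/(n+3)
    return 0.5 - 1.0 / (n + 3)

def A2(n):
    # Upper nonconstant advisor: 1/2 + 1/(n+3)
    return 0.5 + 1.0 / (n + 3)

def E(n, p_n):
    # Precommitted environment: output 0 when the prediction is at
    # least A2(n), output 1 when it is at most A1(n). For p strictly
    # inside the interval this sketch arbitrarily outputs 0
    # (assumption: the comment does not specify a tie-break).
    if p_n >= A2(n):
        return 0
    if p_n <= A1(n):
        return 1
    return 0

def b_prime(i, j, p_j, e_j):
    # b'_i(j) = |A_i(j) + E(j) - 1| - |p(j) + E(j) - 1|
    A = A1(j) if i == 1 else A2(j)
    return abs(A + e_j - 1) - abs(p_j + e_j - 1)
```

With the constant predictor $p(n)=\tfrac{1}{2}$, each round contributes $b'_1(j)=\tfrac{1}{2}-A_1(j)=\tfrac{1}{j+3}$, so $b_1(n)$ grows like $\log n$, in line with the goal of forcing $b_1$ to diverge.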
Now define $b'_i(j)=|A_i(j)+E(j)-1|-|p(j)+E(j)-1|$ and $b_i(n)=\sum_{j<n}b'_i(j)$. Note that if we also define $\mathrm{bad}_i(n)=\sum_{j<n}\bigl(\log|A_i(j)+E(j)-1|-\log|p(j)+E(j)-1|\bigr)$, writing $\mathrm{bad}'_i(j)$ for its $j$-th summand, then $\sum_{j<n}\bigl|2b'_i(j)-\mathrm{bad}'_i(j)\bigr|\le\sum_{j<n}\bigl(2A_1(j)-1-\log(2A_1(j))\bigr)=\sum_{j<n}O\bigl((\tfrac{1}{2}-A_1(j))^2\bigr)$ is bounded; therefore if we can force $b_1$
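To spell out why that last sum is bounded (my filling-in of the step, assuming $A_1(n)=\tfrac{1}{2}-\tfrac{1}{n+3}$ as above):

```latex
% Setting x = 2A_1(j), so x -> 1, a Taylor expansion of log around 1 gives
%   x - 1 - \log x = \tfrac{1}{2}(x-1)^2 + O\bigl((x-1)^3\bigr),
% so each summand is O((1/2 - A_1(j))^2). Since 1/2 - A_1(j) = 1/(j+3),
% the series is dominated by a convergent p-series:
\[
  \sum_{j<n} \Bigl(2A_1(j) - 1 - \log\bigl(2A_1(j)\bigr)\Bigr)
  = \sum_{j<n} O\!\left(\frac{1}{(j+3)^2}\right)
  \le C \sum_{j=0}^{\infty} \frac{1}{(j+3)^2} < \infty .
\]
```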