Abstract A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition— a framework that has been proposed to resolve several issues with current decomposition methods—decomposes neural network parameters into a sum of sparsely used vectors...
This is a blogpost version of a talk I gave earlier this year at GDM. Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that may be reasonable to disagree with and is, in an important sense, subjective. But overall I think...
TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences....
This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation At Apollo, we've spent a lot of time thinking about how the computations...
Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback! TL;DR: * Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable...
Why we made this list: * The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that...
This work was produced as part of Lee Sharkey's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort Intro and Motivation Sparse dictionary learning (SDL) has attracted a lot of attention recently as a method for interpreting transformer activations. They demonstrate that model activations can often...