Dan Braun

[Linkpost] Interpreting Language Model Parameters

by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, and Lee Sharkey

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD), and decompose the parameters of a small language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think...

May 5 · 164

[Paper] Stochastic Parameter Decomposition

by Lee Sharkey, Lucius Bushnaq, and Dan Braun

Abstract: A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition, a framework that has been proposed to resolve several issues with current decomposition methods, decomposes neural network parameters into a sum of sparsely used vectors...

Jun 27, 2025 · 47

Attribution-based parameter decomposition

by Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, and Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation: At Apollo, we've spent a lot of time thinking about how the computations...

Jan 25, 2025 · 109

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

by Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, and Nicholas Goldowsky-Dill

Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that...

Jul 18, 2024 · 127

Apollo Research 1-year update

by Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, Alex Meinke, and rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research. About Apollo Research: Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 2024 · 93

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

by Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, and Marius Hobbhahn

This is a linkpost for our two recent papers:
1. An exploration of using degeneracy in the loss landscape for interpretability: https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape: https://arxiv.org/abs/2405.10928
This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...

May 20, 2024 · 108

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing...

May 17, 2024 · 57

Dan Braun

Announcing Apollo Research

[Linkpost] Interpreting Language Model Parameters

[Interim research report] Taking features out of superposition with sparse autoencoders

Interpreting Neural Networks through the Polytope Lens
