AI ALIGNMENT FORUM

Joern Stoehler

Joern Stoehler — AI Alignment Forum

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

by Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, and Marius Hobbhahn

This is a linkpost for our two recent papers:

1. An exploration of using degeneracy in the loss landscape for interpretability: https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape: https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...

May 20, 2024 • 108