Magdalena Wache

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

This is a linkpost for our two recent papers:

1. An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...

May 20, 2024•108

Interpretability Externalities Case Study - Hungry Hungry Hippos

Some people worry that interpretability research could be useful for AI capabilities and potentially net-negative. As far as I was aware, this worry has mostly been theoretical, but now there is a real-world example: the Hungry Hungry Hippos (H3) paper. TL;DR: The H3 paper * Proposes an architecture for...

Sep 20, 2023•64

Finite Factored Sets in Pictures

Finite factored sets are a new paradigm for talking about causality. You can use them to do some cool things you can’t do with Pearl’s causal graphs, for example inferring a causal arrow between two binary variables. Also, finite factored sets are a really neat mathematical structure: they are a...

Dec 11, 2022•186