This is a linkpost for our two recent papers: 1. An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927 2. An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928 This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...
Some people worry about interpretability research being useful for AI capabilities and potentially net-negative. As far as I was aware of, this worry has mostly been theoretical, but now there is a real world example: The hungry hungry hippos (H3) paper. Tl;dr: The H3 paper * Proposes an architecture for...
Finite factored sets are a new paradigm for talking about causality. You can use them to do some cool things you can’t do with Pearl’s causal graphs, for example inferring a causal arrow between two binary variables. Also, finite factored sets are a really neat mathematical structure: they are a...