Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model, all-layer crosscoders, and found that the technique remains effective with cross-layer features.
This post documents our methodology. We fine-tuned a TinyStories language model to show sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how they change during the fine-tuning process. Running all training and experiments takes under an hour on a single RTX 4090 GPU.
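The diffing step above reduces to comparing each crosscoder feature's decoder direction before and after fine-tuning. As an illustrative sketch (not our exact pipeline; matrix names and sizes here are made up), one can rank features by how far their decoder vectors rotate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 512, 64

# Hypothetical decoder matrices of a crosscoder before and after fine-tuning.
W_dec_base = rng.normal(size=(n_features, d_model))
W_dec_ft = W_dec_base + 0.01 * rng.normal(size=(n_features, d_model))
W_dec_ft[:8] += rng.normal(size=(8, d_model))  # pretend a few features changed a lot

def cosine(a, b):
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )

# 1 - cosine similarity per feature: higher means the feature moved more.
change = 1.0 - cosine(W_dec_base, W_dec_ft)
top = np.argsort(-change)[:8]  # candidate features to inspect manually
print(sorted(top.tolist()))
```

Features surfacing at the top of such a ranking are the ones worth reading off against dataset examples.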
We release code for training and analysing sleeper agents and crosscoders, along with a set of trained models, on GitHub here.
Problem: Sparse crosscoders are powerful tools for compressing neural network representations into interpretable features. However, we don’t understand how features interact.
Perspective: We need systematic procedures to measure and rank nonlinear feature interactions. This will help us identify which interactions deserve deeper interpretation. Success can be measured by how useful these metrics are for applications like model diffing and finding adversarial examples.
Starting contribution: We develop a procedure based on compact proofs. Working backwards from the assumption that features are linearly independent, we derive mathematical formulations for measuring feature interactions in ReLU and softmax attention.
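To make the linear-independence starting point concrete, here is a deliberately crude stand-in for the kind of quantity involved (not the paper's actual derivation): if a ReLU layer treated two feature directions independently, its output on their sum would equal the sum of its outputs, so the gap measures their interaction through the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
f_i, f_j = rng.normal(size=d), rng.normal(size=d)  # two hypothetical feature directions

def relu(x):
    return np.maximum(x, 0.0)

# Deviation from additivity: zero iff the two directions never interact
# through the ReLU on this input; nonzero values indicate interaction.
interaction = np.linalg.norm(relu(f_i + f_j) - relu(f_i) - relu(f_j))
print(float(interaction) > 0.0)
```

A ranking of feature pairs by such a score is one way to decide which interactions merit closer interpretation.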
Training a sparse crosscoder (henceforth simply a crosscoder) can be thought of as stacking the activations of the residual stream across all layers and training an SAE on the stacked vector. This gives us a mapping between model activations and compressed, interpretable features.
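In shapes, the stacking looks like the following minimal sketch (random weights and made-up dimensions, standing in for a trained crosscoder):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_feats = 4, 64, 256
d_stack = n_layers * d_model

# Residual-stream activations at each layer for one token (illustrative values).
acts = rng.normal(size=(n_layers, d_model))
x = acts.reshape(d_stack)  # stack the per-layer activations into one vector

# SAE-style encoder/decoder over the stacked vector.
W_enc = rng.normal(size=(d_stack, n_feats)) / np.sqrt(d_stack)
b_enc = np.zeros(n_feats)
W_dec = rng.normal(size=(n_feats, d_stack)) / np.sqrt(n_feats)

# Nonnegative feature activations (sparse after training with a sparsity
# penalty; dense here because the weights are random).
f = np.maximum(x @ W_enc + b_enc, 0.0)
x_hat = (f @ W_dec).reshape(n_layers, d_model)  # per-layer reconstruction
print(f.shape, x_hat.shape)
```

Each decoder row thus carries one direction per layer, which is what makes the learned features cross-layer.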
We recently released a paper on using mechanistic interpretability to generate compact formal guarantees on model performance. In this companion blog post to our paper, we'll summarize the paper and flesh out some of the motivation and inspiration behind our work.
...In this work, we propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, ...
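The baseline proof strategy in this setting is exhaustive: enumerate every input, so the measured accuracy is itself a certified lower bound. A toy illustration on a made-up Max-of-2 "model" (not one of the paper's transformers) shows the idea; the cost of this strategy grows with vocabulary size to the K-th power, which is what compact, mechanistic proofs avoid.

```python
from itertools import product

vocab = range(8)

def model_predict(xs):
    # Imperfect hypothetical model: scores tokens by value but cannot
    # distinguish tokens 6 and 7, so it sometimes mispredicts the max.
    score = lambda t: min(t, 6)
    return max(xs, key=score)  # ties resolve to the first occurrence

# Exhaustive "proof": check the model on every possible length-2 input.
total = correct = 0
for xs in product(vocab, repeat=2):
    total += 1
    correct += model_predict(xs) == max(xs)
print(correct, "/", total)  # exact accuracy, hence a valid certified lower bound
```

Compact proofs trade some tightness of this bound for dramatically shorter certificates that lean on the model's internal mechanism.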