Stefan Heimersheim

Stefan Heimersheim. Mechanistic interpretability & AI safety researcher, previously at FAR.AI and Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

You can remove GPT2’s LayerNorm by fine-tuning for an hour

[Interim research report] Activation plateaus & sensitive directions in GPT2

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

Attribution-based parameter decomposition

Detecting Strategic Deception Using Linear Probes

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Apollo Research 1-year update

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks