This work was conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR), under the mentorship of Nina Rimsky.

TLDR / Summary

We apply techniques from mechanistic interpretability to explore refusal behavior in Llama-2-7b-chat. We identify a small set of attention heads that, when patched, are...
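To make the patching idea concrete, here is a minimal toy sketch of activation patching, not the actual experimental code: a "model" is just a chain of functions (stand-ins for attention-head outputs), and we cache one component's activation from a clean run and substitute it into a second run. All names and the toy model itself are illustrative assumptions.

```python
# Toy sketch of activation patching (illustrative only, not Llama-2 code).
# We cache the activation of one "component" on run A and overwrite the
# same component's activation during run B.

def run_model(layers, x, patch=None):
    """Run layers in sequence; optionally overwrite layer i's output.

    patch is (layer_index, cached_activation) or None.
    Returns the final output and the list of per-layer activations.
    """
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and patch[0] == i:
            x = patch[1]  # substitute the cached activation
        acts.append(x)
    return x, acts

# A toy 2-"layer" model on scalars, standing in for attention heads.
layers = [lambda x: x + 1, lambda x: x * 2]

clean_out, clean_acts = run_model(layers, 0)              # "clean" run
corrupt_out, _ = run_model(layers, 5)                     # "corrupted" run
# Patch layer 0's clean activation into the corrupted run:
patched_out, _ = run_model(layers, 5, patch=(0, clean_acts[0]))
print(clean_out, corrupt_out, patched_out)  # → 2 12 2
```

Because patching layer 0 fully restores the clean output here, that component "mediates" the behavior in this toy setting; the real experiments apply the same logic to individual attention-head outputs in the transformer's residual stream.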