keith_wynroe — AI Alignment Forum

333

Ω

42

5

34

4y

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

This work was produced as part of Lee Sharkey's stream in the ML Alignment & Theory Scholars Program, Winter 2023-24 Cohort. Intro and Motivation: Sparse dictionary learning (SDL) has recently attracted a lot of attention as a method for interpreting transformer activations. SDL methods demonstrate that model activations can often...

Jul 2, 2024 • 87

An OV-Coherent Toy Model of Attention Head Superposition

by Lauren Greenspan and keith_wynroe

Background: This project was inspired by Anthropic’s post on attention head superposition, which constructed a toy model trained to learn a circuit to identify skip-trigrams that are OV-incoherent (attending from multiple destination tokens to a single source token) as a way to ensure that superposition would occur. Since the OV...

Aug 29, 2023 • 26