Polysemantic Attention Head in a 4-Layer Transformer
Produced as part of the MATS Program, under the mentorship of Neel Nanda and Lee Sharkey.

Epistemic status: optimized to get the post out quickly, but we are confident in the main claims.

TL;DR: head 1.4 in attn-only-4l exhibits many different attention patterns, all of which are relevant to the model's performance.

Introduction

...
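Since the TL;DR centers on the attention patterns of a single head, here is a minimal sketch of how a reader could inspect them for themselves. It assumes the TransformerLens library and its pretrained attn-only-4l checkpoint; the prompt is a hypothetical placeholder, not one from our analysis.

```python
# Minimal sketch: inspect head 1.4's attention pattern in attn-only-4l.
# Assumes TransformerLens provides this checkpoint under this name.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-4l")

# Hypothetical example prompt, chosen only for illustration.
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Attention patterns for layer 1: shape [batch, n_heads, query_pos, key_pos].
pattern = cache["pattern", 1]
head_1_4 = pattern[0, 4]  # head 1.4 = layer 1, head index 4

# For each query token, show which key token head 1.4 attends to most.
str_tokens = model.to_str_tokens(prompt)
for q, tok in enumerate(str_tokens):
    k = head_1_4[q].argmax().item()
    print(f"{tok!r:>12} -> {str_tokens[k]!r} ({head_1_4[q, k]:.2f})")
```

Running this on a variety of prompts is one quick way to see, informally, that the head's attention behavior is not described by a single fixed pattern.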