Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda (*equal contribution)

TL;DR
* We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations, including unsupervised metrics and...
Work performed as part of Neel Nanda's MATS 6.0 (Summer 2024) training program.

TL;DR: This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many...
Please check out our notebook to recreate the figures and to examine your own model for clean-up behavior. Produced as part of ARENA 2.0 and the SERI ML Alignment Theory Scholars Program - Spring 2023 Cohort.

Fig 5: Correlation between DLA of writer head and DLA of [clean-up heads output dependent...
Overview
* Solving the problem of mesa-optimization would probably be easier if we understood how models perform search internally.
* We are training GPT-type models on the toy task of solving mazes and studying them from both a mechanistic interpretability and a behavioral perspective.
* This post lays out our model...