TLDR: We find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning. This work is the result of Jacob and Santiago's two-week research sprint as part of Neel Nanda's training phase for MATS...
Abstract: We are interested in model diffing, i.e. finding what is new in the chat model compared to the base model. One way to do this is to train a crosscoder, which means training an SAE on the concatenation of the activations in a given layer of the base and...
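
To make the two setups mentioned in these summaries concrete, here is a minimal sketch of how the SAE training inputs might be built from one pair of same-layer activations. The function names and the toy vectors are hypothetical, and the sign convention (chat minus base) is an assumption, since the summaries only say "difference". This is not the authors' code.

```python
from typing import List

Vector = List[float]


def diff_input(base_act: Vector, chat_act: Vector) -> Vector:
    """Input for a difference SAE: chat activation minus base activation.

    The sign convention is an assumption made for this sketch.
    """
    if len(base_act) != len(chat_act):
        raise ValueError("activations must have the same dimension")
    return [c - b for b, c in zip(base_act, chat_act)]


def crosscoder_input(base_act: Vector, chat_act: Vector) -> Vector:
    """Input for a crosscoder-style SAE: the two activations concatenated."""
    if len(base_act) != len(chat_act):
        raise ValueError("activations must have the same dimension")
    return list(base_act) + list(chat_act)


# Toy activations for a single token at one layer (made-up numbers).
base = [1.0, 2.0, 3.0]
chat = [1.5, 2.0, 2.0]

delta = diff_input(base, chat)          # same dimension as one model's activation
stacked = crosscoder_input(base, chat)  # twice that dimension
```

With the difference input, directions where the two models agree cancel out before the SAE ever sees them, whereas the concatenated input keeps both activations in full and doubles the input width.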