Santiago Aranguri — AI Alignment Forum

Top posts

SAE on activation differences

TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning. This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS...

Jun 30, 2025•45

Tied Crosscoders: Explaining Chat Behavior from Base Model

Abstract: We are interested in model-diffing: finding what is new in the chat model when compared to the base model. One way of doing this is training a crosscoder, which would just mean training an SAE on the concatenation of the activations in a given layer of the base and...

Mar 22, 2025•9