SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features
Abstract: Sparse Autoencoders (SAEs) linearly extract interpretable features from a large language model's intermediate representations. However, the basic dynamics of SAEs, such as the activation values of SAE features and the encoder and decoder weights, have not been as extensively visualized as their implications. To shed light on the properties...
Feb 26, 2025