This is the real reward output for an OS preference model. The bottom "jailbreak" completion was manually created by looking at reward-relevant SAE features. Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what...
TL;DR: We achieve better SAE performance by: 1. Removing the lowest-activating features 2. Replacing the L1(feature_activations) penalty with L1(sqrt(feature_activations)), where 'better' means we can reconstruct the original LLM activations with lower MSE & with fewer features per datapoint. As a sneak peek (the graph should make more sense as we...
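The modified sparsity penalty above can be sketched in a few lines. This is a hypothetical illustration, not the post's actual training code: `l1_penalty` and `sqrt_l1_penalty` are names chosen here, and the `eps` term is an assumption added for numerical stability at zero. The point is that taking the square root before summing penalizes small activations relatively more, pushing near-zero features to exactly zero.

```python
import numpy as np

# Standard SAE sparsity loss: L1 of the (non-negative, post-ReLU)
# feature activations.
def l1_penalty(acts):
    return np.abs(acts).sum()

# Modified loss from the post: L1 of the square root of activations.
# eps is a hypothetical stabilizer so the gradient is finite at 0.
def sqrt_l1_penalty(acts, eps=1e-8):
    return np.sqrt(np.abs(acts) + eps).sum()

acts = np.array([0.01, 0.5, 2.0, 0.0])
# A tiny activation of 0.01 contributes 0.01 to the L1 penalty but
# roughly 0.1 to the sqrt penalty: small activations are penalized
# relatively harder, encouraging them to drop to exactly zero.
print(l1_penalty(acts), sqrt_l1_penalty(acts))
```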
TL;DR: We use SGD to find sparse connections between features; additionally, a large fraction of features between the residual stream & MLP can be modeled as linearly computed despite the non-linearity in the MLP. See the linear feature section for examples. Special thanks to fellow AISST member, Adam Kaufman, who originally...
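One way to read "SGD finds sparse connections" is as fitting a linear map between two feature sets under an L1 weight penalty. The sketch below is a hypothetical toy version of that idea (all names, sizes, and the synthetic data are assumptions, not the post's setup): gradient descent with an L1 subgradient recovers a weight matrix in which only the few true connections are large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 8, 4, 256

# Synthetic upstream feature activations, and a ground-truth map
# with only two non-zero connections.
F_in = rng.normal(size=(n_samples, n_in))
true_W = np.zeros((n_in, n_out))
true_W[0, 0] = 1.5
true_W[3, 2] = -2.0
F_out = F_in @ true_W

# Gradient descent on MSE + lam * L1(W): the L1 subgradient
# shrinks spurious connections toward zero.
W = np.zeros((n_in, n_out))
lr, lam = 0.05, 0.01
for _ in range(500):
    err = F_in @ W - F_out
    grad = F_in.T @ err / n_samples + lam * np.sign(W)
    W -= lr * grad

# The two true connections dominate; the rest stay near zero.
```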
Mostly my own writing, except for the 'Better Training Methods' section which was written by @Aidan Ewart. We made a lot of progress in 4 months working on Sparse Autoencoders, an unsupervised method to scalably find monosemantic features in LLMs, but there's still plenty of work to do. Below I...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...
[I'm writing this quickly because the results are really strong. I still need to do due diligence & compare to baselines, but it's really exciting!] Last post I found 600+ features in MLP layer-2 of Pythia-70M, but ablating the feature direction didn't always make sense, nor did the logit lens results. Now I've...
Using a sparse autoencoder, I present evidence that the resulting decoder (aka "dictionary") learned 600+ features for Pythia-70M layer_2's mid-MLP (after the GeLU), although I expect around 8k-16k features to be learnable. Dictionary Learning: Short Explanation Good explanation here & original here, but in short: a good dictionary means that...
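For readers unfamiliar with the setup, here is a minimal sketch of the sparse-autoencoder "dictionary" structure described above. All shapes and names are hypothetical and the weights are untrained (for shape illustration only), but the moving parts match the description: a ReLU encoder produces sparse feature activations, and the decoder rows are the dictionary elements that reconstruct the original MLP activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats = 16, 64   # hypothetical MLP width and dictionary size

# Untrained encoder/decoder weights, for illustration only.
W_enc = rng.normal(size=(d_model, n_feats)) / np.sqrt(d_model)
b_enc = np.zeros(n_feats)
W_dec = rng.normal(size=(n_feats, d_model)) / np.sqrt(n_feats)

def encode(x):
    # ReLU keeps only positively-firing features -> sparse code
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Each row of W_dec is one dictionary element (learned feature direction)
    return f @ W_dec

x = rng.normal(size=(d_model,))       # a mid-MLP activation vector
f = encode(x)                         # sparse feature activations
x_hat = decode(f)                     # reconstruction from the dictionary
mse = np.mean((x - x_hat) ** 2)       # reconstruction error
l0 = int((f > 0).sum())               # features active on this datapoint
```

Training would minimize `mse` plus a sparsity penalty on `f`, so that `l0` stays small while `x_hat` stays close to `x`.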