Alignment Stream of Thought




We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.

One reason we might want a progressive code is it would basically let you train one autoencoder and use it for any k you wanted to at test time (which is nice because we don't really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders.

Also, even with a progressive code, the activations on the margin would not generally be negative (we actually apply a ReLU to make sure that the activations are definitely non-negative, but the (k+1)th value is almost always still substantially positive).
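To make the TopK-plus-ReLU behavior concrete, here is a minimal pure-Python sketch. It is not the code from the paper: the function name `topk_activation` and the pre-activation values are made up for illustration. It keeps the k largest pre-activations, applies a ReLU to them, and zeroes the rest, then evaluates the same vector at several k the way a progressive code would be used at test time.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (passed through a ReLU); zero the rest.

    Illustrative sketch only; `pre_acts` is a plain list of floats.
    """
    # Indices sorted by value, largest first; sorted() is stable, so ties
    # are broken in favor of the earlier index.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    # The ReLU guarantees non-negative outputs even if a kept value is negative.
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


# Made-up pre-activations for a single token.
pre_acts = [0.1, 2.3, -0.4, 1.7, 0.9, 1.2]

# Same encoder output, different k at test time.
for k in (1, 2, 3):
    print(k, topk_activation(pre_acts, k))
```

In this toy vector the value just outside the top 2 is 1.2, which is still well above zero, so the ReLU only matters in the rare case where a selected value is negative.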


To add some more concreteness: suppose we open up the model and find that it's basically just a giant k-nearest-neighbors lookup (it obviously can't be literally this, but this is the easiest analogy to describe). Then this would explain why current alignment techniques work and dissolve some of the mystery of generalization. Now suppose we create AGI and find that it does something very different internally, something more deeply entangled that we can't really make sense of because it's too complicated. Then this would imo also provide strong evidence that we should expect our alignment techniques to break.

In other words, a load bearing assumption is that current models are fundamentally simple/modular in some sense that makes interpretability feasible, and that observing this breaking in the future is probably important evidence that will hopefully come before those future systems actually kill everyone.


Thanks for your kind words!

My views on interpretability are complicated by the fact that I think it's quite probable there will be a paradigm shift between current AI and the thing that is actually AGI like 10 years from now or whatever. So I'll first describe a rough sketch of what I think within-paradigm interp looks like, and then what it might imply for AGI 10 years later. (All these numbers will be very low confidence and basically made up.)

I think the autoencoder research agenda is currently making significant progress on item #1. The main research bottlenecks here are (a) SAEs might not be able to efficiently capture every kind of information we care about (e.g. circular features) and (b) residual stream autoencoders are not exactly the right thing for finding circuits. Probably this stuff will take a year or two to really hammer out. Hopefully our paper helps here by giving a recipe to push autoencoders really quickly, so we bump into the limitations faster and with less second-guessing about autoencoder quality.

Hopefully #4 can be done to a large extent in parallel with #1; there's a whole bunch of engineering needed to e.g. take autoencoders and scale them up to capture all the behavior of the model (which was also a big part of the contribution of this paper). I'm pretty optimistic that if we have a recipe for #1 that we trust, the engineering (and efficiency improvements) for scaling up is doable. Maybe this adds another year of serial time. The big research uncertainty here fmpov is how hard it is to actually identify the structures we're looking for, because we'll probably have a tremendously large sparse network where each node does some really boring tiny thing.

However, I mostly expect that GPT-4 (and probably 5) is just not actually doing anything super spicy/stabby. So I think most of the value of doing this interpretability will be to sort of pull back the veil, so to speak, on how these models are doing all the impressive stuff. Some theories of impact:

  • Maybe we'll become less confused about the nature of intelligence in a way that makes us just have better takes about alignment (e.g. there will be many mechanistic theories of what the heck GPT-4 is doing that will have been conclusively ruled out)
  • Maybe once the paradigm shift happens, we will be better prepared to identify exactly what interpretability assumptions it broke (or even just notice whether some change is causing a mechanistic paradigm shift)

Unclear what timeline these later things happen on; probably depends a lot on when the paradigm shift(s) happen.


For what it's worth, it seems much more likely to me for catastrophic Goodhart to happen because the noise isn't independent from the thing we care about, rather than the noise being independent but heavy tailed.


It doesn't seem like a huge deal to depend on the existence of smaller LLMs - they'll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction-based metric, because different parts of the model actually just contain differently important information.


Sorry, I meant the Anthropic-like neuron resampling procedure.

I think I misread Neel's comment; I thought he was saying that 131k was chosen because larger autoencoders would have too many dead latents (as opposed to this only being for Pythia residual).


Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find that ablating to zero completely destroys the loss. So it's unclear whether 90% recovered on this metric actually means that much - GPT-2 probably recovers 90% of the loss of GPT-4 under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. I feel like a more natural measure might be, for example, the equivalent compute efficiency hit.
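A quick arithmetic sketch of why the baseline matters. It assumes the usual definition of the metric (the gap between the ablated loss and the spliced-in loss, divided by the gap between the ablated loss and the clean loss). The function name and every number below are hypothetical, chosen only to show the effect.

```python
def fraction_loss_recovered(clean_loss, spliced_loss, baseline_loss):
    """Fraction of the (baseline - clean) loss gap that the reconstruction closes.

    Assumed definition; `baseline_loss` is the loss under whatever ablation
    is used as the reference point (e.g. ablate-to-zero).
    """
    return (baseline_loss - spliced_loss) / (baseline_loss - clean_loss)


# Hypothetical losses in nats/token, NOT measured values.
clean = 2.0            # unmodified model
spliced = 3.0          # model with the reconstruction spliced in
zero_ablation = 12.0   # ablate-to-zero wrecks the loss
milder_baseline = 4.0  # some gentler reference ablation, also hypothetical

print(fraction_loss_recovered(clean, spliced, zero_ablation))    # vs. zero ablation
print(fraction_loss_recovered(clean, spliced, milder_baseline))  # vs. milder baseline
```

With these made-up numbers, the same 1.0 nat degradation scores 0.9 against the zero-ablation baseline but only 0.5 against the milder one, which is the sense in which a very bad baseline makes the percentage easy to inflate.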


Got it - do you think with a bit more tuning the feature death at larger scale could be eliminated, or would it be tough to manage with the reinitialization approach?


Makes sense that the shift would be helpful.
