All of StefanHex's Comments + Replies

Nice work! I'm especially impressed by the [word] and [word] example: This cannot be read-off the embeddings, thus the model must be actually computing and storing this feature somewhere! I think this is exciting since the things we care about (deception etc.) are also definitely not included in the embeddings. I think you could make a similar case for Title Case and Beginning & End of First Sentence but those examples look less clear, e.g. the Title Case could be mostly stored in "embedding of uppercase word that is usually lowercase".

1Logan Riggs Smith5mo
Actually any that are significantly effected in "Ablated Text" means that it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is True in the StackExchange & Last Name one (though only ~50% of activation for last-name, will still recognize last names by themselves but not activate as much). The Beginning & End of First Sentence actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), but I haven't rigorously studied this.

Thank you for making the early write-up! I'm not entirely certain I completely understand what you're doing, could I give you my understanding and ask you to fill the gaps / correct me if you have the time? No worries if not, I realize this is a quick & early write-up!


As previously you run Pythia on a bunch of data (is this the same data for all of your examples?) and save its activations.
Then you take the residual stream activations (from which layer?) and train an autoencoder (like Lee, Dan & beren here) with a single hidden layer (w/ ReLU)

... (read more)
1Logan Riggs Smith5mo
Setup: Model: Pythia-70m (actually named 160M!) Transformer lens: "blocks.2.hook_resid_post" (so layer 2) Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post) Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric) Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2]) Logit Lens: The decoder here had 2k feature directions. Each direction is size d_model, so you can directly unembed the feature direction (e.g. the German Feature) you're looking at. Additionally I subtract out several high norm tokens from the unembed, which may be an artifact of the pythia tokenizer never using those tokens (thanks Wes for mentioning this!) Ablated Text: Say the default feature (or neuron in your words) activation of Token_pos 10 is 5, so you can remove all tokens from 0 to 10 one at a time and see the effect on the feature activation. I select the token pos by finding the max feature activating position or the uniform one described above. This at least shows some attention head dependencies, but not more complicated ones like (A or B... C) where removing A or B doesn't effect C, but removing both would. [Note: in the examples, I switch between showing the full text for context & showing the partial text that ends on the uniformly-selected token]  

Hi, and thanks for the comment!

Do you think there should be a preference to the whether one patches clean --> corrupt or corrupt --> clean?

Both of these show slightly different things. Imagine an "AND circuit" where the result is only correct if two attention heads are clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt) you will not find this; but you do if you patch corrupt->clean. However the opposite applies for a kind of "OR circuit". I historically had more success with corrupt->clean s... (read more)

Thank for for the extensive comment! Your summary is really helpful to see how this came across, here's my take on a couple of these points:

2.b: The network would be sneaking information about the size of the residual stream past LayerNorm. So the network wants to implement an sort of "grow by a factor X every layer" and wants to prevent LayerNorm from resetting its progress.

  1. There's the difference between (i) How does the model make the residual stream grow exponentially -- the answer is probably theory 1, that something in the weights grow exponentially
... (read more)
4Alex Turner7mo
Although -- naive speculation -- the deletion-by-magnitude theory could enforce locality in what layers read what information, which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability. (More trying to gesture at some soft "locality" constraint, rather than make a confident / crisp claim in this comment.)