This is a special post for quick takes by Sam Marks. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Some updates about the dictionary_learning repo:

  • The repo now has support for ghost grads. h/t g-w1 for submitting a PR for this
  • ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
  • ActivationBuffers can now be stored on the GPU.
  • The file contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating dictionaries people send to me.
  • New convenience: you can do reconstructed_acts, features = dictionary(acts, output_features=True) to get both the reconstruction and the features computed by dictionary.
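
To make the tuple handling concrete, here is a minimal sketch of the "take the first component, and keep doing so for nested tuples" rule described in the list above. It is an illustration only: `first_component` is a hypothetical helper name, not the repo's actual code, and plain Python strings stand in for activation tensors.

```python
def first_component(output):
    """Return the first non-tuple element reached by repeatedly taking index 0.

    Hypothetical helper for illustration; not taken from dictionary_learning.
    """
    # Some model components return (hidden_states, extras...) rather than a
    # bare tensor, and the tuples can be nested, so keep descending.
    while isinstance(output, tuple):
        output = output[0]
    return output


# Strings stand in for tensors here.
nested = (("resid_acts", "cache"), "attn_weights")
print(first_component(nested))      # resid_acts
print(first_component("mlp_acts"))  # mlp_acts
```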

Also, if you'd like to train dictionaries for many model components in parallel, you can use the parallel branch. I don't promise to never make breaking changes to the parallel branch, sorry.

Finally, we've released a new set of dictionaries for the MLP outputs, attention outputs, and residual stream in all layers of Pythia-70m-deduped. The MLP and attention dictionaries seem pretty good, and the residual stream dictionaries seem like a mixed bag. Their stats can be found here.

Somewhat related to the SolidGoldMagikarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:

v_1                 v_2             v_3
 characterized       Columb          determines
 Stra                1900           conserv
 Ire                 sher            distinguishes
sent                 paed            emphasizes
 Shelter             000             consists
 Pil                mx               operates
stro                 female          independent
 wired               alt             operate
 Kor                GW               encompasses
 Maul                lvl             consisted

Here v_1, v_2, v_3 are random vectors in embedding space (drawn from ), and the column under each v_i gives the 10 tokens whose embeddings are most cosine-similar to v_i. I used GPT-2-large.

Perhaps 20% of the time, we get something like v_3, where many of the nearest neighbors share some semantic property (in this case, being present-tense verbs in the 3rd person singular).

But most of the time, we get things that look like v_1 or v_2: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the v_2 column, the cosine similarity between the embeddings of these tokens is 0.06.
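
For anyone who wants to reproduce this kind of lookup, here is a rough sketch of the nearest-neighbor search in plain Python. Everything specific in it is an assumption for illustration: the vocabulary and embeddings are randomly generated stand-ins (not GPT-2-large's), the random vector is drawn from a standard Gaussian, and `cos_sim` and `nearest_tokens` are hypothetical helper names. To get results like the table above, you would swap in the model's real token embedding matrix and vocabulary.

```python
import math
import random


def cos_sim(x, y):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def nearest_tokens(v, embeddings, vocab, k=10):
    """Return the k (token, similarity) pairs most cosine-similar to v."""
    scored = [(cos_sim(e, v), tok) for e, tok in zip(embeddings, vocab)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(tok, sim) for sim, tok in scored[:k]]


# Toy stand-in for a real embedding matrix (assumption: 1000 made-up tokens
# in 64 dimensions; a real model would have ~50000 tokens).
rng = random.Random(0)
dim = 64
vocab = [f"tok_{i}" for i in range(1000)]
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in vocab]

# Assumption: the random direction is drawn from a standard Gaussian.
v = [rng.gauss(0, 1) for _ in range(dim)]

for tok, sim in nearest_tokens(v, embeddings, vocab):
    print(f"{tok!r:>10}  {sim:+.3f}")
```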

[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector v, we should typically find that v is ~orthogonal to all of the ~50000 token embeddings. Moreover, asking whether the nearest neighbors to v should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters C_1, ..., C_n; then compare the distribution of intra-cluster variances {Var_{x in C_i} cos_sim(x, v)} to the distribution of cosine similarities of the cluster means {cos_sim(mean_{x in C_i} x, v)}. From the perspective of cosine similarity to v, we should expect these clusters to look basically randomly drawn from the full dataset C, so that each variance in the former set should be ≈ Var_{x in C} cos_sim(x, v). This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to v to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.
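
The comparison in the paragraph above can be played with on synthetic data. The sketch below is a toy under stated assumptions, not evidence about GPT-2-large: clusters are Gaussian blobs around random centers, the `noise` parameter (my invention) sets how spread out each cluster is, and `simulate_clusters` is a hypothetical name. It reports the mean intra-cluster variance of cosine similarity to v, the variance over the full dataset, the mean cosine similarity of the cluster means to v, and how many distinct clusters appear among the top-k neighbors of v.

```python
import math
import random


def cos_sim(x, y):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((val - mean) ** 2 for val in values) / len(values)


def simulate_clusters(num_clusters=50, per_cluster=20, dim=128,
                      noise=1.0, k=10, seed=0):
    """Compare intra-cluster spread with cluster-mean similarity to a random v."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]  # random direction (assumed Gaussian)
    sims_by_cluster = []  # cosine similarity to v for each member, per cluster
    mean_sims = []        # cosine similarity of each cluster mean to v
    for _ in range(num_clusters):
        center = [rng.gauss(0, 1) for _ in range(dim)]
        members = [[c + noise * rng.gauss(0, 1) for c in center]
                   for _ in range(per_cluster)]
        sims_by_cluster.append([cos_sim(x, v) for x in members])
        cluster_mean = [sum(col) / per_cluster for col in zip(*members)]
        mean_sims.append(cos_sim(cluster_mean, v))

    # Pair every similarity with its cluster id, then take the k nearest tokens.
    all_sims = [(s, i) for i, sims in enumerate(sims_by_cluster) for s in sims]
    top_k = sorted(all_sims, reverse=True)[:k]
    return {
        "mean_intra_cluster_variance":
            sum(variance(s) for s in sims_by_cluster) / num_clusters,
        "full_dataset_variance": variance([s for s, _ in all_sims]),
        "mean_cluster_mean_cos_sim": sum(mean_sims) / num_clusters,
        "distinct_clusters_in_top_k": len({i for _, i in top_k}),
    }


for name, value in simulate_clusters().items():
    print(f"{name}: {value}")
```

Varying `noise` changes how tight the synthetic clusters are, so it is worth trying a few values before reading anything into the top-k cluster count.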

There's a little bit of code for playing around with this here.