This is a special post for short-form writing by Sam Marks. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

Somewhat related to the SolidGoldMagikarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:

[Table of nearest-neighbor tokens for each random vector.] Here $v_1, v_2, v_3$ are random vectors in embedding space (drawn from $\mathcal{N}(0, I_{d_{\mathrm{emb}}})$), and the columns give the 10 tokens whose embeddings are most cosine-similar to $v_i$. I used GPT-2-large.
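The lookup itself is just a cosine-similarity argmax over the embedding matrix. A minimal sketch of that computation, with a random Gaussian matrix standing in for GPT-2-large's actual embedding table (which you could get from HuggingFace as `GPT2LMHeadModel.from_pretrained("gpt2-large").transformer.wte.weight`, shape roughly [50257, 1280], decoding the returned indices with the tokenizer) and a scaled-down vocabulary so it runs quickly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embedding table. In practice, substitute the real
# GPT-2-large embedding matrix and decode indices with its tokenizer.
n_vocab, d_emb = 5_000, 1280
E = rng.standard_normal((n_vocab, d_emb))

def nearest_tokens(v: np.ndarray, E: np.ndarray, k: int = 10):
    """Return (indices, cosine sims) of the k rows of E most cosine-similar to v."""
    sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v))
    top = np.argsort(-sims)[:k]
    return top, sims[top]

v = rng.standard_normal(d_emb)  # a random direction, like the v_i above
idx, sims = nearest_tokens(v, E)
```

Even the best matches have small cosine similarity for these sizes (roughly 0.1), which previews the near-orthogonality point made later in the post.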

Perhaps 20% of the time, we get something like $v_3$, where many of the nearest neighbors share some semantic feature (in this case, being present-tense verbs in the third person singular).

But most of the time, we get things that look like $v_1$ or $v_2$: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" at random from the $v_2$ column, the cosine similarity between the embeddings of these two tokens is 0.06.
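That 0.06 figure is about what you'd expect for two unrelated directions in 1280 dimensions. A quick stand-in check (random Gaussian vectors playing the role of the " female" and " alt" embedding rows):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for two token-embedding rows; in practice these would be rows
# of GPT-2-large's embedding matrix, indexed via its tokenizer.
a = rng.standard_normal(1280)
b = rng.standard_normal(1280)

cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_ab)  # near zero, comparable to the 0.06 observed for real tokens
```

For independent Gaussian vectors in dimension $d$, the cosine similarity has standard deviation about $1/\sqrt{d} \approx 0.028$ here, so values like 0.06 are unremarkable.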

[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector $v \in \mathbb{R}^{1280}$, we should typically find that $v$ is ~orthogonal to all of the ~50,000 token embeddings. Moreover, asking whether the nearest neighbors to $v$ should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters $S_1, \ldots, S_n$; then compare the distribution of intra-cluster variances $\{\operatorname{Var}_{w \in S_i}(\langle w, v \rangle_{\cos})\}_{i=1}^{n}$ to the distribution of cosine similarities of the cluster means $\{\langle \mathbb{E}_{w \in S_i}[w], v \rangle_{\cos}\}_{i=1}^{n}$. From the perspective of cosine similarity to $v$, we should expect these clusters to look basically randomly drawn from the full dataset $S = \bigcup_{i=1}^{n} S_i$, so that each variance in the former set should be $\approx \operatorname{Var}_{w \in S}(\langle w, v \rangle_{\cos})$. This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to $v$ to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.

There's a little bit of code for playing around with this here.
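The two robust pieces of the intuition above can be checked numerically with random Gaussian vectors standing in for token embeddings (sizes below are made up for illustration, roughly matching GPT-2-large's 1280-dim embeddings with a scaled-down vocabulary): a random $v$ is ~orthogonal to everything, and its nearest neighbors land in many different clusters rather than one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "token embeddings" and a random query direction v.
n_tokens, d, n_clusters = 10_000, 1280, 100
W = rng.standard_normal((n_tokens, d))
v = rng.standard_normal(d)

cos = (W @ v) / (np.linalg.norm(W, axis=1) * np.linalg.norm(v))

# 1) v is ~orthogonal to everything: even the largest |cosine| is tiny.
print(np.abs(cos).max())  # on the order of 0.1

# 2) Assign tokens to random "semantic clusters"; the 10 nearest
#    neighbors of v are spread across many distinct clusters.
cluster_of = rng.integers(0, n_clusters, size=n_tokens)
top10 = np.argsort(-cos)[:10]
print(len(set(cluster_of[top10])))  # close to 10 distinct clusters
```

Under this random model the clusters really are random subsets, so the nearest neighbors are a hodgepodge by construction; the empirical observation in the post is that real semantic clusters mostly behave the same way.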