Mech Interp Puzzle 2: Word2Vec Style Embeddings — AI Alignment Forum

x

Mech Interp Puzzle 2: Word2Vec Style Embeddings — AI Alignment Forum

Code can be found here. No prior knowledge of mech interp or language models is required to engage with this.

Language model embeddings are basically a massive lookup table. The model "knows" a vocabulary of 50,000 tokens, and each one has a separate learned embedding vector.

Visual illustration of word embeddings

But these embeddings turn out to contain a shocking amount of structure! Notably, it's often linear structure, aka word2vec style structure. Word2Vec is a famous result (in old school language models, back in 2013!), that `man - woman == king - queen`. Rather than being a black box lookup table, the embedded words were broken down into independent variables, "gender" and "royalty". Each variable gets its own direction, and the embedded word is seemingly the sum of its variables.

One of the more striking examples of this I've found is a "number of characters per token" direction - if you do a simple linear regression mapping each token to the number of characters in it, this can be very cleanly recovered! (If you filter out ridiculous tokens, like 19979: 512 spaces).

Notably, this is a numerical feature not a categorical feature - to go from 3 tokens to four, or four to five, you just add this direction! This is in contrast to the model just learning to cluster tokens of length 3, of length 4, etc.

Question 2.1: Why do you think the model cares about the "number of characters" feature? And why is it useful to store it as a single linear direction?

There's tons more features to be uncovered! There's all kinds of fundamental syntax-level binary features that are represented strongly, such as "begins with a space".

Question 2.2: Why is "begins with a space" an incredibly important feature for a language model to represent? (Playing around a tokenizer may be useful for building intuition here)

You can even find some real word2vec style relationships between pairs of tokens! This is hard to properly search for, because most interesting entities are multiple tokens. One nice example of meaningful single token entities is common countries and capitals (idea borrowed from Merullo et al). If you take the average embedding difference for single token countries and capitals, this explains 18.58% of the variance of unseen countries! (0.25% is what I get for a randomly chosen vector).

Caveats: This isn't quite the level we'd expect for real word2vec (which should be closer to 100%), and cosine sim only tracks that the direction matters, not what its magnitude is (while word2vec should be constant magnitude, as it's additive). My intuition is that models think more in terms of meaningful directions though, and that the exact magnitude isn't super important for a binary variable.

Question 2.3: A practical challenge: What other features can you find in the embedding? Here's the colab notebook I generated the above graphs from, it should be pretty plug and play. The three sections should give examples for looking for numerical variables (number of chars), categorical variables (begins with space) and relationships (country to capital). Here's some ideas - I encourage you to spend time brainstorming your own!

Is a number
How frequent is it? (Use pile-10k to get frequency data for the pile)
Is all caps
Is the first token of a common multi-token word
Is a first name
Is a function word (the, a, of, etc)
Is a punctuation character
Is unusually common in German (or language of your choice)
The indentation level in code
Relationships between common English words and their French translations
Relationships between the male and female version of a word

Please share your thoughts and findings in the comments! (Please wrap them in spoiler tags)