TL;DR: There are anomalous tokens for GPT-3.5 and GPT-4 which are difficult or impossible for the model to repeat; try playing around with SmartyHeaderCode, APolynomial, or  davidjl. There are also plenty which can be repeated but are difficult for the model to spell out, like edTextBox or legalArgumentException.

A couple of months ago, Jessica Rumbelow and mwatkins posted about anomalous tokens that cause GPT-2 and GPT-3 to fail. Those anomalous tokens don't cause the same failures on newer models, such as GPT-3.5 Default or GPT-4 on the ChatGPT website, or gpt-3.5-turbo over the API, because the newer models use a different tokenizer. Very briefly, a tokenizer has a large vocabulary of tokens, and it encodes ordinary text as a sequence of symbols from that vocabulary.

For example, the string Hello world! gets encoded by the GPT-2 tokenizer as the sequence [15496, 995, 0], meaning that it's a sequence of three tokens, the first of which is the 15,496th token of the vocabulary, or Hello, the second of which is the 995th token of the vocabulary, or  world, and the third of which is the 0th token of the vocabulary, or !. In general, a long string being represented by a single token implies that that string appears a lot in the training set (or whatever corpus was used to build the tokenizer), because otherwise it wouldn't have been "worth it" to give that string its own token.
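If you want to check this yourself, OpenAI's tiktoken library exposes these encodings; here's a minimal sketch, assuming tiktoken is installed:

import tiktoken
enc = tiktoken.get_encoding("r50k_base")  # the GPT-2 / GPT-3 tokenizer
ids = enc.encode("Hello world!")
print(ids)                             # [15496, 995, 0]
print([enc.decode([i]) for i in ids])  # ['Hello', ' world', '!']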

Because of the change in tokenizers, almost all of the tokens which produce anomalous behavior in GPT-2 and GPT-3 don't produce anomalous behavior in the later models, because rather than being a single weird token, they're broken up into many, more normal tokens. For example,  SolidGoldMagikarp was encoded as the single token  SolidGoldMagikarp by the old tokenizer, but is encoded as five tokens by the new tokenizer: [' Solid', 'Gold', 'Mag', 'ik', 'arp']. Each of those five tokens is normal and common, so GPT-4 handles them just fine.
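Again with tiktoken, a quick way to see the difference between the two tokenizers on this string (a sketch, assuming tiktoken is installed):

import tiktoken
old = tiktoken.get_encoding("r50k_base")    # GPT-2 / GPT-3 tokenizer
new = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 tokenizer
s = " SolidGoldMagikarp"
print([old.decode([i]) for i in old.encode(s)])  # [' SolidGoldMagikarp'] -- one token
print([new.decode([i]) for i in new.encode(s)])  # [' Solid', 'Gold', 'Mag', 'ik', 'arp']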

Also, conveniently, tokenizers are released to the public, and their vocabularies generally seem to be ordered, with earlier tokens being shorter, more common, and more ordinary, and later tokens being longer, less common, and weirder.

The old tokenizer, r50k_base, used a vocabulary of about 50,000 tokens and was used by GPT-2 and GPT-3 (and possibly GPT-3.5 Legacy?). The new tokenizer, used by GPT-3.5 Default and GPT-4, is called cl100k_base and has a vocabulary of about 100,000 tokens. Unfortunately, we can't straightforwardly repeat the experiment that Jessica Rumbelow and mwatkins ran, of running k-means clustering on the model's embedding matrix, because (to my knowledge) we don't have access to the embedding matrix of the newer models. Instead, however, we can just look at the later tokens in the cl100k_base vocabulary and try messing around with each of them; the later tokens, being longer, rarer, and weirder, are easier to use to create prompts that are far from the model's training distribution. 

To give a sense of what a completely random sample of late-vocabulary cl100k_base tokens looks like, here are tokens 98,000 through 98,020:

['.Cdecl', 'InstantiationException', ' collage', ' IOC', ' bais', ' onFinish', '-stars', 'setSize', 'mogul', ' disillusion', ' chevy', '(Schedulers', '(IR', '_locs', ' cannons', ' cancelling', '/bus', ' bufio', ' Yours', ' Pikachu', ' terme']
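That sample comes straight from the public encoding; here's a sketch of how to reproduce it with tiktoken (decode_single_token_bytes handles tokens that aren't valid UTF-8 on their own):

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # vocabulary size, roughly 100k
tokens = [
    enc.decode_single_token_bytes(i).decode("utf-8", errors="replace")
    for i in range(98000, 98021)
]
print(tokens)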

I searched through tokens 98,000 through 99,999 in the cl100k_base vocabulary. I focused on just the tokens that contained only Roman-alphabet characters and spaces, to avoid confusing the model for uninteresting reasons (like asking it to repeat the string ("");, which contains enough punctuation that it might fail simply because it misunderstands where the string starts and stops). I mostly tried two different kinds of prompts: one that asks GPT to repeat the token or a string containing the token, and another that asks GPT to "spell out" the token or a string containing the token, separating each letter in the token with a hyphen.

Of these two challenges, the latter is generally much more difficult for GPT: the former only requires it to repeat a token that appears earlier in the prompt, whereas the latter requires it to know what letters the token is made of, and for certain weird tokens it may never have seen the token in a context that would let it infer its spelling. Of course, for most late-vocab tokens GPT gets both challenges perfectly; it's only a few tokens that give it problems.
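For concreteness, the two challenges look roughly like this (illustrative wording, not the exact prompts; the reference spelling for a token is just its letters joined by hyphens):

def repeat_prompt(token: str) -> str:
    return f'Please repeat the following string back to me: "{token}"'
def spell_prompt(token: str) -> str:
    return (f'Please spell out the following string, '
            f'separating each letter with a hyphen: "{token}"')
def reference_spelling(token: str) -> str:
    return "-".join(token.strip())  # e.g. "edTextBox" -> "e-d-T-e-x-t-B-o-x"
print(spell_prompt("edTextBox"))
print(reference_spelling("edTextBox"))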

You can make a rough taxonomy of the tokens that GPT has any problems with, which I'll order roughly from "most anomalous" to "least anomalous".

Category A: Unspeakable tokens

Examples: SmartyHeaderCode, APolynomial, and  davidjl (these are the only three I could find from among tokens 98,000 to 99,999)

GPT-4 mostly seems to treat these tokens as though they don't exist at all, whereas GPT-3.5 Default is often more "creative" about what they might mean.

GPT-3.5 Default: [screenshot]

GPT-4: [screenshot]

GPT-3.5 Default: [screenshot]

Note that  davidjl was also an anomalous token in r50k_base, interestingly (the other two were not).

Category B: "Programming" tokens

Examples of tokens in this category include RenderingContext, getVersion, pageInfo, CLLocationCoordinate, openFileDialog, MessageBoxButton, edTextBox, legalArgumentException, and NavigationItemSelectedListener. There are many more examples than just those; tokens like these are very common. GPT is generally capable of repeating them just fine, but is inconsistently able to spell them out. Sometimes, it comes up with very "creative" spellings:

Sometimes, it makes only a minor mistake:

Asking GPT to spell the unspeakable tokens sometimes leads to creative completions:

Interestingly, though, GPT-3.5 Default spells "APolynomial" just fine:

Also, for most of the tokens in this category, the typical case is that GPT spells them perfectly; I'm only highlighting the failures because they're more interesting. Here's an example of a garden-variety success:

Category C: Specific word suffixes

Examples of tokens in this category include innitus, ournemouth, and antasy (there are plenty more). GPT-3.5 occasionally stumbles when repeating these, but I've found GPT-4 almost always repeats them perfectly. Both models occasionally get the spelling wrong, though never (that I've seen) in a "weird" way; they just tend to spell the token as though it were the full word (which makes sense; the only time they'll ever have seen it spelled is when it's completing the full word).

Sometimes the model spells it perfectly:

I wrote some code to search through these tokens systematically. Unfortunately, the code has a fair number of false positives (tokens it flags as behaving anomalously but which are actually fine; the model merely failed to repeat the string for an uninteresting reason). Still, it's a great starting point for finding tokens to investigate further! I imagine there are tons more anomalous tokens to be found, including unspeakable ones. I only looked at tokens 98,000 through 99,999, so any other randomly-picked slice of cl100k_base is likely to yield lots of interesting stuff.
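A sketch of this kind of search might look like the following (illustrative, not the exact script; it assumes the tiktoken and openai Python packages, with "gpt-4" as a stand-in model name):

import re
import tiktoken
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")
def candidate_tokens(start=98000, end=100000):
    # Yield tokens in [start, end) made only of Roman letters and spaces.
    for i in range(start, end):
        s = enc.decode_single_token_bytes(i).decode("utf-8", errors="replace")
        if re.fullmatch(r"[A-Za-z ]+", s):
            yield i, s
def try_repeat(token_str):
    prompt = f'Please repeat the following string back to me: "{token_str}"'
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
suspects = []
for i, s in candidate_tokens():
    reply = try_repeat(s)
    if s.strip() not in (reply or ""):  # crude check -- expect false positives
        suspects.append((i, s, reply))
print(suspects)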

Speculative aside: longer anomalous strings

I think it's probably the case that a lot of what's driving GPT's anomalous behavior on some of these late-vocab tokens is simply that a lot of the prompts that include them are really different from how those tokens have appeared in training. For example, probably the token innitus very nearly always appears after a "t" in training, so when we give it a prompt where it doesn't appear after a "t", the model does something weird or incorrect.

This implies that the main driver of weirdness here isn't really about individual tokens per se, but rather about what happens when an extremely unlikely string gets passed to the model. I'd conjecture that there exist strings many tokens in length that, taken together, are unlikely enough that the model will refuse to repeat them, even if each token individually is repeatable. When we pass in one of these anomalous tokens, I'd guess we're just taking an easy route to a very unlikely string; longer, even unlikelier strings made up of individually likelier tokens (but which, written together in that specific sequence, are incredibly unlikely) probably also exist.

For example, using GPT-2 for prompt completion, it's straightforward to take the argmax() at the end of the model and replace it with an argmin() (i.e., ask it to complete the prompt with the least likely token instead of the likeliest token). When you do that, you tend to get a lot of  petertodds and the like, which will obviously be difficult for the model to repeat. However, if you take it a step further and ask for the least likely token from among all tokens that aren't anomalous, you get longer strings like:

Archdemoniferationarnaev mosqu conflicOrderableloads Asukaoppable

When prompted to repeat the above, GPT-2 struggles.

Prompt:

Request: Please repeat the following string back to me: "sasquatch". Reply: "sasquatch". Request: Please repeat the following string back to me: "abc def ghi". Reply: "abc def ghi". Request: Please repeat the following string back to me: " Archdemoniferationarnaev mosqu conflicOrderableloads Asukaoppable". Reply: "

Completion:

Ajax.com mosquOrderableloads Asukaoppable".

And it's not because GPT-2 is unable or unwilling to solve this problem; prompted with an equally-long (and much longer by token count!) string, GPT-2 does just fine.

Prompt:

Request: Please repeat the following string back to me: "sasquatch". Reply: "sasquatch". Request: Please repeat the following string back to me: "abc def ghi". Reply: "abc def ghi". Request: Please repeat the following string back to me: "OhbVrpoiVgRV IfLBcbfnoGMbJmTPSIAoCLrZ aWZkSBvrjnWvgfygw wMqZcUDIh yfJs ON  ". Reply: "

Completion:

OhbVrpoiVgRV IfLBcbfnoGMbJmTPSIAoCLrZ aWZkSBvrjnWvgfygw wMqZcUDIh yfJs ON  ".

This is suggestive. Note, too, that the described approach for generating the "very unlikely" string was extremely naive: just pick the unlikeliest token to come next each time. That kind of "greedy" algorithm probably won't get anywhere near the unlikeliest possible string of n tokens, which would in principle require considering all (vocab size)^n possible sequences of n tokens.
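To make the argmin trick described above concrete, here's a minimal sketch using the HuggingFace transformers GPT-2 model (illustrative, not the exact code used for the completions above; ANOMALOUS_IDS is a hypothetical placeholder for token ids you want to rule out):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
ANOMALOUS_IDS = []  # hypothetical: ids of known anomalous tokens to exclude
prompt = "Request: Please repeat the following string back to me:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits[0, -1]    # next-token logits
        for tid in ANOMALOUS_IDS:                  # rule out anomalous tokens
            logits[tid] = float("inf")
        next_id = torch.argmin(logits).view(1, 1)  # argmin instead of argmax
        input_ids = torch.cat([input_ids, next_id], dim=1)
print(tokenizer.decode(input_ids[0]))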

Unfortunately, I haven't yet really found any longer strings that GPT-4 fails to repeat that contain an unspeakable token. The problem is harder for GPT-4, since we don't have access to the model, so we can't directly ask it what tokens are unlikely to come next; we have to guess. My best guesses for which longer strings GPT-4 might have trouble handling (like taking a "prefix token" like  blasph and grafting it to a "suffix token" like ournemouth) haven't led to super impressive results. Taken alone, GPT-4 has no trouble repeating  blasphournemouth; as part of a larger context that makes GPT-4 strongly anticipate a word like "blasphemes," you can make it trip up a little:

Despite not having direct access to GPT-4, I think there's a lot more promising stuff to be done here. For starters, one could train a smaller model than GPT-4 using the cl100k_base tokenizer, ask the smaller model what the least likely continuations look like, and feed those least likely continuations as strings to GPT-4 to see what happens; there are probably lots more ideas in this vein. I speculate that the very weirdest stuff we'll be able to make GPT-4 do in a year will involve much-longer-than-single-token strings.

Comment:

How are OpenAI training these tokenizers?! I'm surprised they still have weird esoteric tokens like these in there, when presumably there are, e.g., a bunch of words that would be worth learning.