The Problem with Tokenization in LLMs

Jun 5, 2023

Before text is sent to the LLM for generation, it is tokenized. Tokens are how the model sees the input — single characters, words, parts of words, or other segments of text or code. Each model does this step differently. For example, GPT models use Byte Pair Encoding (BPE).

Tokens get assigned an id in the tokenizer’s vocabulary, a numerical identifier that ties the number to the corresponding string. For example, “Matt” is encoded as a single token [13448] in GPT. “Matt Rickard” is encoded as three tokens, “Matt”, “ Rick”, “ard”, with ids [13448, 8759, 446] (plausibly because “Matt” is common enough to earn its own entry in GPT-3’s roughly 50,000-token vocabulary, but “Rickard” is not).
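You can see this with OpenAI’s tiktoken library. A minimal sketch, assuming the GPT-3-era r50k_base encoding (exact ids vary by encoding):

```python
# pip install tiktoken
import tiktoken

# r50k_base is the GPT-3-era BPE encoding; newer models use different encodings.
enc = tiktoken.get_encoding("r50k_base")

print(enc.encode("Matt"))          # a single token id
print(enc.encode("Matt Rickard"))  # several ids
# Decode each id individually to see how the string was split.
print([enc.decode([t]) for t in enc.encode("Matt Rickard")])
```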

Cases are treated separately. Different casings of the same word map to different tokens. “hello” is token [31373], “Hello” is [15496], and “HELLO” is three tokens [13909, 3069, 46] (“HE”, “LL”, “O”).
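A quick way to check this yourself, again assuming the r50k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for word in ["hello", "Hello", "HELLO"]:
    ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in ids]
    print(word, ids, pieces)
# Each casing gets different ids, and the all-caps form splits into several tokens.
```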

Digits are chunked inconsistently. The value “380” is tokenized as a single “380” token in GPT. But “381” is represented as two tokens [“38”, “1”]. “382” is again two tokens, while “383” is a single token [“383”]. Some tokenizations of four-digit numbers: [“3000”], [“3”, “100”], [“35”, “00”], [“4”, “500”]. This could be part of why GPT-based models aren’t always great at arithmetic. Subword tokenization also makes them bad at character-level manipulation (e.g., reversing a word).
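A sketch of how to inspect the digit splits (same r50k_base assumption as above; other encodings chunk numbers differently):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for n in ["380", "381", "382", "383", "3000", "3100", "3500", "4500"]:
    pieces = [enc.decode([t]) for t in enc.encode(n)]
    print(n, pieces)
# Neighboring numbers can split into a different number of chunks,
# so the model never sees a consistent digit-level representation.
```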

Trailing whitespace. Some tokens include whitespace. This leads to interesting behavior in both prompts and completions. For example, “once upon a ” with the trailing whitespace is encoded as [“once”, “ upon”, “ a”, “ ”]. However, “once upon a time” is encoded as [“once”, “ upon”, “ a”, “ time”]. Adding the whitespace to your prompt affects the probability that “ time” will be the next token (because “ time” is a single token that includes the leading whitespace).
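You can compare the two encodings directly; a sketch under the same r50k_base assumption, with the splits from the example above shown in the comments:

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

with_space = enc.encode("once upon a ")
without    = enc.encode("once upon a")
print([enc.decode([t]) for t in with_space])  # e.g. ['once', ' upon', ' a', ' ']
print([enc.decode([t]) for t in without])     # e.g. ['once', ' upon', ' a']
# " time" is itself a single token, so a prompt that ends in a bare " " token
# competes with continuations that already begin with a space.
```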

Tokenization is model-specific. Tokenizers have to be trained for different models. Even though LLaMA also uses BPE, its tokens differ from ChatGPT’s. This complicates pre-processing and multi-modal modeling.
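A rough way to see the mismatch is to run the same string through two tokenizers; a sketch assuming tiktoken plus a LLaMA-style SentencePiece tokenizer loaded via Hugging Face (the checkpoint name here is illustrative, substitute whichever LLaMA tokenizer you have access to):

```python
# pip install tiktoken transformers sentencepiece
import tiktoken
from transformers import AutoTokenizer

text = "Matt Rickard"

gpt_enc = tiktoken.get_encoding("r50k_base")
# Illustrative tokenizer checkpoint, not an endorsement of a specific model.
llama_tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

print(gpt_enc.encode(text))    # GPT BPE ids
print(llama_tok.encode(text))  # different ids, and often a different split
```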

You can play around with the tokenizer OpenAI uses for GPT models at platform.openai.com/tokenizer. There’s also some work being done on modeling raw byte sequences directly (MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers).