A Complete Overview of Tokenization

Let us discuss everything about one of the greatest challenges that Large Language Models (LLMs) face: tokenization.

What Exactly is Tokenization?

Tokenization is the process of breaking text down into smaller parts called ‘tokens’, while a ‘tokenizer’ is the tool or algorithm that performs this splitting.

Text can be tokenized into characters or words or even subwords — we’ll be exploring each of these and more in detail below.

Why do we Tokenize?

Raw, unstructured text has no form that a computer can work with directly. Splitting the text into smaller tokens and assigning a numerical ID to each unique token gives the computer an interpretable representation of that text.
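
As a minimal sketch of this idea (the vocabulary here is purely illustrative, not one used by any real model), we can map each unique token to an integer ID-

# Build a toy vocabulary that maps each unique token to an integer ID
tokens = ["Sample", "text", "for", "tokenization", "!"]
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]

print(vocab)  # {'!': 0, 'Sample': 1, 'for': 2, 'text': 3, 'tokenization': 4}
print(ids)    # [1, 3, 2, 4, 0]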

All tokens in a given text are processed in parallel by the Transformer architecture, which is the base architecture for LLMs; this parallel processing increases processing speed. The tokens are also combined with positional embeddings, which let the model keep track of the order of the tokens in the text and get a sense of the context around each one. Capturing context is extremely important, since a change in context can drastically change the meaning of a word or a sentence.

Different Tokenization Techniques

  • Character Tokenization

In this method, we split the text into its individual characters. In most programming languages a string is already just a sequence of characters, so this method is very easy to code-

text = "Sample text for tokenization!"tokens = list(text)print(tokens)
['S', 'a', 'm', 'p', 'l', 'e', ' ', 't', 'e', 'x', 't', ' ', 'f', 'o', 'r', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '!']

The problem with this type of tokenization is that each token is a single character, and its neighbouring tokens are single characters too; the same characters appear in countless different words, so the model gets almost no sense of context from them.

  • Word Tokenization

In this technique, the text is split into words on whitespace, and the whitespace itself is discarded-

tokens = text.split()
print(tokens)
['Sample', 'text', 'for', 'tokenization!']

As we can see from the above example, this approach is usually better than the previous one, because whole words as tokens provide more context, although, of course, a word can still have different meanings depending on the sentence it appears in.

We can also see that symbols such as ‘!’ remain attached to a word, which is not desirable: the symbol has a meaning of its own and can change the meaning of any word it is attached to, and we don’t want a separate, poorly learned token for every other word that happens to carry the same symbol. To solve this, we split all symbols in the text into separate tokens-

['Sample', 'text', 'for', 'tokenization', '!']
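
One simple way to reproduce the token list above is with a regular expression; this is just a sketch using Python’s built-in re module, not what any particular tokenizer library does internally-

import re

text = "Sample text for tokenization!"
# Keep runs of word characters together and emit punctuation as separate tokens
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Sample', 'text', 'for', 'tokenization', '!']
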
  • Subword Tokenization

The basic idea of subword tokenization is that words in a text can be split further into subwords, where, for example, one part is a meaningful base word and the other is a frequently appearing substring such as a suffix.

The most popular tokenizer that uses this type of algorithm is called ‘WordPiece’. Let’s see a popular WordPiece-based tokenizer, the one used by the DistilBERT model, in action through the code below-

from transformers import DistilBertTokenizer

model = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model)
encoded_text = tokenizer(text)
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
['[CLS]', 'sample', 'text', 'for', 'token', '##ization', '!', '[SEP]']

As we can see, the word ‘tokenization’ is split into two tokens, ‘token’ and ‘##ization’. This is because ‘token’ is a meaningful base word, whereas ‘ization’ is a substring that appears with many different words and, much like a symbol, changes the meaning of whatever it is attached to. The ‘##’ in ‘##ization’ indicates that this token is attached to the token before it.

Another thing we notice is the addition of two new tokens, ‘[CLS]’ and ‘[SEP]’. These are special tokens added by this particular tokenizer to mark the start and end of the text respectively, so as to make interpretation easier for the model.

  • Byte-Pair Encoding (BPE)

The first thing this algorithm does is apply a pre-tokenization step that splits the text into unique words and counts the frequency of each one. Example-

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Now, the technique creates a base vocabulary for itself from all the unique characters: [“b”, “g”, “h”, “n”, “p”, “s”, “u”].

BPE then takes into account how often characters appear next to each other. To understand what it does next, let’s split all the unique words into their characters-

("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4),("h" "u" "g" "s", 5)

The algorithm then looks for the most frequently occurring contiguous pair of symbols. In the example above, ‘u’ followed by ‘g’ appears most often (10 + 5 + 5 = 20 times), so the pair is merged into ‘ug’, which is also added to the base vocabulary-

("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4),("h" "ug" "s", 5)

Starting from this updated vocabulary, the merging process continues in the same way up to a certain threshold set by the programmer, such as a target vocabulary size or number of merges.

The final vocabulary is then applied to any text the tokenizer encounters. For example, ‘bugs’ would be tokenized into [‘b’, ‘ug’, ‘s’], whereas ‘mugs’ would be tokenized into [‘<unk>’, ‘ug’, ‘s’].

As we can see, the letter ‘m’ was replaced by the ‘<unk>’ (unknown) token, because ‘m’ was not present anywhere in the base vocabulary. The same happens for any character the tokenizer has never seen.
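
To make the procedure concrete, here is a minimal sketch of the merge loop described above (the word frequencies are the toy example from earlier, and the number of merges is an arbitrary choice rather than anything a production tokenizer uses)-

from collections import Counter

# Word frequencies obtained from the pre-tokenization step
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
# Start with every word split into its characters
splits = {word: list(word) for word in word_freqs}
vocab = sorted({ch for word in word_freqs for ch in word})

num_merges = 3  # the threshold chosen by the programmer
for _ in range(num_merges):
    # Count every contiguous pair of symbols, weighted by word frequency
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for i in range(len(symbols) - 1):
            pair_counts[(symbols[i], symbols[i + 1])] += freq
    if not pair_counts:
        break
    # Merge the most frequent pair and add the new symbol to the vocabulary
    best = max(pair_counts, key=pair_counts.get)
    merged = "".join(best)
    vocab.append(merged)
    for word, symbols in splits.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        splits[word] = new_symbols

print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
print(splits)  # {'hug': ['hug'], 'pug': ['p', 'ug'], 'pun': ['p', 'un'], 'bun': ['b', 'un'], 'hugs': ['hug', 's']}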

Byte-level BPE is the variant used by GPT-2: instead of building the base vocabulary from all possible Unicode characters, it uses bytes (0–255). Every character can be represented as a sequence of these bytes, so the base vocabulary is capped at 256 symbols while still covering any text.
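
A quick way to see why 256 base symbols are enough is to look at the UTF-8 bytes behind a string-

# Any Unicode string decomposes into bytes in the 0-255 range,
# so 256 base symbols can cover all possible text
text = "café 🙂"
print(list(text.encode("utf-8")))
# [99, 97, 102, 195, 169, 32, 240, 159, 153, 130]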

  • Unigram Tokenization

This algorithm works in the opposite direction: it starts from a large initial vocabulary and trims it down along the way. At each training step, the model computes a loss over the training data given the current vocabulary, and removes the tokens whose removal increases that loss the least. The process is repeated until the vocabulary reaches the desired size.

The method makes sure to retain all the base characters so that any word can still be tokenized.
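
To see a Unigram-based tokenizer in action, we can use XLNet’s tokenizer from the transformers library, which is built on a SentencePiece Unigram model (the exact output depends on the learned vocabulary, so the tokens shown in the comment are only indicative)-

from transformers import XLNetTokenizer

# XLNet's tokenizer is a SentencePiece model trained with the Unigram algorithm
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Sample text for tokenization!"))
# e.g. ['▁Sample', '▁text', '▁for', '▁token', 'ization', '!']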

Popular Tokenizer Libraries

  • SentencePiece

SentencePiece is a library developed by Google for general-purpose tokenization. It supports the Unigram, BPE, word, and character methods, and it is set to Unigram by default.

It differs from the plain methods above in that SentencePiece also incorporates whitespace into the tokens, marking it with the ‘▁’ symbol; this makes it possible to reconstruct the original text exactly from the tokens.

Let’s use a pre-trained BPE model with SentencePiece-

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('en.wiki.bpe.vs1000.model')
tokens = sp.encode_as_pieces(text)
print(tokens)
['▁', 'S', 'amp', 'le', '▁te', 'x', 't', '▁for', '▁to', 'k', 'en', 'iz', 'ation', '!']
  • TikToken

TikToken is both a tokenization algorithm and an open-source library developed by OpenAI. It is a fast BPE tokenizer made specifically for use with GPT models, and because it is optimised for this particular family of models, it is considerably faster than most general-purpose BPE tokenizers.
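
A short sketch of using the library (the encoding name 'gpt2' selects GPT-2’s byte-level BPE vocabulary; other encodings cover newer GPT models)-

import tiktoken

# Load the byte-level BPE encoding used by GPT-2
enc = tiktoken.get_encoding("gpt2")
text = "Sample text for tokenization!"
ids = enc.encode(text)
print(ids)
print([enc.decode([i]) for i in ids])  # the text piece behind each token ID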

Challenges with Tokenization for LLMs

Beyond the difficulty of conveying a token’s exact meaning and context to the model, there is another problem that can break language models entirely: glitch tokens.

Glitch tokens are tokens that, when given as input to many language models, cause the model to produce anomalous outputs. Examples include ‘SolidGoldMagikarp’, ‘attRot’, and ‘ysics’.

It is believed that this happens because the tokenizers of these LLMs are trained on a huge corpus scraped directly from the web, while the text the LLMs themselves are trained on is curated far more carefully. As a result, some tokens appear in the tokenizer’s training set but not in the LLM’s training data, so the language model never learns how to deal with them and breaks when it encounters them.

It has also been observed that when k-means is used to cluster similar tokens, these glitch tokens tend to sit near the cluster centres, although the reason for this is not confirmed.

Tokenizer-Free Approach for LLMs

A recent paper, ‘T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings’, proposes representing text directly with sparse vectors, which lets the model skip the heavy tokenization step entirely. This sparse representation can be more memory-efficient while still preserving the semantic information that token embeddings would otherwise capture.

Research Papers on Tokenization

Here are some research papers that I recommend reading to learn more about tokenization-