Understanding Tokenization: Breaking Down Language for AI Systems


24 Jul 2025


Imagine sitting down with an alien from Mars. They’re eager to learn about humans, but there’s one problem: they don’t understand a word you’re saying. To communicate, you’d need to break your language down into small, digestible units and assign each one a meaning they can learn over time. This is exactly what happens when humans teach machines to understand language. Before any AI can answer a question, write a poem, or summarize a document, it first needs to convert our words into something it understands: numbers. That process starts with tokenization, the method of slicing language into manageable chunks called tokens, and the first stepping stone in making sense of human communication.

Now let’s replace aliens with computers. Computers operate on numbers, and so do AI systems. For example, if a system needs to process an image, it divides it into smaller units, pixels, and represents them using numbers (their RGB values!). In the same way, for an AI model to interpret text, the text first needs to be converted into a sequence of numbers that can be used for further processing. This is precisely what tokenization does! Tokenization is the art and science of chopping up text into individual units called "tokens." These tokens are the building blocks that machines use to understand, analyze, and even generate human language.

Before we continue with tokenization, if you are feeling lost about what an LLM is or how it is trained, you might want to first read our article: “How are Large Language Models trained? A step-by-step guide to LLM training.”

Why does tokenization matter?

At its core, NLP is about enabling computers to understand, interpret, and generate human language in a way that is useful. But raw text, as we write and speak it, is messy and unstructured. It has varying sentence lengths, punctuation, capitalization, and often, misspellings or slang. Before any sophisticated analysis can happen—whether it's translating a webpage, powering a chatbot, or summarizing a long document—the text needs to be standardized and broken down into a format that algorithms can easily process. Tokenization is the very first step in this crucial preparation phase, acting as the bridge between raw, unstructured text and the structured data that machine learning models require.

What is Tokenization?

Simply put, tokenization is the process of converting a sequence of characters into a sequence of tokens. These tokens can be words, phrases, or even individual characters, depending on the approach (we will discuss all these approaches further on). The main goal is to transform raw text into a structured format that algorithms can easily analyze and process.

Consider the sentence: "Frog on a log."

A tokenizer would break this sentence down into individual tokens, such as: `[CLS]`, `Frog`, `on`, `a`, `log`, `.`, `[SEP]`. You might notice some extra tokens like `[CLS]` and `[SEP]`. These are special tokens often used in more advanced NLP models to mark the beginning and end of a sentence or text segment, providing additional structural information. Each of these tokens is then mapped to a numerical ID (e.g., "Frog" might be 2025, "on" might be 2001, etc.). This numerical representation is what the machine truly "sees" and works with. 
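For readers who want to see this in code, here is a minimal sketch; it assumes the Hugging Face `transformers` library and the pretrained `bert-base-uncased` tokenizer (neither is required by the article itself), and the exact tokens and IDs it prints will differ from the illustrative numbers above:

```python
# A minimal sketch, assuming the Hugging Face transformers library is installed
# and the pretrained "bert-base-uncased" tokenizer is used; exact tokens and IDs
# will differ from the illustrative numbers above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("Frog on a log.")
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])

print(tokens)                 # e.g. ['[CLS]', 'frog', 'on', 'a', 'log', '.', '[SEP]']
print(encoding["input_ids"])  # the numerical IDs the model actually "sees"
```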

Normalization: Cleaning Up the Text

Language is complex and flexible, and it is created by humans, who often make mistakes or bend the rules of grammar. So, before we can start breaking text into tokens, we often need to normalize it. Normalization is a critical preprocessing step that cleans and standardizes the text, making it consistent and easier for tokenizers to handle. Think of it as preparing and laying out your ingredients before you start cooking.

Key aspects of normalization include:

Cleaning Characters: Removing unwanted characters like hashtags, special symbols, or extra spaces. For example, "#unwanted_characters" might become "unwanted characters" after cleaning.

Converting to Lowercase: Transforming all text to lowercase helps ensure that words like "The" and "the" are treated as the same word, preventing the model from learning two separate representations for what is essentially the same meaning.

Tokenization (Preliminary): While tokenization is a broad topic, the normalization step often involves an initial pass at tokenizing sentences or words. For example, a sentence like "Yesterday I was playing with my friends. I always play but SOMETIMES I get angry when I lose." could first be split into two sentences: ["yesterday i was playing with my friends.", "i always play but sometimes i get angry when i lose."]. These sentences can then be further tokenized into individual words: ["yesterday", "i", "was", "playing", "with", "my", "friends", "."], ["i", "always", "play", "but", "sometimes", "i", "get", "angry", "when", "i", "lose", "."].
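As a rough sketch of these normalization steps using only Python's standard `re` module (the exact cleaning rules here are illustrative assumptions rather than a fixed standard):

```python
import re

raw = "Yesterday I was playing with my friends. I always play but SOMETIMES I get angry when I lose."

# Cleaning: replace anything that isn't a letter, digit, whitespace, or basic
# punctuation with a space, then collapse repeated whitespace
cleaned = re.sub(r"[^a-zA-Z0-9\s.!?']", " ", raw)
cleaned = re.sub(r"\s+", " ", cleaned).strip()

# Lowercasing: "SOMETIMES" and "sometimes" become the same token
lowered = cleaned.lower()

# Preliminary tokenization: first into sentences, then into words
sentences = re.split(r"(?<=[.!?])\s+", lowered)
words = [re.findall(r"[a-z0-9']+|[.!?]", s) for s in sentences]

print(sentences)  # ['yesterday i was playing with my friends.', 'i always play but sometimes i get angry when i lose.']
print(words)
```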

Stopwords: Filtering Out the Noise

Once text is tokenized, we often encounter words that, while grammatically necessary, don't carry much significant meaning for analysis. These are called stopwords. Stopwords are common words such as articles ("the", "a"), pronouns ("he", "she"), prepositions ("on", "in"), and conjunctions ("and", "but"), among many others.

Figure: Common stopwords in the English language.

Consider the sentence: "The quick brown fox jumped over the lazy dog's back." If we're trying to understand the core subject of this sentence, words like "the" and "over" don't add much value. Removing them can help reduce the amount of data the model needs to process and focus attention on the more informative words.

Stopwords are typically filtered out either before or after the main text processing steps. The list of stopwords can vary depending on the specific NLP task and language; a general list of English stopwords can easily run to hundreds of words.
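As a small illustration, here is a sketch that assumes NLTK is installed and its English stopword list has been downloaded:

```python
# A small sketch assuming NLTK is installed and its English stopword list has
# been downloaded with nltk.download("stopwords").
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tokens = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog's", "back", "."]

# Keep only the tokens that are not stopwords
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # e.g. ['quick', 'brown', 'fox', 'jumped', 'lazy', "dog's", 'back', '.']
```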

Stemming vs. Lemmatization: Getting to the Root

Another crucial part of normalization is reducing words to their base or root form. This helps in treating different forms of the same word (e.g., "run," "running," "ran") as a single unit, which is essential for accurate analysis and generalization. Two common techniques for this are stemming and lemmatization.

Stemming: A Stemmer shortens words using a heuristic process. It's a faster and simpler method, essentially chopping off suffixes from words to get to a "stem".  For example, "ending," "loves," "wants," "started," "buying," and "likely" might be stemmed to "end," "love," "want," "start," "buy," and "like" respectively. Notice that the "stem" might not even be a real word.  For instance, "amusing" might be stemmed to "amus," and "university" to "univers". While quick, stemming isn't always optimal because it doesn't consider the context or meaning of the word.

Lemmatization: A lemmatizer is more sophisticated. In addition to the word itself, it considers the word's function in the text to derive its dictionary form, known as the lemma. This means it can distinguish between "bet" the verb and "bet" the noun, and it knows that the lemma of "better" is "good." So, "was" becomes "be," "better" becomes "good," and "bought" becomes "buy." Lemmatization is generally more accurate than stemming but can be computationally more intensive due to its reliance on lexical knowledge bases.

Figure: Comparison between Stemming and Lemmatization.
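Here is a short sketch of both techniques, assuming NLTK is installed and its WordNet data has been downloaded:

```python
# A short sketch assuming NLTK is installed and the WordNet data has been
# downloaded with nltk.download("wordnet").
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["amusing", "university", "ending", "loves", "started"]
print([stemmer.stem(w) for w in words])
# Stems are not always real words, e.g. 'amus' and 'univers'

# Lemmatization uses the word's part of speech to find its dictionary form
print(lemmatizer.lemmatize("was", pos="v"))     # 'be'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'
print(lemmatizer.lemmatize("bought", pos="v"))  # 'buy'
```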

Types of Tokenization

There is no single way to tokenize. Beyond simple word-level tokenization, there are other strategies, each with its own advantages and disadvantages, as summarized in the table below:

Table: Tokenization strategies, advantages and disadvantages.

| Tokenization Strategy | Advantages | Disadvantages |
| --- | --- | --- |
| Character-level tokenization: each individual character is treated as a token. | 👍 It can model any word, even those it hasn't seen before, because it only needs to know the basic alphabet. This is particularly useful for handling misspellings or highly specialized vocabulary. | 👎 The model needs to learn the relationships between characters, which can be computationally expensive and may not always capture the semantic meaning of words effectively. |
| Word-level tokenization: the most intuitive form, where each word in a sentence is considered a token. | 👍 It directly aligns with how humans understand language, making the tokens semantically meaningful. | 👎 If a word doesn't exist in the model's training vocabulary (an "out-of-vocabulary" word), the model cannot directly process it. Also, different forms of the same word (e.g., "walk" and "walked") are treated as separate tokens, potentially leading to a larger vocabulary and less efficient learning. |
| Subword-level tokenization: a compromise between character-level and word-level modeling. Instead of breaking text down to individual characters or full words, it breaks words into smaller, meaningful subword units, so the final vocabulary includes both full words and fragments. | 👍 It can handle out-of-vocabulary words by breaking them into known subwords. It also reduces the overall vocabulary size compared to word-level tokenization while still maintaining some semantic meaning. This is why it's more commonly used in practice for large language models (LLMs). | 👎 It requires a specialized training process (a "token learner") to determine the optimal subword units and a segmentation process (a "token segmenter") to break down new text. |
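To make the difference in granularity concrete, here is a tiny illustration; the subword split shown is hand-picked for the example rather than produced by a trained tokenizer:

```python
text = "Frog on a log."

# Character-level: every character (including spaces and punctuation) is a token
char_tokens = list(text)

# Word-level: whole words are tokens (here via a naive whitespace split)
word_tokens = text.split()

# Subword-level: rare words are split into known pieces; this particular split
# is hand-picked for illustration, not produced by a trained tokenizer
subword_tokens = ["un", "believ", "able"]  # a plausible split of "unbelievable"

print(char_tokens)
print(word_tokens)
print(subword_tokens)
```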

Popular Subword Tokenization Algorithms:

Several algorithms have been developed for subword tokenization, each with its unique approach:

Byte-Pair Encoding (BPE): Introduced by Sennrich et al. in 2016, BPE works by iteratively merging the most frequent pairs of characters or character sequences in a text until a desired vocabulary size is reached. For example, if "low" and "er" frequently appear together, they might be merged into "lower." This method is widely used in many modern NLP models; a toy sketch of the merge loop appears after this list.

Unigram Language Modeling: Developed by Kudo in 2018, this algorithm learns a vocabulary of subword units based on their probability of occurrence. It aims to find the most probable segmentation of a word into subwords.

WordPiece and SentencePiece: These are other popular subword tokenization methods, often used in large-scale NLP models like Google's BERT and T5. They build upon the principles of BPE and Unigram LM, optimizing for efficiency and performance in diverse linguistic contexts.
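To get a feel for how BPE builds its vocabulary, here is the toy sketch of the merge loop mentioned above, run on a made-up five-word corpus; production implementations (such as the Hugging Face `tokenizers` library) add byte-level handling and many optimizations on top of this idea:

```python
from collections import Counter

# A toy version of the BPE merge loop on a made-up corpus.
corpus = ["low", "lower", "lowest", "newer", "wider"]

# Start with each word as a sequence of characters plus an end-of-word marker
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # merge the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):  # the number of merges controls the final vocabulary size
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge(words, pair)
    print(f"merge {step + 1}: {pair} -> {words}")
```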

From Tokens to Vectors: The Concept of Vectorization

Once our text is tokenized and normalized, how do computers actually "understand" these tokens? They do so by converting them into numerical representations called vectors, a process known as vectorization. Essentially, every word or token is assigned a unique numerical vector (a list of numbers) that captures its meaning and its relationship to other words.

Think of it like plotting points on a graph. Words with similar meanings, or that appear in similar contexts, will have vectors that are numerically "closer" to each other in this multi-dimensional space.

Figure: Tokens are vectorized into lists of numbers that represent them in a high-dimensional space encoding the relationships between them. Image from https://community.intersystems.com/

Two common vectorization techniques are:

Bag of Words (BoW): This is a simple yet effective method where a text (like a document or sentence) is represented as a "bag" of its words, disregarding grammar and even word order, while only caring about the statistics of word occurrence.  The core idea is to create a vocabulary of all unique words in a given collection of documents (the corpus).  Then, each document is represented as a vector indicating the frequency of each word's occurrence in that document.

For example, consider two documents:

    - Document 1: "The quick brown fox jumped over the lazy dog's back."

    - Document 2: "Now is the time for all good men to come to the aid of their party."

First, we create a vocabulary from all unique words across both documents (excluding stopwords for simplicity). Then, for each document, we count how many times each word from the vocabulary appears, and these counts form the document's vector. In the simplest binary variant, an entry is 1 if the word appears in the document and 0 if it doesn't; in the frequency variant, the entry is the word's count.

Figure: Illustration of Bag of Words (BoW), where each word is represented simply by the statistics of its occurrence, disregarding grammar or word order.
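As a small sketch of BoW in practice, assuming scikit-learn is available, `CountVectorizer` builds the vocabulary and the count vectors in one step:

```python
# A small sketch assuming scikit-learn is installed; CountVectorizer builds the
# vocabulary and the count vectors in one step (English stopwords are removed).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The quick brown fox jumped over the lazy dog's back.",
    "Now is the time for all good men to come to the aid of their party.",
]

vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one count vector per document
```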

A limitation of basic BoW is that it doesn't capture phrases where words grouped together have specific meanings, like "United States" or "social networks".  To address this, we can use N-grams, which are sequences of N contiguous words. For instance:

    Unigrams (N=1): Each word is a token (e.g., "This", "is", "a", "sentence")

    Bigrams (N=2): Pairs of words are tokens (e.g., "This is", "is a", "a sentence")

    Trigrams (N=3): Three-word sequences are tokens (e.g., "This is a", "is a sentence")

By including N-grams, our vocabulary expands to include these multi-word phrases, allowing the model to capture more contextual meaning, such as in “United States”.  However, we must be careful with the number of terms, as increasing N can lead to a very large vocabulary.
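Continuing the scikit-learn sketch, setting `ngram_range=(1, 2)` adds bigrams to the vocabulary alongside the unigrams (the example sentence is made up for illustration):

```python
# Continuing the CountVectorizer sketch above, with unigrams and bigrams;
# the example sentence is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the united states border the pacific ocean"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)

# The vocabulary now contains unigrams such as 'united' as well as
# bigrams such as 'united states'
print(vectorizer.get_feature_names_out())
```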

TF-IDF (Term Frequency-Inverse Document Frequency): While Bag of Words tells us how often a word appears in a document, it doesn't tell us how important that word is. Words that appear in almost all documents (like "the" or "is") are less informative than words that are unique to a particular document.

TF-IDF addresses this by measuring not only the frequency of a word within a single document (Term Frequency, TF) but also its frequency across the entire collection of documents (Inverse Document Frequency, IDF). Words that appear frequently in a specific document but rarely across the whole corpus receive a higher TF-IDF score, indicating their greater importance for that document. 

Let's break down the components:

    Term Frequency (TF): This measures how frequently a term (word) appears in a document.

    Document Frequency (DF): This measures the number of documents in the corpus that contain a specific term.

    Inverse Document Frequency (IDF): This is derived from the Document Frequency. It's essentially a logarithmic scaling of the inverse of the DF, giving rarer words higher scores. Words that appear in many documents will have a low IDF score, while words appearing in only a few documents will have a high IDF score.

TF-IDF Score: The final TF-IDF score for a term in a document is the product of its TF and IDF values. A high TF-IDF score suggests that the word is very specific and important to that particular document, making it a powerful tool for tasks like information retrieval and document similarity.
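A brief sketch, again assuming scikit-learn, where `TfidfVectorizer` combines these steps into a single transformer:

```python
# A brief sketch, again assuming scikit-learn; TfidfVectorizer combines the TF
# and IDF steps described above into a single transformer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumped over the lazy dog's back.",
    "Now is the time for all good men to come to the aid of their party.",
    "The quick brown fox is quick.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Words that are frequent in one document but rare across the corpus
# get the highest scores for that document
for word, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[2]):
    if score > 0:
        print(f"{word}: {score:.2f}")
```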

Conclusion: The Foundation of Language Understanding

Tokenization, along with normalization and vectorization, forms the foundation and the first part of the pipeline for preparing textual data for machine learning models. Tokenization is the necessary bridge between the way humans produce and understand text and the way LLMs interpret and process it. From breaking down sentences into individual words to understanding the importance of those words within a larger context, these steps are crucial for automated systems to process and comprehend the nuances of human language.

The principles of tokenization are at play in countless real-world applications these days, including:

Chatbots and Virtual Assistants: Tokenization allows LLMs in systems like ChatGPT or Gemini to break down your queries, understand your intent, and generate relevant responses.

Search Engines: When you type a query, tokenization helps the search engine match your individual words and phrases to content across billions of web pages.

Machine Translation: Breaking down text into tokens and then converting them into numerical representations is a core part of how LLMs are able to translate text, allowing them to map words and phrases from one language to another.

Spam Detection: By analyzing the tokens and their frequencies in an email, systems can identify patterns indicative of spam.

While it might seem like a simple concept, the different tokenization strategies and their implications for how machines "see" language are complex and constantly evolving, especially with the rise of powerful  LLMs. Understanding tokenization is the first step on an exciting journey into the world of natural language processing, where the goal is to bridge the gap between human communication and computational understanding. 

If you liked this article and want to learn more about tokenization, LLMs, and how to use them in practice to build powerful LLM-based applications, we invite you to apply to our program here! Anyone AI offers an intensive 4.5-month training program designed for engineers and software developers who want to become Machine Learning & AI experts.

`[CLS]`, `Join`, `AnyoneAI`, `and`, `do`, `not`, `miss`, `out`, `!`, `[SEP]`

