
What is Tokenization?


A token is a collection of characters that has semantic meaning for a model. Tokenization is the process of converting the words in your prompt into tokens.
— IBM watsonx documentation

Text is split into tokens before being processed by an LLM. And it's not always simple.

Tokenization is how we break text into smaller parts that a language model can understand. Think of it like prepping ingredients before you cook — models can't handle a whole sentence at once in raw form, so we chop it up into pieces called tokens. These might be whole words, chunks of words, or even just characters.

Let’s say you're feeding a sentence like "ChatGPT is amazing!" to a model. A tokenizer might split that into: ["Chat", "G", "PT", " is", " amazing", "!"]. These tokens are then converted into numerical IDs using a fixed vocabulary the model was trained on. This numeric form is what the model actually processes.
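To make the token-to-ID step concrete, here's a toy sketch. The vocabulary below is invented for this example (real models use learned vocabularies with tens of thousands of entries), and the token split follows the illustration above:

```python
# Toy illustration: map tokens to numeric IDs with a small fixed vocabulary.
# This vocabulary is made up for the example; real models ship a learned
# vocabulary of ~50k-100k entries, frozen at training time.
vocab = {"Chat": 0, "G": 1, "PT": 2, " is": 3, " amazing": 4, "!": 5}

def encode(tokens):
    """Convert a list of string tokens to their numeric IDs."""
    return [vocab[t] for t in tokens]

tokens = ["Chat", "G", "PT", " is", " amazing", "!"]
print(encode(tokens))  # [0, 1, 2, 3, 4, 5]
```

Those IDs, not the raw characters, are what the model's layers actually consume.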

Different tokenization algorithms take different approaches:

  • Byte Pair Encoding (BPE): Merges common character sequences to form subwords; used in GPT models.
  • Unigram: Selects tokens based on probability; used in SentencePiece (e.g., T5).
  • WordPiece: Breaks words into frequent subwords; used in BERT.

Each approach balances trade-offs between compression, interpretability, and flexibility across languages.
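To see how BPE builds subwords, here's a minimal sketch of its training loop: repeatedly find the most frequent adjacent pair of symbols and fuse it into one token. This toy version skips details real implementations handle (word-frequency weighting, byte-level fallback, special tokens):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word as a sequence of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair occurs across the corpus.
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest"], num_merges=2)
print(merges)  # -> [('l', 'o'), ('lo', 'w')]
print(corpus)  # -> [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

Notice how the shared stem "low" emerges as a single token after just two merges, while the rarer suffixes stay split — that's the compression/flexibility trade-off in action.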

Why It Matters:

  • Context Length: LLMs have a token limit — not a word limit. A long input may hit the cap sooner than expected.
  • Cost: Many LLM APIs (like OpenAI or Anthropic) bill per token. Efficient tokenization = cheaper calls.
  • Accuracy: Misaligned token boundaries can distort intent or make a model misinterpret input.
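For back-of-the-envelope budgeting, a common rule of thumb is that English text averages roughly four characters per token. The heuristic and the per-token price below are illustrative assumptions, not any vendor's actual rate card — always check your provider's tokenizer and pricing:

```python
def estimate_tokens(text):
    """Rough heuristic: English averages ~4 characters per token.
    Real counts come from the model's actual tokenizer; this is a ballpark."""
    return max(1, len(text) // 4)

def estimate_cost(text, usd_per_1k_tokens=0.002):
    """Estimate API cost. The default price is hypothetical."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the following meeting notes in three bullet points."
print(estimate_tokens(prompt), "tokens (approx.)")
print(f"${estimate_cost(prompt):.6f} (at a hypothetical $0.002/1k tokens)")
```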

Real-World Implications:

  • Writing prompt-friendly input (e.g., avoiding weird hyphenation or rare Unicode) can save tokens
  • When a model gives confusing output, developers often inspect how their prompt was tokenized
  • Tools like OpenAI’s tokenizer playground or Hugging Face's tokenizers library help visualize token breakdowns
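Why can the same text tokenize differently across models? One big reason is the vocabulary itself. The sketch below uses a greedy longest-match split (roughly the idea behind WordPiece) with two invented vocabularies to show how the split depends entirely on which subwords the model knows:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match split (roughly how WordPiece picks tokens).
    Falls back to single characters when nothing in the vocab matches."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible substring first, shrinking until a match.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it alone
            i += 1
    return tokens

# Two made-up vocabularies split the same word differently:
print(greedy_tokenize("tokenization", {"token", "ization", "iz", "ation"}))
print(greedy_tokenize("tokenization", {"tok", "en", "ization"}))
```

The first vocabulary yields two tokens, the second three — same input, different token count, and therefore a different bill and a different slice of the context window.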

So next time your prompt isn’t behaving, check what the model thinks you said. It might be a tokenization mismatch under the hood.

FAQ

What is a token in an LLM?
A token is a small chunk of text — it could be a word, part of a word, or even just punctuation — used as the basic input unit for an LLM.
How does tokenization affect cost in an LLM?
Most LLM providers charge by token. So shorter, more efficient prompts can reduce API costs.
Can I see how my input is tokenized in an LLM?
Yes — most LLM vendors provide tools to preview tokenization. OpenAI, for example, has a tokenizer tool online.
Does every LLM tokenize text the same way?
No — different models use different tokenizers and vocabularies. This can lead to variations in how the same text is split.
