A token is a sequence of characters that carries semantic meaning for a model, and tokenization is the process of converting the text of your prompt into tokens.
Text is split into tokens before an LLM processes it, and the splitting isn't always as simple as it looks.
Tokenization is how we break text into smaller parts that a language model can understand. Think of it like prepping ingredients before you cook — models can't handle a whole sentence at once in raw form, so we chop it up into pieces called tokens. These might be whole words, chunks of words, or even just characters.
Let's say you're feeding a sentence like "ChatGPT is amazing!" to a model. A tokenizer might split that into: ["Chat", "G", "PT", " is", " amazing", "!"]. These tokens are then converted into numerical IDs using a fixed vocabulary the model was trained on. This numeric form is what the model actually processes.
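To make the token-to-ID step concrete, here is a minimal sketch using the tiktoken library (an assumption; the same idea works with any tokenizer). The exact pieces and IDs depend entirely on the vocabulary, so they will differ from the illustrative split above.

```python
# Minimal sketch: text -> token IDs -> text pieces (tiktoken is an assumption).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one common encoding; others exist

text = "ChatGPT is amazing!"
ids = enc.encode(text)                       # text -> numeric token IDs
pieces = [enc.decode([i]) for i in ids]      # decode each ID back to its text piece

print(ids)     # the integers the model actually sees
print(pieces)  # the splits, which may differ from the example above
```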
Different tokenization algorithms take different approaches:

- Byte-Pair Encoding (BPE) starts from characters and repeatedly merges the most frequent adjacent pairs (used by the GPT family).
- WordPiece makes similar merges but chooses them by likelihood rather than raw frequency (used by BERT).
- Unigram/SentencePiece starts from a large candidate vocabulary and prunes it down, scoring segmentations probabilistically (used by T5 and many multilingual models).

Each approach balances trade-offs between compression, interpretability, and flexibility across languages.
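You can see those differences in practice with a hedged sketch that runs the same word through a byte-level BPE tokenizer and a WordPiece tokenizer via Hugging Face transformers (the model names and the example splits in the comments are illustrative assumptions).

```python
# Compare how two tokenization algorithms split the same word.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                      # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece

word = "tokenization"
print(bpe.tokenize(word))        # e.g. ['token', 'ization']
print(wordpiece.tokenize(word))  # e.g. ['token', '##ization']
```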
Tools like the `tokenizers` library help visualize token breakdowns. So next time your prompt isn't behaving, check what the model thinks you said. It might be a tokenization mismatch under the hood.
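One quick way to do that check, as a sketch assuming the Hugging Face `tokenizers` package and a pretrained vocabulary chosen purely for illustration:

```python
# Inspect exactly what the model "thinks you said".
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")

for text in ["ChatGPT is amazing!", "ChatGPTisamazing!"]:
    enc = tok.encode(text)
    # A missing space can shatter the input into very different pieces.
    print(text, "->", enc.tokens)
```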