Tokenizer


What is a tokenizer?

A tokenizer is the component that converts raw text into the sequence of tokens a language model actually reads. Most modern models use a sub-word scheme such as byte-pair encoding (BPE), which breaks text into common chunks: frequent words stay whole, while rare words split into pieces. The reverse step, decoding, turns the model's output tokens back into text. Tokenizers are model-specific, so the same sentence can become a different number of tokens depending on which tokenizer is used. That is why token counts aren't directly comparable across providers.
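
As an illustration of the sub-word idea, here is a minimal toy sketch of BPE in Python. It is not any provider's actual tokenizer: production tokenizers typically work on bytes, use large learned vocabularies, and apply extra pre-processing. The function names (`train_bpe`, `merge_pair`, `encode`, `decode`) and the tiny corpus are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def decode(tokens):
    """The reverse step: join tokens back into text."""
    return "".join(tokens)


# Hypothetical corpus: "the" is frequent, "zebra" is rare.
corpus = ["the"] * 10 + ["then"] * 3 + ["zebra"]
merges = train_bpe(corpus, num_merges=2)

print(encode("the", merges))    # ['the']
print(encode("then", merges))   # ['the', 'n']
print(encode("zebra", merges))  # ['z', 'e', 'b', 'r', 'a']
```

The frequent word ends up as a single token, while the rare one falls back to individual characters. A different corpus or merge budget would give different splits, which is the toy version of why two providers' token counts for the same sentence differ.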

Why tokenizers matter in 2026

Tokenizers matter because code tokenizes less efficiently than prose, and that directly affects an AI coding agent's cost. Identifiers, punctuation, whitespace, and rare symbols common in source code often split into more tokens than ordinary English, so a file of code can cost more tokens than its character count suggests. This is why measuring real token usage with the model's own tokenizer beats eyeballing character counts when you're trying to cut your bill.
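
One way to make that measurement concrete is to compute characters per token for a prose sample and a code sample. The sketch below assumes a stand-in tokenizer, `toy_encode`, with a tiny hand-picked vocabulary (`TOY_VOCAB`), so the numbers it produces are purely illustrative. For real figures, pass your model's actual tokenizer as the `encode` argument of `chars_per_token`.

```python
# Hypothetical stand-in vocabulary. A real tokenizer has a far larger learned one.
TOY_VOCAB = {"the", "user", "name", "is", "not", "set", "get", "return", " "}


def toy_encode(text):
    """Greedy longest-match against TOY_VOCAB, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def chars_per_token(text, encode):
    """Average characters per token for `text` under the given `encode` callable."""
    return len(text) / len(encode(text))


prose = "the user name is not set"
code = "if(!usr_nm){return-1;}"

prose_ratio = chars_per_token(prose, toy_encode)
code_ratio = chars_per_token(code, toy_encode)

print(f"prose: {len(toy_encode(prose))} tokens, {prose_ratio:.2f} chars/token")
print(f"code:  {len(toy_encode(code))} tokens, {code_ratio:.2f} chars/token")
```

With this toy vocabulary the code sample is shorter in characters yet produces more tokens than the prose sample, because the abbreviated identifier and the punctuation have no vocabulary entries and split character by character.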

When the tokenizer detail doesn't matter

  • For rough estimates: the ~4-characters-per-token rule of thumb is fine when you just need a ballpark.
  • Within a single model: if you never switch providers, every request goes through the same tokenizer, so you can compare token counts in relative terms without worrying about how another tokenizer would count them.
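
The rule of thumb from the first bullet can be sketched as a small estimator. The 4-characters-per-token figure comes from the text above and is only a ballpark. The function names and the price used in the example are hypothetical placeholders, not any provider's real rate.

```python
import math

CHARS_PER_TOKEN = 4  # rule-of-thumb value from the text; approximate


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Ballpark token count from character length alone (no tokenizer involved)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(text, price_per_million_tokens):
    """Ballpark cost, given a price per million tokens (hypothetical input)."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000


sample = "x" * 400
print(estimate_tokens(sample))      # 100
print(estimate_cost(sample, 3.0))   # cost at a made-up price of 3.0 per million tokens
```

Because source code tends to split into more tokens than prose, an estimate like this will usually run low for code-heavy input. It is a quick sanity check, and measuring with the real tokenizer is still the way to get a number you can bill against.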
