BPE (Byte-Pair Encoding)

Cite this page

What is BPE (byte-pair encoding)?

Byte-pair encoding (BPE) is the sub-word algorithm that most modern tokenizers use to chop text into the tokens a language model actually reads. It starts from raw bytes and, during training, repeatedly finds the most frequent adjacent pair of symbols and merges them into a new symbol. Do that a few tens of thousands of times and you get a vocabulary where common words ("the", "import", "return") survive as single tokens while rare strings fall apart into smaller pieces. The mechanism is almost embarrassingly simple — it began life as a 1994 data-compression trick before NLP borrowed it — and that is exactly why it won. It needs no language-specific rules, it never hits an unknown word (worst case, it backs off to individual bytes), and the merge table is tiny. When people say a model "thinks in tokens," BPE is usually the thing that decided where the token boundaries fall.

Why it matters in 2026

It matters because BPE is why your code costs more tokens than your prose, and tokens are what you pay for. English text trained the merge tables, so an ordinary sentence packs ≈4 characters per token. Source code does not get that deal: snake_case identifiers, deep indentation, } ) ; punctuation runs, and CamelCase symbols that never appeared often enough to earn their own merge all shatter into several tokens each. I have measured the same logic costing 30–50% more tokens as code than its English description would — and at Claude Opus 4.8's $5 / $25 per MTok, or GPT-5.5's $5 / $30, that overhead lands straight on the invoice every time an agent re-reads a file. This is the whole reason I built Tokenade. Once you accept that BPE is going to be expensive on code, the lever is to stop shoving so many code tokens through it: Tokenade uses semantic code search and skeleton compression to feed the agent the few functions that matter instead of whole files, filters tool output before it re-enters the context window, and loads MCP tools lazily — with a dashboard so you can see the saved tokens rather than take my word for it. It drops into Claude Code, Cursor, Codex, Copilot and Windsurf, the free tier covers ~20M tokens/month, and the source is MIT so you can read exactly what it sends.

When NOT to use it (or worry about it)

  • For rough cost estimates — if you only need a ballpark, the ≈4-chars-per-token rule is close enough; you do not need to think about merge tables.
  • Comparing token counts across providers — BPE vocabularies differ per model, so "1,000 tokens" on one tokenizer is not the same text as 1,000 on another. Count with the right tokenizer or don't compare.
  • As a thing you can tune — you generally can't change a model's BPE vocabulary, so optimizing the encoder itself is a dead end. Spend that energy on sending fewer, denser tokens instead.

See also