Glossary

Plain-language definitions of 16 token, LLM and AI coding agent terms — from context windows to MCP, written for developers who pay the bills.

BPE (Byte-Pair Encoding)
Byte-pair encoding (BPE) is the sub-word algorithm most tokenizers use: it builds a vocabulary by repeatedly merging the most frequent adjacent pair of bytes or symbols, then splits text into tokens using those merges.
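To make the merge step concrete, here is a minimal sketch of BPE training over raw UTF-8 bytes. The function names (`most_frequent_pair`, `merge_pair`, `learn_merges`) are illustrative and not taken from any tokenizer library. Production tokenizers add pre-tokenization rules and a fixed vocabulary that this toy version leaves out.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(text, num_merges):
    """Start from single bytes and apply up to `num_merges` greedy merges."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


if __name__ == "__main__":
    tokens, merges = learn_merges("aaabdaaabac", num_merges=1)
    print(merges)  # the pair (b"a", b"a") is merged first
    print(tokens)
```

Each merge shortens the token sequence, which is why frequent strings end up costing fewer tokens than rare ones.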
Context Compression
Context compression shrinks what an agent feeds the model — via skeletons, summaries and filtering — while preserving the signal, so the context window stays small and cheap.
Context Window
A context window is the maximum amount of text, measured in tokens, a model can consider at once — everything the agent reads on a turn must fit inside it.
Embeddings
Embeddings are numeric vector representations of text (or code) that capture semantic meaning, letting AI models find, rank, and reason about content by similarity rather than keyword match.
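As a rough illustration of ranking by similarity, the sketch below compares vectors with cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are made up by hand for the example. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus):
    """Return corpus keys ordered from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda name: cosine_similarity(query_vec, corpus[name]),
        reverse=True,
    )


# Hypothetical hand-written vectors, standing in for model output.
TOY_VECTORS = {
    "parse_json": [0.9, 0.1, 0.0],
    "load_config": [0.7, 0.3, 0.1],
    "render_button": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    print(rank_by_similarity([1.0, 0.0, 0.0], TOY_VECTORS))
```

The same ranking step underlies semantic code search: the query and each code chunk are embedded, and the closest chunks are returned.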
Fine-Tuning
Fine-tuning continues training a pre-trained model on your own examples so it learns your style, formats, or domain — trading upfront training cost for shorter, cheaper prompts at inference.
Input Tokens vs Output Tokens
Input tokens are what you send a model; output tokens are what it generates. They're priced differently, and for AI coding agents the input side quietly dominates the bill.
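A small worked calculation shows why the input side can dominate. The prices below are hypothetical placeholders, not any provider's actual rates, and `estimate_cost` is an illustrative helper.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) given prices per million tokens."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost, output_cost


if __name__ == "__main__":
    # Hypothetical session: the agent re-reads a lot of context, writes little.
    input_cost, output_cost = estimate_cost(
        input_tokens=2_000_000,
        output_tokens=50_000,
        input_price_per_m=3.0,    # assumed price, for illustration only
        output_price_per_m=15.0,  # assumed price, for illustration only
    )
    print(f"input: ${input_cost:.2f}, output: ${output_cost:.2f}")
```

Even with output priced five times higher per token in this example, the input cost is several times larger because the agent sends far more tokens than it generates.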
KV Cache
The KV cache is the model's per-request memory of attention keys and values for tokens it has already processed — what makes each next token cheap to generate but eats GPU memory as context grows.
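A back-of-the-envelope estimate of that memory growth, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model dimensions used in the example are hypothetical, and real serving stacks may shrink the cache with techniques such as grouped-query attention or quantization.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single request.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 1024**3:.1f} GiB")  # prints "2.0 GiB"
```

The size is linear in `seq_len`, so doubling the context doubles the cache for that request.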
MCP (Model Context Protocol)
MCP is an open protocol that lets AI agents connect to external tools and data through servers — the standard way to extend coding agents like Claude Code.
Output Filtering
Output filtering compacts noisy command and tool output (logs, builds, test runs) down to its signal before it reaches the model, cutting tokens with little or no loss of meaning.
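One simple way to do this is to keep only the lines that look like problems, plus the last few lines, where most tools print their summary. The pattern and the `filter_output` helper below are assumptions for illustration. A real filter would be tuned per tool.

```python
import re

# Assumed heuristic: lines mentioning these words carry the signal.
SIGNAL = re.compile(r"error|fail|warning", re.IGNORECASE)


def filter_output(raw, tail=2):
    """Keep lines matching SIGNAL plus the final `tail` lines, in original order."""
    lines = raw.splitlines()
    keep = {i for i, line in enumerate(lines) if SIGNAL.search(line)}
    keep.update(range(max(0, len(lines) - tail), len(lines)))
    return "\n".join(lines[i] for i in sorted(keep))


if __name__ == "__main__":
    raw = "\n".join([
        "compiling a",
        "compiling b",
        "error: missing semicolon",
        "compiling c",
        "1 failed, 9 passed",
    ])
    print(filter_output(raw, tail=1))
```

Here five lines shrink to two, and the model still sees the error and the test summary.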
Prompt Caching
Prompt caching lets a model reuse a previously-processed, unchanging prompt prefix instead of re-billing it at full rate — cutting cost on long, repetitive sessions.
RAG (Retrieval-Augmented Generation)
RAG is a pattern that fetches relevant documents at query time and injects them into the LLM prompt, letting the model answer from current, specific knowledge without retraining.
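The fetch-then-inject flow can be sketched in a few lines. This toy version scores documents by shared words, which is only a stand-in for a real retriever (typically embedding search). The function names and prompt wording are illustrative assumptions.

```python
def score(query, doc):
    """Count distinct lowercase words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query, docs, k=2):
    """Return the k documents with the highest word overlap."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query, docs, k=2):
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "The deploy script lives in scripts/deploy.sh",
        "Unit tests run with pytest",
        "The deploy target is the staging cluster",
    ]
    print(build_prompt("where does the deploy script live", docs))
```

Only the retrieved documents reach the model, so the prompt stays small even when the document store is large.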
Rate Limit
A rate limit is a provider-enforced ceiling on how many tokens or requests an API client can send per minute or day; calls that exceed the threshold are throttled or blocked.
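A minimal sketch of a tokens-per-minute ceiling, using a fixed 60-second window. This is one possible scheme chosen for simplicity. Providers differ in how they count and reset limits, and the class name is hypothetical. The caller passes the current time in seconds so the behavior is easy to test.

```python
class TokensPerMinuteLimiter:
    """Fixed-window limiter: at most `limit_per_minute` tokens per 60 seconds."""

    def __init__(self, limit_per_minute):
        self.limit = limit_per_minute
        self.window_start = None
        self.used = 0

    def allow(self, tokens, now):
        """Record usage and return True if `tokens` fits in the current window."""
        if self.window_start is None or now - self.window_start >= 60:
            # Start a fresh window and reset the counter.
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


if __name__ == "__main__":
    limiter = TokensPerMinuteLimiter(limit_per_minute=1000)
    print(limiter.allow(600, now=0))   # True
    print(limiter.allow(600, now=10))  # False: would exceed 1000 in this window
    print(limiter.allow(600, now=61))  # True: a new window has started
```

A client can run the same check locally to back off before the provider rejects the call.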
Semantic Code Search
Semantic code search finds code by meaning rather than exact keywords, using embeddings — so an agent retrieves the relevant functions instead of reading whole files.
Token
A token is the unit of text an LLM processes — a word or sub-word chunk. AI coding agents are billed and rate-limited per token, input and output.
Tokenizer
A tokenizer splits text into the tokens a model processes, usually with byte-pair encoding (BPE). The same text yields different token counts under different tokenizers.
Tool Calling (Function Calling)
Tool calling lets an LLM ask your code to run a function — read a file, search code, hit an API — then continue with the result. It's how coding agents act.
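The application side of that loop can be sketched as a dispatcher: the model emits a tool name and JSON arguments, and your code runs the matching function and returns the result. The message shape (`name`, `arguments`, `result`) and the two example tools are assumptions for illustration and do not follow any specific provider's API.

```python
import json


def add(a, b):
    return a + b


def word_count(text):
    return len(text.split())


# Registry of functions the model is allowed to request.
TOOLS = {"add": add, "word_count": word_count}


def run_tool_call(call):
    """Execute one tool request and return a result message for the model."""
    name = call["name"]
    if name not in TOOLS:
        return {"tool": name, "error": f"unknown tool: {name}"}
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return {"tool": name, "result": TOOLS[name](**args)}


if __name__ == "__main__":
    # Hypothetical request, as a model might emit it.
    call = {"name": "add", "arguments": '{"a": 2, "b": 3}'}
    print(run_tool_call(call))
```

The result message is appended to the conversation as input on the next turn, which is one reason tool output counts toward input tokens.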