Embeddings

What are embeddings?

An embedding is a fixed-length list of numbers — a vector — that encodes the meaning of a piece of text, code, or other content so that semantically similar inputs land close together in vector space. When you ask an LLM-powered tool "find tests related to this function", it doesn't search for matching strings; it converts both the query and the candidate files into embeddings, then retrieves the nearest neighbours. The numbers themselves are meaningless in isolation. What matters is distance: two embeddings whose vectors are close (measured by cosine similarity or dot product) represent content that means roughly the same thing, even if they share no words.
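To make "distance" concrete, here is a minimal sketch of cosine similarity over toy vectors. The three-dimensional vectors and file names are made up for illustration; real embedding models emit hundreds or thousands of dimensions, and the values would come from a model rather than being written by hand.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot(a, b) / (norm_a * norm_b)


# Hypothetical embeddings: the query and two candidate files.
query = [0.9, 0.1, 0.0]
candidates = {
    "test_user_lookup.py": [0.8, 0.2, 0.1],
    "billing_report.py": [0.0, 0.1, 0.9],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
```

Here `ranked` puts `test_user_lookup.py` first because its vector points in nearly the same direction as the query, even though no strings were compared.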

Why embeddings matter for AI coding agents in 2026

Modern coding agents (Claude, Copilot, Cursor, and similar tools) operate under tight context-window budgets. Embedding-based retrieval is often the mechanism that decides which code fragments get loaded into that window before the model reads them. Poor retrieval means irrelevant context, which means wasted tokens and degraded answers.

Three patterns where embeddings appear directly in your workflow:

Semantic code search. Semantic code search tools embed your entire codebase at index time. At query time the agent embeds your natural-language question and pulls the k most similar chunks, typically a few hundred tokens each, into the prompt. The quality of the embedding model caps the quality of the retrieved context.
RAG pipelines. RAG (Retrieval-Augmented Generation) puts an embedding-based lookup step in front of the LLM call. Docs, runbooks, and Stack Overflow threads are embedded ahead of time and retrieved on demand. Without good embeddings, RAG degrades to keyword search with extra steps.
Fine-tuning signal. When you fine-tune a model on your codebase, training starts from the representations the base model has already learned. Starting from a strong semantic space generally means the model needs fewer gradient steps, and therefore fewer tokens of training data, to specialise.
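The retrieval step described above can be sketched as a top-k lookup followed by a token-budget check. This is an illustrative sketch, not how any particular agent implements it: the `Chunk` type, the `retrieve` function, the file paths, and the hand-written vectors are all assumptions, and a real system would get vectors from an embedding model and usually from a vector index rather than a linear scan.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str            # where the chunk came from (hypothetical paths below)
    vector: list[float]  # precomputed embedding of the chunk
    tokens: int          # token count of the chunk text


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def retrieve(query_vector, index, k, token_budget):
    """Take the k most similar chunks, then keep those that fit the budget."""
    top = heapq.nlargest(k, index, key=lambda c: cosine(query_vector, c.vector))
    selected, used = [], 0
    for chunk in top:  # already ordered from most to least similar
        if used + chunk.tokens > token_budget:
            continue  # skip chunks that would overflow the context budget
        selected.append(chunk)
        used += chunk.tokens
    return selected


# Toy index with made-up vectors and token counts.
index = [
    Chunk("auth/login.py", [0.9, 0.1, 0.0], 300),
    Chunk("auth/test_login.py", [0.8, 0.3, 0.0], 400),
    Chunk("billing/invoice.py", [0.0, 0.2, 0.9], 250),
    Chunk("docs/auth.md", [0.7, 0.1, 0.2], 500),
]

context = retrieve([1.0, 0.1, 0.0], index, k=3, token_budget=800)
```

With these toy numbers, the billing chunk never makes the top three, and the docs chunk is dropped because it would push the prompt past the 800-token budget. That is the sense in which retrieval quality and context budget interact.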

The token cost angle

Every embedding call has a token cost. OpenAI's text-embedding-3-small is billed per input token; text-embedding-3-large costs roughly 6.5× more at OpenAI's launch list prices ($0.13 versus $0.02 per million tokens) but produces higher-quality vectors. For a codebase of 500 k tokens, a single indexing pass is cheap, but the cost recurs every time files change and are re-embedded, and it multiplies across repositories and overlapping chunks. Chunking strategy (how you split files before embedding) is therefore both a quality and a cost decision: chunks that are too large dilute each vector with unrelated content and spend tokens on text that never helps a query; chunks that are too small lose cross-line context. Tokenade's compaction layer reduces the raw token count fed into downstream embedding calls by stripping noisy, repetitive output before it reaches the model, so your vector index stays leaner without sacrificing recall. See reducing AI coding agent token usage for the full picture.
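The cost side of chunking can be estimated with a few lines of arithmetic. The sketch below assumes fixed-size chunks with overlap, which is only one of several chunking strategies, and treats the per-million-token price as a parameter you supply from your provider's current price list. The function names are hypothetical.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the input
    return chunks


def embedding_cost(chunks, price_per_million_tokens):
    """Return (billed_tokens, cost) for embedding every chunk once."""
    billed = sum(len(chunk) for chunk in chunks)
    return billed, billed * price_per_million_tokens / 1_000_000


# Stand-in for a tokenised 500 k-token codebase; integers play the role of tokens.
codebase = list(range(500_000))
chunks = chunk_tokens(codebase, chunk_size=400, overlap=50)

# Placeholder price, not a quote: check your provider's current pricing.
billed_tokens, cost = embedding_cost(chunks, price_per_million_tokens=0.02)
```

Because overlapping chunks repeat tokens, `billed_tokens` comes out higher than the 500 k tokens in the codebase. Raising the overlap improves cross-chunk context and raises the bill on every re-index.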

When embeddings are NOT the right tool

Exact-match lookups. If you need to find every call site of getUserById, a grep or AST query is faster, cheaper, and perfectly precise. Embeddings trade exactness for semantic breadth; use them when intent matters more than spelling.
Tiny, stable codebases. If your repo fits in the model's context window in one shot, you don't need retrieval at all. Load everything and skip the indexing overhead.
Real-time streaming data. Embeddings are computed at index time. If your data changes faster than you can re-embed it, the vector index goes stale and retrieval quality degrades.
Tasks requiring ordered reasoning. Embedding retrieval is unordered by nature. For problems that require following a chain of logic — "trace the call from endpoint X to database write Y" — a call hierarchy traversal is more reliable.
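For the exact-match case in the list above, an AST query can be sketched with Python's standard `ast` module. The sample source and the `find_call_sites` helper are hypothetical, and the sketch only handles Python source; `getUserById` in a JavaScript or TypeScript codebase would need that language's own parser.

```python
import ast


def find_call_sites(source, function_name):
    """Return sorted line numbers where `function_name` is called."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # plain call: getUserById(...)
            name = func.id
        elif isinstance(func, ast.Attribute):  # method call: repo.getUserById(...)
            name = func.attr
        else:
            name = None
        if name == function_name:
            lines.append(node.lineno)
    return sorted(lines)


SAMPLE = '''
def handler(request):
    user = getUserById(request.user_id)
    return render(user)

def audit(repo, ids):
    return [repo.getUserById(i) for i in ids]

# getUserById mentioned in a comment is not a call
'''

call_lines = find_call_sites(SAMPLE, "getUserById")
```

Unlike an embedding lookup, this returns every call and nothing else: the comment that merely mentions the name is ignored, and no similarity threshold needs tuning.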
