RAG (Retrieval-Augmented Generation)


What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture in which an LLM's prompt is supplemented, immediately before generation, with documents retrieved from an external store. Instead of relying solely on weights baked in during training, the model reads freshly retrieved context and grounds its response in it. The minimal loop is: (1) embed the user's query, (2) retrieve the k nearest chunks from a vector store, (3) prepend those chunks to the prompt, (4) generate. Every step has cost and quality implications.
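The four-step loop can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and `generate` is a hypothetical placeholder for whatever LLM call you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, return the k most similar chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Step 3: prepend the retrieved chunks to the user's question."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Step 4: hypothetical placeholder for the actual model call."""
    return f"[model response to a prompt of {len(prompt)} characters]"
```

A real system would compute chunk embeddings once at indexing time rather than on every query, which is the main cost the vector store exists to avoid.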

Why RAG matters for AI coding agents in 2026

Coding agents are among the heaviest RAG users in production. Their "documents" are source files, README pages, API references, runbooks, and error logs: content that changes with every commit and that no model can have memorised accurately.

Context budget pressure. An agent working on a large monorepo cannot load the whole codebase into its context window. RAG is the gating mechanism: it decides which files get into the window at all. Retrieval quality directly determines how many of the model's tokens are spent on relevant code versus noise.

Freshness. Training data has a cut-off. Your internal API, your custom SDK, and your team's style guide are not in any public model's weights. RAG is the standard answer: keep a live vector index, retrieve on demand, stay current without fine-tuning.

Cost control. A naive agent might stuff every potentially related file into the prompt "just in case". RAG disciplines this: only the top-k chunks are included, capping the prompt size at a predictable ceiling. Pair that with a compaction step (stripping boilerplate from retrieved chunks before they enter the prompt) and you can cut per-query token spend substantially. See reducing AI coding agent token usage.
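As a rough sketch of the cost-control idea, the snippet below takes already-ranked chunks, compacts each one, and stops adding chunks once a token budget would be exceeded. The names `compact`, `estimate_tokens`, and `select_chunks` are hypothetical, "boilerplate" is assumed here to mean blank lines and full-line `#` comments, and the word-count token estimate is a crude stand-in for a real tokenizer.

```python
def compact(chunk: str) -> str:
    """Drop blank lines and full-line '#' comments (an assumed notion of boilerplate)."""
    kept = [
        line.rstrip()
        for line in chunk.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(kept)


def estimate_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words. A real tokenizer will differ."""
    return len(text.split())


def select_chunks(ranked_chunks: list[str], k: int, token_budget: int) -> list[str]:
    """Take up to k chunks in rank order, compacted, without exceeding the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:k]:
        body = compact(chunk)
        cost = estimate_tokens(body)
        if used + cost > token_budget:
            break  # the ceiling is hard: lower-ranked chunks are dropped
        selected.append(body)
        used += cost
    return selected
```

Stopping at the first chunk that does not fit keeps the rank order intact; skipping it and trying smaller lower-ranked chunks is an alternative that trades relevance for fuller use of the budget.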

The retrieval quality bottleneck

RAG's ceiling is set by retrieval, not generation. Even a powerful model, given irrelevant chunks, can hallucinate confidently. The three levers:
  1. Chunk size. Too large: dilutes the signal with padding. Too small: loses cross-line context. For source code, function-level chunks typically outperform line-level or file-level chunks.
  2. Embedding model. Embeddings are the retrieval backbone. A weak embedding model means semantically relevant code ranks below superficially similar but unhelpful matches.
  3. Re-ranking. A lightweight cross-encoder re-scores the top-50 candidates before the final top-5 enter the prompt. Adds a small latency and compute cost for a meaningful precision gain.
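For the first lever, function-level chunking of Python source can be sketched with the standard-library `ast` module. This is one possible approach under stated assumptions, not a prescribed method: `function_chunks` is a hypothetical helper, it only handles Python, and other languages would need their own parser.

```python
import ast


def function_chunks(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each function or method defined in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns the exact text the node spans
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks
```

Note that nested functions are emitted as their own chunks and also appear inside their parent's chunk, and that decorators are not part of the returned text. Whether that duplication helps or hurts retrieval depends on the codebase.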

RAG vs fine-tuning

Fine-tuning bakes knowledge into weights — expensive, slow to update, and opaque. RAG stores knowledge externally — cheap to update, auditable, and debuggable (you can log exactly which chunks were retrieved). For most agent use cases in 2026, RAG is the right default and fine-tuning is the optimisation you reach for after RAG plateaus.

When RAG is NOT the right tool

  • Tiny, stable knowledge bases. If your entire reference fits in one context window and rarely changes, skip the retrieval layer and load it directly. RAG adds latency and complexity for no gain.
  • Precise syntactic lookups. Finding all usages of a specific function signature is a grep/AST problem, not a retrieval problem. Semantic code search and RAG are complements to exact-match tooling, not replacements.
  • Latency-critical paths. Each RAG round-trip adds an embedding call plus a vector-store query before generation even starts. If your rate-limit budget or user-facing SLA is tight, a cached prompt may beat a fresh retrieval.
  • Highly structured queries. When the answer is deterministic — "what is the return type of function X?" — a static analysis tool gives a faster, cheaper, exact answer.
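To illustrate the exact-match cases in the list above, here is a small sketch that answers "where is this function called?" and "what is its declared return type?" directly from the syntax tree, with no embeddings involved. Both helpers (`find_calls`, `return_annotation`) are hypothetical names, they only cover Python, and the name match is purely syntactic: it does not resolve imports or aliases.

```python
import ast
from typing import Optional


def find_calls(source: str, func_name: str) -> list[int]:
    """Line numbers of calls to `func_name`, as a bare name or an attribute."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):
                name = target.id
            else:
                # ast.Attribute has .attr; other callee shapes yield None
                name = getattr(target, "attr", None)
            if name == func_name:
                lines.append(node.lineno)
    return sorted(lines)


def return_annotation(source: str, func_name: str) -> Optional[str]:
    """Declared return annotation of `func_name` as source text, or None."""
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_def and node.name == func_name:
            return ast.unparse(node.returns) if node.returns is not None else None
    return None
```

Unlike a semantic search, a string literal that merely mentions the function name is not reported, because only `ast.Call` nodes are inspected.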
