Agentic Coding: What It Is and Its Real Cost

Agentic coding is when an AI agent plans and executes multi-step coding tasks on its own. That autonomy is powerful — and it's why token costs can spiral fast.


By Paul Irolla

Founder · AI & developer tools · Tokenade

Ph.D. in AI · builds token-optimization tooling for AI coding agents

11 min read

What is agentic coding?

Agentic coding is when an AI assistant doesn't just answer a question: it autonomously plans and carries out a multi-step coding task, reading files, running commands, editing code, observing results, and repeating until the goal is met. You give it an objective; the agent drives the rest. Tools like Claude Code, Cursor's agent mode, OpenAI Codex, Cline, Aider, Windsurf, Kilo Code, OpenCode, Roo Code and Hermes all work this way.

The word "agentic" signals one specific thing: the model has access to tools and loops on its own outputs. A plain chat doesn't loop. Autocomplete predicts the next tokens once. An agent acts, checks, adjusts, and acts again, potentially dozens of times for a single request. That distinction matters for cost. Every action in the loop has a price, and the prices stack.

How is agentic coding different from autocomplete or chat?

The difference is the presence of a tool-using loop where the model controls what it reads next.

| Mode | What the model does | Who controls context |
| --- | --- | --- |
| Autocomplete | Predicts the next few lines | You (implicit) |
| Chat | Produces one response per prompt | You (explicit) |
| Agentic | Plans → acts → observes → adapts, in a loop | The agent |

In chat, you decide what goes in the prompt. In agentic mode, the agent decides which files to read, which commands to run, which tools to call — and it makes those decisions turn after turn, accumulating context as it goes. That's what makes agentic coding qualitatively different from a smarter autocomplete: the model is in the driver's seat for context retrieval, not just code generation. This is also why agentic tools feel so much more capable. An agent can investigate a failing test, trace the call stack, open the relevant modules, run a targeted fix, re-run the tests, and confirm — all without you lifting a finger beyond the first request. The catch is that every one of those steps has a token cost.

What does the agent loop actually look like?

The canonical loop has four phases that repeat until the task is done or the agent gives up:
  1. Plan. Given the current context (transcript, previous observations, available tools), the model decides what to do next — "I should read auth/session.ts to understand the token format."
  2. Act. The model calls a tool: reads a file, runs a shell command, searches the codebase, edits a function, calls an MCP server.
  3. Observe. The tool's output — the file content, the command result, the search hits — is appended to the context and the model reads it.
  4. Repeat. The model evaluates whether the goal is met. If not, it plans the next step.
The loop is elegant and it works. The problem is step 3: every observation becomes input for the next turn. And the turn after that. And every turn until the session ends. Nothing leaves the context window on its own.
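
The four phases can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent is implemented: `Session`, `run_agent`, `scripted_plan` and the stub tools are hypothetical names, and a scripted planner stands in for the model. The point it shows is that every observation is appended to the context and nothing is ever removed.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical container for the goal and the growing transcript."""
    goal: str
    context: list[str] = field(default_factory=list)


def run_agent(session, plan, tools, is_done, max_turns=20):
    """Plan -> act -> observe -> repeat. Returns the number of turns used."""
    for turn in range(1, max_turns + 1):
        tool_name, arg = plan(session)        # 1. Plan: choose the next tool call
        observation = tools[tool_name](arg)   # 2. Act: run the tool
        session.context.append(observation)   # 3. Observe: appended, never removed
        if is_done(session):                  # 4. Repeat, or stop when the goal is met
            return turn
    return max_turns


def scripted_plan(session):
    """Stand-in for the model: alternates between reading a file and running tests."""
    steps = [("read_file", "auth/session.ts"), ("run", "npm test")]
    return steps[len(session.context) % len(steps)]


# Stub tools that return placeholder text instead of touching the file system.
tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run": lambda cmd: f"<output of {cmd}>",
}

session = Session(goal="fix the failing test")
turns = run_agent(session, scripted_plan, tools, lambda s: len(s.context) >= 4)
print(turns, len(session.context))
```

In a real agent, the whole of `session.context` would be sent to the model inside `plan` on every turn, which is where the cost discussed next comes from.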

Why does agentic coding burn so many tokens?

Agentic coding burns tokens because the loop re-reads an accumulating context on every turn, and that cost compounds across every step in the session. Consider a realistic scenario. You ask the agent to fix a bug. It:
  1. Reads the project structure (~3,000 tokens of directory listing)
  2. Opens three candidate files (~8,000 tokens of source code)
  3. Runs the failing test (~4,000 tokens of output)
  4. Reads two more files after tracing the stack (~6,000 tokens)
  5. Makes the edit (a few hundred tokens)
  6. Runs the tests again (~4,000 tokens of output)
By step 6, the context window contains roughly 25,000 tokens of accumulated material, and all of it is re-sent to the model as input to produce step 6's tiny response. The agent doesn't forget; it re-reads. Now imagine the bug fix takes 12 steps instead of 6, and one of those steps triggers a verbose npm install or a Terraform plan with 200 resource lines. It compounds fast.

The expensive part is rarely the code the agent writes. Output tokens cost more each: on Claude Sonnet in 2026, input runs at about $3/M tokens and output at about $15/M, so output is priced at roughly five times input. The volume, though, goes the other way: a long agentic session might generate 500 output tokens per turn while re-reading 15,000 input tokens. At those rates that is about $0.045 of input against $0.0075 of output per turn, which makes input the dominant cost, and most of that input is the agent re-ingesting context it already processed. For a concrete look at what those numbers add up to across popular tools, see AI coding agent token costs.
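
The arithmetic behind the six-step scenario can be checked with a short script. It is a rough sketch under stated assumptions: 300 tokens stands in for "a few hundred", each turn is assumed to re-send everything accumulated so far, and the prices are the approximate per-million figures quoted above. `cost_usd` is a helper name introduced here for illustration.

```python
# Approximate observation sizes from the six-step scenario (tokens).
step_tokens = [3_000, 8_000, 4_000, 6_000, 300, 4_000]  # 300 = "a few hundred"

# Input re-read on each turn = everything accumulated up to that step.
context_per_turn = []
running = 0
for tokens in step_tokens:
    running += tokens
    context_per_turn.append(running)

unique_tokens = running              # material ingested once
total_input = sum(context_per_turn)  # material billed across all six turns

# Approximate prices quoted in the text (USD per million tokens).
INPUT_USD_PER_M = 3.0
OUTPUT_USD_PER_M = 15.0


def cost_usd(tokens, usd_per_million):
    """Cost in USD for a token count at a per-million price."""
    return tokens * usd_per_million / 1_000_000


# One turn of a long session: 15,000 tokens re-read, 500 tokens generated.
input_cost = cost_usd(15_000, INPUT_USD_PER_M)
output_cost = cost_usd(500, OUTPUT_USD_PER_M)

print(context_per_turn)
print(unique_tokens, total_input)             # 25300 96600
print(f"{input_cost:.4f} {output_cost:.4f}")  # 0.0450 0.0075
```

Under these assumptions, about 25,000 tokens of unique material are billed as roughly 97,000 input tokens over six turns, and a single late turn costs about six times more in input than in output despite the higher output price.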

What drives the cost up — the five real culprits

Understanding where the tokens go is the first step to controlling them. Five patterns account for the majority of a typical session's bill:
  1. Whole-file reads for one-function tasks. The agent opens auth/session.ts because it needs validateToken. The file is 450 lines, roughly 5,000 tokens. The agent needed maybe 20 lines. It read 450. And it re-reads all 450 on every subsequent turn.
  2. Unfiltered command output. npm test emits 200 lines of checkmarks, timing rows and a stack trace at the bottom. The stack trace is the signal. The 190 decorative lines are deadweight that re-enter the loop verbatim. A build log or a kubectl get pods dump can be tens of thousands of tokens, most of it boilerplate the agent ignores but still pays for.
  3. Exploratory directory reads. When the agent doesn't know where something lives, it explores. It lists directories, scans file trees, opens related modules "just in case". Done naively, orienting the agent in a medium-sized codebase can consume more tokens than the actual work.
  4. A bloated tool manifest. Agents that speak MCP (Model Context Protocol) load tool definitions into the context and re-send that manifest on every single turn. An agent connected to ten MCP servers may be prepending a thousand tokens of tool schemas to every request, even if it calls none of those tools in that turn.
  5. Long sessions with no compaction. An agent that has been running for 50 turns is paying, every single turn, for the context accumulated in turns 1–49. The earlier reads don't shrink. The transcript grows. Turn 50 re-reads everything from the beginning.
Each culprit maps to a lever. A good context engineering practice addresses all five.
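
To see how the manifest and the growing transcript combine over a long session, here is a simple model. It is a sketch, not a measurement: the 1,000-token manifest comes from the example above, while the 2,000 tokens of transcript growth per turn is an assumed average, and `turn_input` and `session_input` are hypothetical helper names.

```python
def turn_input(turn: int, manifest: int, growth: int) -> int:
    """Input tokens sent on a given turn (1-indexed): manifest plus transcript so far."""
    return manifest + growth * turn


def session_input(turns: int, manifest: int, growth: int) -> int:
    """Total input tokens billed across the whole session."""
    return sum(turn_input(t, manifest, growth) for t in range(1, turns + 1))


MANIFEST = 1_000  # tool schemas re-sent on every turn (figure from the text)
GROWTH = 2_000    # assumed average tokens added to the transcript per turn

print(turn_input(1, MANIFEST, GROWTH))       # 3000
print(turn_input(50, MANIFEST, GROWTH))      # 101000
print(session_input(50, MANIFEST, GROWTH))   # 2600000
print(session_input(100, MANIFEST, GROWTH))  # 10200000
```

In this model, doubling the session from 50 to 100 turns roughly quadruples the total input, because each new turn re-reads everything before it. The fixed manifest adds a constant amount per turn, while the transcript term grows with the square of the session length.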

Which agentic coding tools are available?

The field has grown quickly. As of mid-2026, the notable tools fit into a few categories:

Terminal-first agents sit in your shell and can read, edit, run and query your whole project. Claude Code (Anthropic) and Aider (open-source) are the most-used examples. They're powerful but have the most open-ended context access, which makes unguided sessions expensive.

IDE-embedded agents live inside your editor and have tight integration with the file tree and language server. Cursor and Windsurf are the most popular commercial options; Kilo Code and Roo Code are open-source extensions for VS Code. GitHub Copilot's agent mode runs in both VS Code and JetBrains.

Cloud agents run in an isolated environment with their own compute. OpenAI Codex (the cloud version) and OpenCode operate this way: you delegate a task and they return a diff.

Framework-level agents like Cline and Hermes are designed to be embedded inside larger workflows or AI systems, often exposing MCP endpoints themselves.

For a deeper comparison of how these tools differ on capabilities and economics, see Best AI coding tools. The short version: the tool you choose matters less for cost than how you configure its context access.

How do you keep agentic coding affordable?

The five culprits above have five matching answers:
  1. Replace whole-file reads with semantic retrieval. Instead of opening files to find relevant code, use a search that returns the specific functions or classes by meaning. A semantic search for "validate the login token" returns the three relevant chunks, not the three files around them. A task that would have read 25,000 tokens of files reads 2,000. This is the single biggest lever for navigation-heavy sessions.
  2. Filter command output before it enters the loop. Stripping progress bars, decorative tables and repeated warnings from command output loses nothing the model needs: it sees the result, not the raw firehose. Terse flags help too, such as git status --porcelain, kubectl get pods -o name, or a minimal test reporter (for example npm test -- --reporter=min when the runner is Mocha; note the -- that npm needs to forward flags to the script). Per-format filtering can cut a typical build log by 90% with no loss of meaning.
  3. Read structure before content. Give the agent function signatures and exports rather than full file bodies. A skeleton view of a module preserves every public surface the agent reasons about while dropping the implementation it doesn't need yet. The agent can always request a specific function body once it knows where to look. This is one half of context compression.
  4. Load tools lazily. Hide MCP servers whose underlying binary isn't installed. Suppress tool definitions until the agent actually needs them. Every tool you remove from the manifest is a tax removed from every turn. If you've connected a dozen integrations over time, auditing and pruning them is free savings.
  5. Compact or restart long sessions. Either use the agent's built-in compaction (Claude Code has /compact) or restart with a summary instead of the full transcript when the session grows long. The earlier turns carry the most stale context and the least value.
The full breakdown of these levers, with numbers, is in How to reduce AI coding agent token usage.
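
As an illustration of "read structure before content", the sketch below builds a skeleton view of a Python module with the standard-library `ast` module. It is one possible approach under stated assumptions, not how any specific tool does it: the article's example file is TypeScript, whereas this sketch handles Python source only, and `skeleton`, `_signature` and `SAMPLE` are names made up for the example.

```python
import ast


def _signature(node, indent=""):
    """Render one function definition as a single line with its body elided."""
    prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    return f"{indent}{prefix} {node.name}({ast.unparse(node.args)}): ..."


def skeleton(source: str) -> str:
    """Return top-level functions, classes and their methods without bodies."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
            lines.extend(_signature(m, "    ") for m in node.body if isinstance(m, funcs))
    return "\n".join(lines)


# Hypothetical module used only to demonstrate the output.
SAMPLE = '''
def validate_token(token, secret):
    payload = decode(token, secret)
    return payload["exp"] > now()

class SessionStore:
    def get(self, session_id):
        return self._data[session_id]
'''

print(skeleton(SAMPLE))
```

The skeleton keeps the names and parameters an agent needs to decide where to look next and drops the bodies. `ast.unparse` requires Python 3.9 or later.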
If you'd rather not wire up each lever manually, Tokenade applies all of them automatically — semantic search, output compression, skeleton reads, lazy MCP loading — in one command, across the agents you already use. Tokenade's own benchmark shows up to ~88% fewer tokens on a balanced session mix with no quality loss. It's free up to roughly 20M tokens (no card required), then $9.90/mo.

What goes wrong — the anti-patterns

Even knowing the theory, a few patterns show up repeatedly in expensive sessions:

Vague, open-ended objectives. "Improve the auth system" is a blank cheque for exploration. The agent reads broadly because the goal is broad. Narrow the scope: "The JWT refresh fails when the user's timezone is UTC+14; trace it in auth/refresh.ts." Scope is the one lever that costs nothing and multiplies every other lever below it.

Letting raw output accumulate. A failed build or a test run dumped verbatim into the context is paid for on every subsequent turn. One verbose log in a 20-turn session means that log is read 20 times. Filter it before it lands, not after.

Never restarting or compacting. A marathon session with 80 turns is re-paying its entire history on every new request. The turns from 20 minutes ago rarely carry more signal than a one-paragraph summary. Restart more aggressively.

Accumulating MCP integrations. Every tool you add to an MCP server is advertised to the agent on every turn, whether or not it's ever used. Adding tools feels harmless; the per-turn cost is invisible. Audit periodically.

Trying to fix cost by asking the model to be brief. "Be concise" trims the output, which is the cheaper side. The token volume lives in the input. Asking for briefer answers while leaving a bloated context unchanged is optimising the wrong direction.
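
Filtering output "before it lands" can be as simple as a small function between the command and the context. The following is a minimal sketch with assumed noise patterns: the regular expressions, the `filter_output` name and the sample log are all hypothetical, and real tools will need patterns tuned to their own output. Dropping exact duplicate lines is a simplification that could occasionally discard a meaningful repeat.

```python
import re

# Hypothetical noise patterns; real tools emit different decorations.
NOISE = [
    re.compile(r"^\s*$"),              # blank lines
    re.compile(r"^\s*[✓✔]"),           # passing-test checkmarks
    re.compile(r"^\s*\[[#=\-> ]+\]"),  # progress bars such as [#####     ]
]


def filter_output(raw: str, max_lines: int = 50) -> str:
    """Drop decorative and repeated lines, then keep only the tail of the output."""
    kept = []
    seen = set()
    for line in raw.splitlines():
        if any(pattern.search(line) for pattern in NOISE):
            continue
        if line in seen:  # exact repeats, e.g. the same warning printed many times
            continue
        seen.add(line)
        kept.append(line)
    # Assumption: failures and stack traces tend to appear at the end of the output.
    return "\n".join(kept[-max_lines:])


# Made-up test output for demonstration.
RAW = """\
  ✓ logs in with valid token
  ✓ rejects expired token
[#####     ] 50%
warning: deprecated API
warning: deprecated API

  1) refresh fails at UTC+14
     AssertionError: expected 200, got 401
"""

print(filter_output(RAW))
```

On the sample above, eight lines of raw output shrink to three: the warning once, the failing test name and the assertion error.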

Frequently asked questions

Does agentic coding always cost more than regular chat?

For any task with more than two or three steps, yes — the loop multiplies context across turns in a way a single chat prompt cannot. For simple, self-contained questions, an agent loop adds overhead for no benefit. The right tool for one-turn questions is one turn. Agentic mode earns its keep on multi-step tasks where the alternative is manual copy-paste coordination across many prompts.

How many tokens does a typical agentic session use?

It varies wildly. A short bug-fix session on a clean codebase might finish in 10,000–30,000 tokens. A longer feature session with exploratory reads, test runs and back-and-forth on a large project can easily reach 300,000–500,000 tokens — sometimes more. The main driver is how much file and command content the agent ingests, which depends on codebase size, task scope and how aggressively it explores.

Is there a way to see how many tokens each agent step costs?

Yes. Claude Code shows per-turn token counts in its output. Cursor and Codex surface usage in their dashboards. Tokenade includes a real-time dashboard that breaks down token spending by type (file reads, tool output, model response) so you can see exactly where the budget goes and which lever to pull first.

Do these cost patterns apply to hosted agents (Codex cloud, etc.) the same way?

The same loop mechanics apply, but the billing surface is different. Cloud agents may charge per task or per diff rather than per token, which can obscure the underlying cost. The strategies still matter: a cloud agent that finishes in fewer steps with less context will complete faster and, on per-task pricing, often finish within a smaller usage tier.

Why does the agent read files I didn't ask it to?

Because the task context is ambiguous enough that the agent is hedging. When an objective has multiple plausible paths, the agent reads broadly to reduce its own uncertainty before committing to an approach. This is rational behaviour given its information — but it's expensive. Providing more upfront context (the relevant file, the specific function, the exact error message) reduces that uncertainty and the read that follows it.
