How to Reduce Claude Code Token Usage

Claude Code burns tokens on eager file reads, unfiltered tool output, bloated MCP manifests and runaway transcripts. Here's how to cut each one without losing quality.

By Paul Irolla

Founder · AI & developer tools · Tokenade

Ph.D. in AI · builds token-optimization tooling for AI coding agents

9 min read

How do you reduce Claude Code's token usage?

You reduce Claude Code's token usage by targeting the four sources that actually dominate cost: eager whole-file reads, raw command output dumped straight into the transcript, the MCP tool manifest re-sent on every turn, and a growing conversation that re-bills every past read. Tighten each source and the cost drops sharply with no loss in output quality, because you're trimming noise, not signal. This is the Claude Code–specific companion to "How to reduce AI coding agent token usage". If you want the agent-agnostic picture first, start there. Here we map those same levers onto Claude Code's concrete behaviour: how it reads files, which commands it tends to run, how the transcript grows, and what the /compact command actually does.

Why does Claude Code burn so many tokens?

Claude Code burns tokens primarily because it reads more than it needs and then pays for that content again on every subsequent turn. The root cause is architectural: on each step of an agentic loop, the entire conversation history plus every tool result is sent back to the model. Any oversized read you incur early is therefore billed again on every turn that follows. Four specific patterns drive most of the bill:
  • Eager whole-file reads. When asked to "understand" a module, Claude Code defaults to reading the full file. A 500-line TypeScript file is roughly 5,000–7,000 tokens, and Claude Code may read a dozen files before it writes a single line of code. If the context window carries those reads for 20 turns, you've paid for each of them 20 times.
  • Unfiltered tool output. Claude Code runs shell commands (git status, npm test, tsc --noEmit, build tools) and the raw output flows straight into the transcript. A failed npm test run can emit 15,000 tokens of stack traces, passing-test checkmarks and timing tables. Claude Code needed about 50 tokens of it: the failing test name and the broken assertion. Everything else is cargo.
  • The MCP manifest. Every MCP server you've connected advertises its tool definitions on every turn, whether or not any tool is used. Five servers with ten tools each is a sizable constant overhead that accumulates silently over the whole session. See MCP for how the protocol works and Best MCP servers for Claude Code for which servers are worth keeping.
  • An unbounded transcript. Claude Code sessions that sprawl across several tasks or revisit the same files repeatedly can build up transcripts that cost more to re-read than the original work. Each new turn pays for every prior turn's context.
Understanding which pattern dominates your workflow is step one. Check your session's token usage (for example with Claude Code's /cost command) or use a usage dashboard; the answer usually points immediately at the right lever.
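To make the re-billing arithmetic concrete, here is a minimal Python sketch. It assumes the simplest possible model (every turn re-sends the whole transcript, with no caching or compaction), and the session figures are hypothetical values picked from the ranges mentioned above.

```python
def cumulative_input_tokens(read_tokens: int, turns_carried: int) -> int:
    """Input tokens billed for one read that stays in the context.

    Simplifying assumption: the read is re-sent, and billed at the full
    rate, on every turn it remains in the transcript. Prompt caching and
    compaction are ignored.
    """
    return read_tokens * turns_carried


# Hypothetical session: twelve file reads of ~6,000 tokens each,
# each carried in the transcript for 20 turns.
per_file = cumulative_input_tokens(6_000, 20)
session_total = 12 * per_file

print(per_file)       # 120000
print(session_total)  # 1440000
```

Under these assumptions a single 6,000-token read ends up costing 120,000 input tokens, which is why trimming early reads pays off more than trimming late ones.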

What are the biggest levers for Claude Code specifically?

The biggest levers are file retrieval, output filtering and MCP pruning — roughly in that order for most Claude Code sessions.

Make Claude Code search instead of read

The default pattern, "read src/auth/login.ts", hands Claude a whole file when it needed one function. The better pattern is to direct Claude Code toward the specific symbol: "find the function that validates the JWT and show me its signature." When retrieval works well, Claude Code might read on the order of 2,000 tokens instead of 20,000. In practice this means:
  • Ask Claude Code to locate the relevant symbol first, then read that function body
  • Avoid phrases like "read the repo", "check all the files in X", "go through the codebase" — these are open-ended file-read instructions
  • When starting a new task in a large repo, describe the behaviour you want to change ("where do we handle login errors") rather than the files you think are involved
Semantic retrieval — finding code by meaning rather than file name — is where tooling helps most. Tokenade installs into Claude Code in one command and replaces broad file reads with ranked symbol lookups, so Claude Code reads the three relevant chunks instead of every file in the directory.
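The "locate first, then read a small window" idea can be sketched in a few lines of Python. This is a plain regex search, not semantic retrieval and not how Claude Code or Tokenade work internally; the function name `locate_symbol`, the sample file and the `validateJwt` identifier are all invented for illustration.

```python
import re


def locate_symbol(source: str, pattern: str, context: int = 2) -> list[str]:
    """Return numbered snippets around each line that matches `pattern`.

    Only a small window around every hit is returned, so the caller
    never has to load the whole file into the transcript.
    """
    lines = source.splitlines()
    regex = re.compile(pattern)
    snippets = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            snippets.append(
                "\n".join(f"{n + 1}: {lines[n]}" for n in range(start, end))
            )
    return snippets


# Hypothetical 103-line file with one relevant function in the middle.
SAMPLE = "\n".join(
    ["// unrelated line"] * 50
    + [
        "export function validateJwt(token: string): boolean {",
        "  return token.split('.').length === 3;",
        "}",
    ]
    + ["// unrelated line"] * 50
)

hits = locate_symbol(SAMPLE, r"function validateJwt", context=1)
print(hits[0])
```

Here the caller sees three lines out of 103. A semantic retriever ranks by meaning rather than by a literal pattern, but the payoff is the same shape: a small, targeted window instead of the full file.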

Filter command output before it enters the transcript

This is often the highest-yield single change for projects with active test suites or build pipelines. The goal is to intercept tool output between the shell and the transcript, and hand Claude Code only what it needs. Concretely:
  • Prefer terse command flags: git status --porcelain over bare git status, jest --silent to suppress console output printed by the tests themselves, tsc --noEmit 2>&1 | grep error to surface only type errors
  • For build systems, capture only the error lines rather than the full log
  • For test runners, surface only failing tests with their assertion, not the full runner output
Output filtering applied systematically to build and test commands can cut token usage by an order of magnitude on failure-heavy sessions, because those commands tend to produce the most verbose output and get run the most often.
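As a rough illustration of filtering between the shell and the transcript, the Python sketch below keeps only the lines that look like failures. The log format and the `KEEP` patterns are assumptions made up for this example, not the exact output of any real test runner, so the patterns would need tuning per tool.

```python
import re

# Assumed markers for lines worth keeping; adjust per tool.
KEEP = re.compile(r"^FAIL\b|error TS\d+|^Expected:|^Received:")


def filter_output(raw: str, max_lines: int = 40) -> str:
    """Keep only lines that match a failure marker, capped at max_lines."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if KEEP.search(stripped):
            kept.append(stripped)
    return "\n".join(kept[:max_lines])


# Hypothetical runner output, invented for the example.
RAW = """\
PASS src/util/date.test.ts
  ok formats ISO dates (3 ms)
  ok parses offsets (1 ms)
FAIL src/auth/login.test.ts
  not ok rejects expired tokens (12 ms)
    Expected: 401
    Received: 200
Test Suites: 1 failed, 1 passed, 2 total
Time:        4.213 s
"""

print(filter_output(RAW))
```

The cap on `max_lines` is a deliberate design choice: even a filtered log can explode when hundreds of tests fail at once, and the first few failures are usually enough to act on.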

Prune the MCP manifest

Every MCP server connected to Claude Code adds its tool definitions to every turn's context, whether or not you call any of its tools. If you've assembled a rich MCP setup over time — file systems, databases, external APIs — the manifest overhead can be substantial. Audit what you actually use per session and disable the rest temporarily. Tools whose backing binary or service isn't running are especially wasteful — their definitions are paid for but the tools can't succeed. Lazy loading (advertising tools only when relevant) cuts the per-turn manifest to the tools that matter for the current task.
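A back-of-the-envelope estimate shows why pruning matters. In this Python sketch the server names, the tool counts, the average of 150 tokens per tool definition and the 40-turn session are all assumptions chosen for illustration; real definitions vary widely in size, and prompt caching is ignored.

```python
def manifest_tokens_per_turn(tools_per_server: dict[str, int], tokens_per_tool: int) -> int:
    """Tokens the tool manifest adds to a single turn."""
    return sum(tools_per_server.values()) * tokens_per_tool


# Hypothetical setup: five servers with ten tools each.
servers = {"filesystem": 10, "postgres": 10, "github": 10, "browser": 10, "docs": 10}
used_this_session = {"filesystem", "github"}

TOKENS_PER_TOOL = 150  # assumed average size of one tool definition
TURNS = 40             # assumed session length

full = manifest_tokens_per_turn(servers, TOKENS_PER_TOOL) * TURNS

# Keep only the servers that are actually used in this session.
pruned_servers = {name: n for name, n in servers.items() if name in used_this_session}
pruned = manifest_tokens_per_turn(pruned_servers, TOKENS_PER_TOOL) * TURNS

print(full)    # 300000
print(pruned)  # 120000
```

With these made-up numbers, disabling the three unused servers removes 180,000 input tokens from the session without touching any tool that was actually called.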

Manage CLAUDE.md carefully

CLAUDE.md is re-read at the start of every session and can be injected into the context on each turn depending on your setup. A long, unstable CLAUDE.md is doubly costly: it consumes tokens and it defeats prompt caching, because cache hits require the prefix to be identical across turns. Keep CLAUDE.md to durable project conventions: build commands, code style rules, architecture decisions that don't change week to week. Strip transient notes, in-progress task lists and anything session-specific. A tight, stable CLAUDE.md stays cache-friendly, so requests after the first can reuse the cached prefix instead of paying full price for it.
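One way to keep CLAUDE.md stable is to lint it for lines that look transient. The Python sketch below is a heuristic under stated assumptions: the `TRANSIENT` patterns (TODO/WIP/FIXME markers, Markdown checkboxes, ISO dates) are guesses at what "transient" looks like, and the sample file is invented.

```python
import re

# Assumed signals that a line is a transient note rather than a durable rule.
TRANSIENT = re.compile(
    r"\b(TODO|WIP|FIXME)\b"      # work-in-progress markers
    r"|^\s*[-*] \[[ xX]\]"       # Markdown task-list checkboxes
    r"|\b\d{4}-\d{2}-\d{2}\b"    # ISO dates, typical of dated notes
)


def transient_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look session-specific."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if TRANSIENT.search(line)
    ]


# Hypothetical CLAUDE.md contents.
SAMPLE = """\
# Project conventions
- Build with `npm run build`
- Use strict TypeScript, no `any`
- [ ] finish login refactor
TODO: ask about the flaky test
Notes from 2024-05-01 standup
"""

for number, line in transient_lines(SAMPLE):
    print(f"{number}: {line}")
```

Lines 4 to 6 of the sample are flagged and the three durable lines pass. A check like this could run in a pre-commit hook, though which patterns count as transient is a per-project decision.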

Use /compact and session hygiene

Claude Code's /compact command compresses the conversation history, replacing earlier turns with a summary. It's most effective when you run it between distinct subtasks rather than letting the full transcript accumulate indefinitely. Good session hygiene for Claude Code:
  • Start a new session for unrelated tasks. Continuing a long session into a new task carries all prior context even when it's irrelevant.
  • Run /compact at natural breakpoints — after a feature is done, after a debugging session wraps up, before starting a refactor.
  • Don't paste large files or long logs into the chat directly. Claude Code has file tools; use them selectively rather than front-loading context.
  • Prefer narrow task descriptions. "Fix the failing test in auth/login.test.ts" produces far tighter retrieval than "improve the auth system" — and tighter retrieval means a shorter loop.

How to apply this today

  1. Stop opening whole files. Direct Claude Code to the function or symbol you need, not the file that contains it.
  2. Add terse flags to the commands Claude Code runs most: --porcelain, --quiet, --silent, 2>&1 | grep error.
  3. Audit your MCP servers. Disable any you haven't used this week; re-enable when needed.
  4. Trim CLAUDE.md to rules that won't change for months. Delete transient notes.
  5. Run /compact between subtasks, not just when the context warning appears.
  6. Start fresh sessions for tasks unrelated to what you just worked on.
If wiring all of this up manually sounds like a second job, Tokenade closes the gap: it applies semantic retrieval, output compression, structure-first reads and lazy MCP loading automatically inside Claude Code, with zero per-prompt configuration. In Tokenade's own benchmark, the stacked effect reached up to ~88% fewer tokens on a balanced session mix.

What goes wrong (anti-patterns)

  • "Read the whole project first." This feels like thoroughness; it's actually a token grenade. Claude Code front-loads tens of thousands of tokens it will largely ignore, and pays for each of them again on every subsequent turn.
  • Leaving raw logs in the chat. Pasting a failed build verbatim, or letting Claude Code dump it unfiltered, can cost more tokens than the fix. Filter first, then let Claude Code see the error line.
  • Treating output optimisation as the fix. Asking Claude Code to "be brief" trims only its output. On Claude's pricing, output tokens cost roughly 5× input tokens each, but output volume is small next to the input that is re-sent on every turn, so input dominates the bill.
  • One endless session for everything. Unbounded conversation history re-bills every early read on every new turn. Compacting or restarting between tasks is not a workaround; it's correct session management.
  • Ignoring the MCP manifest. Adding MCP servers is easy; the cost of each one accumulates invisibly. Every tool definition is paid for on every turn whether or not you need it.
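The input-versus-output point is easy to check with arithmetic. This Python sketch uses relative prices only (input = 1 unit per token, output = 5 units, following the "roughly 5×" ratio above); the token volumes are hypothetical.

```python
INPUT_PRICE = 1.0   # relative units per input token
OUTPUT_PRICE = 5.0  # "roughly 5x input", per the ratio quoted above


def cost_split(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in relative units."""
    return input_tokens * INPUT_PRICE, output_tokens * OUTPUT_PRICE


# Hypothetical session: 2,000,000 input tokens re-sent across turns,
# 40,000 output tokens actually generated.
input_cost, output_cost = cost_split(2_000_000, 40_000)
output_share = output_cost / (input_cost + output_cost)

print(input_cost)              # 2000000.0
print(output_cost)             # 200000.0
print(round(output_share, 3))  # 0.091
```

In this scenario output is about 9% of the bill despite the higher per-token price, so halving output saves under 5% of the total while halving input saves about 45%.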

Frequently asked questions

Does cutting tokens make Claude Code's answers worse?

No. When done correctly, it makes them better. The levers described here remove low-value context: boilerplate command output, irrelevant file content, unused tool definitions. They don't remove the signal Claude Code reasons over. Because models tend to attend less reliably to information buried in a bloated context window, removing noise raises the signal-to-noise ratio. The only way to hurt quality is to compress away something load-bearing, which is why structure-first reads keep every function signature, and output filters keep the actual error.

What does /compact actually do?

/compact replaces the earlier turns in Claude Code's conversation history with a compressed summary, so the context window is reclaimed. It preserves task-relevant facts (what was built, what errors were seen, decisions made) and drops the verbatim tool outputs and file reads that generated them. Running it proactively between subtasks — rather than only when the context warning fires — means you start each subtask with a leaner, cheaper baseline.

How much should I expect to save?

It depends on your session mix. Sessions heavy on file navigation and build/test cycles — the common pattern for greenfield features or debugging — have the most to gain, because those are the token-heavy activities. Sessions that are mostly the model writing new code from a clear spec have less headroom, since output is already the irreducible part. The honest approach: measure your baseline with a usage counter, apply the levers that match your bottleneck, then compare.

Does CLAUDE.md affect prompt caching?

Yes, directly. Prompt caching on Claude requires the prefix to be identical across turns to get a cache hit; cached tokens are then billed at roughly 10% of the normal input price. A CLAUDE.md that changes frequently, because it contains in-progress notes or task-specific reminders, defeats the cache and gets re-billed at full price every session. Keeping it stable and concise is both a direct token reduction and a cache-hit multiplier. See prompt caching for the mechanics.
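A small Python sketch shows how the roughly 10% cache-read price changes the cost of a stable prefix. It is a simplification under stated assumptions: the first request pays the full price, every later request hits the cache, the cache never expires mid-session, and any premium charged for writing to the cache is ignored. The 3,000-token CLAUDE.md and the 30 requests are hypothetical.

```python
def prefix_cost(prefix_tokens: int, requests: int, stable: bool,
                cache_read_percent: int = 10) -> float:
    """Cost of re-sending a prefix, in units of one full-price input token.

    stable=False models a prefix that changes before every request, so
    it never hits the cache. stable=True assumes every request after
    the first is a cache hit billed at cache_read_percent of full price.
    """
    if not stable:
        return float(prefix_tokens * requests)
    return prefix_tokens * (100 + (requests - 1) * cache_read_percent) / 100


# Hypothetical 3,000-token CLAUDE.md sent with 30 requests.
unstable = prefix_cost(3_000, 30, stable=False)
cached = prefix_cost(3_000, 30, stable=True)

print(unstable)  # 90000.0
print(cached)    # 11700.0
```

Under these assumptions the stable file costs about 13% of the unstable one. Real savings depend on cache lifetime and write pricing, which this sketch deliberately leaves out.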

Do these techniques work with tools like Cursor or Windsurf too?

Yes. The mechanics are agent-agnostic — every token-billed agent that re-reads its transcript each turn benefits from the same levers. The broader guide covers the cross-agent picture, and context engineering goes deeper on the discipline behind these patterns.
