Cut Claude Code Costs, Keep the Model

How do you cut Claude Code costs without switching to a weaker model?

You cut Claude Code costs by removing the tokens the model never needed in the first place, not by downgrading the model that reads them. The instinct when the bill stings is to swap Opus 4.8 for Sonnet or Haiku and accept worse code. That's the wrong lever. The bill is dominated by how much context you push through the model on every turn, and most of that context is noise: whole files read to use one function, raw test logs, MCP manifests re-sent on every step. Trim the noise and you keep Opus's reasoning at a fraction of the cost. I build token tooling for a living, so I'll be blunt about the trade most people don't see. Downgrading the model is a quality cut you feel on every task. Cutting wasted tokens is a cost cut you don't feel at all — same prompts, same model, same output, smaller invoice. One of these is free money and people keep reaching for the other one. This is the cost-focused companion to How to reduce Claude Code token usage. That piece covers the mechanics; this one is about the money: where the dollars actually go, and why model choice is the last knob you should touch, not the first. If you're still at the "what does this thing cost in the first place" stage, Claude Code pricing has every plan and API rate in one table.

Why does downgrading the model save less than you think?

Downgrading saves on the per-token rate but does nothing about the token count — and the count is the part you can cut by 50–90% without a quality hit. Here are the published Anthropic rates, per million tokens (MTok):

Model	Input / MTok	Output / MTok
Claude Opus 4.8	$5	$25
Claude Sonnet 5	$2	$10
Claude Haiku 4.5	$1	$5

(For reference, GPT-5.5 sits at $5 input / $30 output per MTok — see LLM API token pricing for the wider table.) Going Opus → Sonnet cuts the input rate from $5 to $3. Real, but modest, and you pay for it in reasoning depth on the gnarly refactors that are exactly when you want the strong model. Meanwhile the token count is where the leverage hides, because of one architectural fact: on every step of an agentic loop, Claude Code re-reads the entire transcript. An oversized file you read on turn 2 is re-billed on turns 3 through 20. A 6,000-token file read carried across 15 turns isn't 6,000 tokens — it's closer to 90,000 input tokens. Trimming that one read beats any rate change you could make. There's a second multiplier most cost discussions skip: prompt caching. Cached input reads bill at roughly 10% of the normal input rate — so on Opus, cached context is about $0.50/MTok instead of $5. That's a 90% discount on the stable part of your context. But caching only helps the part of the prompt that doesn't churn. If your transcript is full of one-off file dumps that never repeat, there's nothing stable to cache. Cutting noise and caching reinforce each other: less junk in the window means a higher share of what remains is cacheable.

Where does the Claude Code bill actually go?

The bill is dominated by re-read context, not by the tokens of the answer you actually wanted. Four sources do most of the damage: Eager whole-file reads. Ask Claude Code to "understand the auth module" and it reads the whole file. A 500-line TypeScript file is roughly 5,000–7,000 tokens, and it'll read a dozen before writing a line. Each one is re-billed every subsequent turn. Raw command output. A failed npm test can dump 15,000 tokens of stack traces and green checkmarks into the transcript. The model needed about 50 tokens of it — the failing test name and the broken assertion. The rest is cargo, and it gets re-read on every following turn too. The MCP manifest. Every connected MCP server advertises its full tool schema on every turn, used or not. Five servers at ten tools each is a fixed tax on the whole session. See Best MCP servers for Claude Code for which ones earn their keep. Unbounded transcripts. A session that wanders across several tasks builds a transcript that can cost more to re-read than the original work did to produce. Notice none of these get cheaper when you downgrade the model. They get cheaper when you stop sending them.

A worked example

Let me put numbers on it, because "trim the noise" is easy to nod along to and hard to feel. Say a single feature task runs 18 turns on Opus 4.8. Along the way the agent reads eight files averaging 6,000 tokens (48,000 tokens of reads) and runs a handful of test and build commands that dump another 30,000 tokens of raw output. Both pools get re-read on the turns that follow them — call it an average of 10 surviving turns of re-billing each. That's roughly 780,000 input tokens of reads-and-logs carried through the session. At Opus's $5/MTok, the waste alone is about $3.90 on a single task — and that's before the actual reasoning and output. Now apply the cuts: semantic search drops the eight whole-file reads to targeted chunks (say 8,000 tokens instead of 48,000), and output filtering drops the 30,000 tokens of logs to maybe 1,500 tokens of error lines. Same model, same task, same answer — the re-billed pool shrinks by an order of magnitude. The rate never changed; the count did, and the count is what you control. Run that math across a month of real work and the gap between "downgraded to Haiku" and "kept Opus, cut the waste" inverts: the second option is usually both cheaper and better.

What should I cut first, in order?

Cut retrieval waste first, output noise second, MCP bloat third — that order returns the most dollars per minute of effort for a typical Claude Code session.

Make it search, not read. Replace "read src/auth/login.ts" with "find the function that validates the JWT and show its signature." Semantic code search finds code by meaning, so the agent pulls the three relevant chunks instead of every file in the directory. This is usually a 5–10x cut on retrieval tokens, and retrieval is the biggest line item in most sessions.
Filter output before it lands. Prefer git status --porcelain over bare git status, tsc --noEmit 2>&1 | grep error over the full type-check log, terse test flags over the full runner dump. On failure-heavy sessions this alone can be an order-of-magnitude cut.
Prune MCP. Disconnect servers you're not using this session. Lazy-loading tool schemas — sending them only when a tool is actually invoked — removes the per-turn manifest tax entirely.
Reset the transcript between tasks. When you switch tasks, start fresh rather than dragging an unrelated 80k-token history into the new work.

Do the first three and the model you run on top barely matters to the bill — which is the whole point. You get to keep Opus.

Where does Tokenade fit?

Tokenade automates exactly the cuts above so you don't have to police your own prompts. It installs into Claude Code (and Cursor, Codex, Copilot, Windsurf and the rest) and sits between the agent and your tools: it serves semantic code search instead of whole-file reads, applies output filtering to noisy commands, compresses large files to skeletons, and lazy-loads MCP tool schemas so the manifest tax disappears. A savings dashboard shows the tokens and dollars it actually saved, so this isn't a faith-based claim — you can watch the line drop. It's source-available under MIT, which matters to me: a tool that sits in your context pipeline shouldn't be a black box you can't audit. Free up to about 10M tokens/month; Pro is $24.90/mo excluding tax in the US (€19.90/mo TTC in France), unlimited machines. If you're weighing options, Best Claude Code token optimizers compares the field, Tokenade included. The honest pitch: keep the strong model, install the thing that stops feeding it garbage, and let the dashboard prove the savings. That beats degrading every task to shave the rate.

What goes wrong (anti-patterns)

The most common mistake is treating model choice as the cost dial when it's really the quality dial. Watch for these:

"We moved to Haiku to save money." You also moved to worse plans on hard tasks, and you left the 50–90% token waste fully intact. Cut the waste first; downgrade only if the bill is still too high afterward, and only on tasks where the weaker model genuinely suffices.
Optimizing output tokens. Output is the small number; on most agentic sessions input dwarfs it because of re-reads. Chasing shorter answers while ignoring bloated context is optimizing the rounding error.
Assuming caching will save you regardless. Caching only discounts stable context. A transcript full of one-off dumps has little to cache. Clean the context and the cache discount compounds.
Pruning so hard the agent goes blind. The goal is removing noise, not signal. If the model starts asking for files it clearly needs, you've over-trimmed retrieval — loosen it. Done right you trim cargo, and quality is untouched.

See also:

How to reduce Claude Code token usage
The AI tool nobody at Uber could put down — adoption outrunning the budget.
Why Microsoft pulled Claude Code from its own devs — the same problem, resolved the blunt way.
Best Claude Code token optimizers
Best MCP servers for Claude Code
LLM API token pricing