How to Reduce Codex Token Usage

Codex bills you for eager file reads, raw command output, the MCP manifest and a growing transcript. Here's how to cut each one without losing quality.

By Paul Irolla

Founder · AI & developer tools · Tokenade

Ph.D. in AI · builds token-optimization tooling for AI coding agents


How do you reduce Codex token usage?

You reduce Codex token usage by attacking the four things that actually dominate its bill: eager whole-file reads, raw shell output dumped into the transcript, the tool manifest re-sent on every turn, and a conversation history that re-bills every past read. Trim each one and the cost drops sharply without hurting output quality, because you're cutting noise, not signal.

This is the Codex-specific companion to How to reduce AI coding agent token usage. If you want the agent-agnostic picture first, start there. Here I map those same levers onto OpenAI's Codex CLI specifically: how it reads files, which commands it runs, and how its transcript grows under GPT-5.5 pricing. At $5 per million input tokens and $30 per million output tokens, that is the most expensive output rate among the agents I run daily, so the math matters more here than almost anywhere else.

Why does Codex burn so many tokens?

Codex burns tokens primarily because it reads more than it needs, then re-reads it on every subsequent turn. That's not a Codex bug; it's how every agentic loop works. On each step, the agent re-sends the entire conversation history plus all tool results back to the model. So any oversized read you incur early gets paid again on every turn that follows. With GPT-5.5's $30/MTok output rate sitting on top, a sloppy session adds up fast. Four patterns drive most of the bill:

Eager whole-file reads. Asked to "understand" a module, Codex defaults to reading the full file. A 500-line TypeScript file is roughly 5,000–7,000 tokens, and Codex may pull a dozen of them before writing a line. If the context window carries those reads across 20 turns, you've paid for each one 20 times.

Unfiltered shell output. Codex runs commands (git status, npm test, tsc --noEmit, build scripts) and the raw output flows straight into the transcript. A failed test run can emit 15,000 tokens of stack traces, green checkmarks and timing tables. Codex needed maybe 50 of them: the failing test name and the broken assertion. The rest is cargo, and you re-pay for it every turn it stays in context. This is the single lever I see ignored most often; see output filtering for the principle.

The tool manifest. Every tool Codex can call, including every MCP server you've connected, advertises its definition on every turn, whether or not it's used. Five servers with ten tools each is a fixed overhead that compounds silently over a long session. Best MCP servers for Claude Code covers which ones actually earn their keep; the same discipline applies to Codex.

An unbounded transcript. A Codex session that sprawls across several tasks, or revisits the same files repeatedly, builds a transcript that costs more to re-read than the original work. Every new turn pays for every prior turn.

Figuring out which pattern dominates your sessions is step one. Watch your usage dashboard: the answer usually points straight at the right lever.
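To make the re-billing effect concrete, here is a rough back-of-the-envelope sketch. It assumes the figures quoted above (a dozen files at about 6,000 tokens each, kept in context for 20 turns, billed at the $5/MTok input rate) and ignores prompt caching and output tokens entirely. The function name `resend_cost_usd` is made up for this illustration.

```python
def resend_cost_usd(read_tokens: int, turns_in_context: int,
                    input_usd_per_mtok: float = 5.0) -> float:
    """Input cost of tool results that are re-sent on every turn they stay in context.

    Simplification: no prompt caching, and every read is present for all turns.
    """
    billed_tokens = read_tokens * turns_in_context
    return billed_tokens / 1_000_000 * input_usd_per_mtok


# Twelve 500-line files at roughly 6,000 tokens each, carried for 20 turns.
files, tokens_per_file, turns = 12, 6_000, 20
billed = files * tokens_per_file * turns
total = resend_cost_usd(files * tokens_per_file, turns)
print(f"{billed:,} input tokens billed, about ${total:.2f}")
# prints: 1,440,000 input tokens billed, about $7.20
```

The useful part is the multiplication, not the exact dollar figure: halving either the size of a read or the number of turns it survives halves that line item.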

What are the biggest levers for Codex specifically?

The biggest levers are file retrieval, output filtering and tool pruning — roughly in that order for most Codex sessions.

Make Codex search instead of read

Semantic retrieval — finding code by meaning rather than by reading whole files — is where the savings concentrate. Instead of "read auth.ts, read session.ts, read middleware.ts," you want Codex to retrieve the three functions that actually matter. That's the difference between 6,000 tokens and 600. Semantic code search and the embeddings behind it are what make this possible; conceptually it's RAG applied to your repository. A cheaper habit you can adopt today: ask Codex to read the skeleton first — signatures, imports, type definitions — and only pull a function body when it needs to change it. Skeleton compression keeps every interface Codex reasons over while dropping the line-by-line bulk.
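As an illustration of what a skeleton read keeps and drops, here is a minimal sketch using Python's standard `ast` module on a Python file. The article's examples are TypeScript, so treat this as an analogy rather than a description of how Codex or any particular tool does it. The `skeleton` helper and the sample module are hypothetical.

```python
import ast


def _signature(fn) -> str:
    """Render a function header with its body replaced by '...'."""
    prefix = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(fn.returns)}" if fn.returns else ""
    return f"{prefix} {fn.name}({ast.unparse(fn.args)}){returns}: ..."


def skeleton(source: str) -> str:
    """Keep imports and top-level class/function signatures; drop every body.

    Simplification: class bases, decorators and nested definitions are omitted.
    """
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append("    " + _signature(item))
    return "\n".join(lines)


SAMPLE = '''
import hashlib
from typing import Optional

def verify_token(token: str, secret: str) -> bool:
    digest = hashlib.sha256((token + secret).encode()).hexdigest()
    return digest.startswith("00")

class SessionStore:
    def get(self, session_id: str) -> Optional[dict]:
        return None
'''

print(skeleton(SAMPLE))
```

Every interface survives (names, parameters, return types), while the line-by-line implementation is gone. A function body would only be fetched when the agent actually needs to change it.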

Filter command output before it hits context

When Codex runs a noisy command, you only want the part that changes its next decision. A passing test suite needs a one-line "all green," not 300 lines of checkmarks. A failing one needs the error and the file:line — not the framework's full traceback. Piping verbose commands through a filter, or having Codex grep the output instead of reading it whole, routinely strips 90%+ of a command's tokens with zero loss of usable information.
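A filter of this kind can be very small. The sketch below is one possible shape, not a reference implementation: the regular expression, the `filter_output` name and the sample test log are all assumptions, and the patterns would need tuning for your own test runner.

```python
import re

# Lines that usually change the agent's next decision: failure markers,
# exception names, and anything that looks like file.ext:line.
KEEP = re.compile(r"(FAIL|Error|error TS\d+|\S+\.\w+:\d+)")


def filter_output(raw: str, max_lines: int = 20) -> str:
    """Keep only lines matching KEEP and report how many were suppressed."""
    lines = raw.splitlines()
    kept = [ln for ln in lines if KEEP.search(ln)][:max_lines]
    if not kept:
        return f"no failures matched ({len(lines)} lines suppressed)"
    dropped = len(lines) - len(kept)
    return "\n".join(kept + [f"[{dropped} lines suppressed]"])


# A made-up test log: 300 passing lines around one real failure.
RAW = "\n".join(
    [f"PASS src/module_{i}.test.ts" for i in range(300)]
    + [
        "FAIL src/auth.test.ts",
        "  AssertionError: expected 401 but received 200",
        "    at src/auth.test.ts:42",
        "Tests: 1 failed, 300 passed",
    ]
)

print(filter_output(RAW))
```

On this sample the 304-line log shrinks to the failing test name, the assertion, the file:line, and a count of what was suppressed. Keeping that count is a deliberate choice: the agent can see that output was trimmed and ask for more if the filter was too aggressive.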

Prune the tool manifest

Disconnect MCP servers you aren't using in the current task. If Codex is debugging a CSS bug, it does not need your database MCP, your Stripe MCP and your browser-automation MCP all advertising themselves every turn. Lazy loading — only exposing a tool's full schema when it's about to be called — is the structural fix; manual pruning is the version you can do tonight.
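The lazy-loading idea can be sketched in a few lines. This is a toy model only: `Tool`, `LazyManifest` and the `run_query` tool are hypothetical names, and nothing here is an MCP or Codex API. It simply shows the split between what is advertised on every turn and what is fetched on demand.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str   # one-line description, advertised on every turn
    schema: dict   # full parameter schema, sent only when needed


@dataclass
class LazyManifest:
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def advertised(self) -> list:
        """Per-turn payload: names and one-line summaries only."""
        return [{"name": t.name, "summary": t.summary} for t in self.tools.values()]

    def expand(self, name: str) -> dict:
        """Full definition, fetched just before the tool is called."""
        t = self.tools[name]
        return {"name": t.name, "summary": t.summary, "schema": t.schema}


manifest = LazyManifest()
manifest.register(Tool(
    name="run_query",  # hypothetical database tool
    summary="Run a read-only SQL query",
    schema={
        "type": "object",
        "properties": {"sql": {"type": "string"}, "timeout_s": {"type": "integer"}},
        "required": ["sql"],
    },
))

per_turn = len(json.dumps(manifest.advertised()))
on_demand = len(json.dumps(manifest.expand("run_query")))
print(f"advertised every turn: {per_turn} chars; full definition on demand: {on_demand} chars")
```

Character counts stand in for tokens here. The gap grows with every parameter a tool's schema carries, which is why a few heavyweight servers can dominate the fixed per-turn overhead.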

How does prompt caching change the Codex math?

Prompt caching changes it a lot, because a cached token costs roughly 10% of a fresh one. On GPT-5.5 that turns a $5/MTok input rate into an effective ≈$0.50/MTok for the cached prefix. The catch: caching only fires when the prefix is byte-for-byte identical across turns. A system prompt or an AGENTS.md-style config file that mutates between runs, because it carries in-progress notes, defeats the cache and gets re-billed at full price. Keep that prefix stable and concise and you get a double win: fewer tokens and a higher cache-hit rate. Prompt caching covers the mechanics.

It's also worth knowing the relative pricing across the agents you might switch between, because the right model can be a token-cost lever by itself. Claude Opus 4.8 runs $5/$25 per MTok, Sonnet 4.6 is $3/$15, and Haiku 4.5 is $1/$5, so for grunt work like reading-and-summarising, a smaller model on the same task can cost a fraction of GPT-5.5's $30/MTok output. Match the model to the job. The full breakdown lives in LLM API token pricing and AI coding agent token costs.
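For the caching arithmetic, a quick sketch shows how much one turn's input cost swings between a cache hit and a miss. It assumes the numbers quoted above ($5/MTok input, cached tokens at roughly 10% of that). The 40,000-token prefix and 2,000 new tokens are invented for the example, and real billing may differ in detail.

```python
def turn_input_cost_usd(cached_tokens: int, fresh_tokens: int,
                        usd_per_mtok: float = 5.0,
                        cached_ratio: float = 0.10) -> float:
    """Input cost of one turn, split into cached-prefix and fresh tokens."""
    cached = cached_tokens / 1_000_000 * usd_per_mtok * cached_ratio
    fresh = fresh_tokens / 1_000_000 * usd_per_mtok
    return cached + fresh


# Hypothetical turn: a 40,000-token stable prefix plus 2,000 new tokens.
hit = turn_input_cost_usd(cached_tokens=40_000, fresh_tokens=2_000)
# Same turn after the prefix changed, so nothing is served from cache.
miss = turn_input_cost_usd(cached_tokens=0, fresh_tokens=42_000)

print(f"cache hit: ${hit:.3f}  cache miss: ${miss:.3f}")
# prints: cache hit: $0.030  cache miss: $0.210
```

Under these assumptions a single edit to the prefix makes the turn about seven times more expensive on the input side, which is the case for keeping in-progress notes out of the config file.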

How to apply this today

  1. Measure first. Open your usage dashboard and note your baseline tokens-per-task. You can't tell which lever matters without it. The token counter and LLM token cost calculator help you turn token counts into dollars.
  2. Cap the reads. Tell Codex to retrieve symbols, not files — skeleton first, bodies on demand. This is usually the single biggest line item.
  3. Filter the noise. Pipe test/build/lint output through a summariser or grep before it reaches the transcript.
  4. Prune the manifest. Disconnect MCP servers irrelevant to the current task.
  5. Keep the prefix stable. Freeze your config/system prompt so prompt caching actually fires.
  6. Re-measure. Compare against your baseline. If a lever didn't move the number, it wasn't your bottleneck — move on.
If wiring all of this up by hand sounds like a second job, that's exactly why I built Tokenade. It applies semantic retrieval, output compression, skeleton reads and lazy MCP loading automatically — inside Codex, Claude Code, Cursor, Copilot, Windsurf and the rest — with a savings dashboard so you can see the effect per session instead of taking my word for it. It's free up to about 20M tokens a month; Pro is $9.90/mo (excl. tax) with three seats, and it's source-available under MIT, so you can read exactly what it sends and what it strips before you trust it with your repo.

What goes wrong (anti-patterns)

Compressing away the signal. The failure mode that actually hurts quality is dropping something load-bearing: a function signature Codex needed, or the one error line in a wall of output. The fix is structural: skeleton reads keep every interface, output filters keep the actual error. Trim noise, never signal.

Optimising the wrong lever. If your sessions are mostly Codex writing new code from a clear spec, your output tokens are already the irreducible part and file-read tuning won't help much. Measure before you optimise, or you'll spend effort where there's no headroom.

A bloated context degrades reasoning, not just cost. Models attend least to information buried deep in a long context window. A transcript stuffed with stale command output doesn't only cost more; it makes Codex worse at finding the relevant detail. Token reduction and quality move in the same direction here, which is the part people don't expect.

This is the discipline I'd group under context engineering: deciding deliberately what the model sees, rather than letting the agent fill the window with whatever it happened to read. It's the same skill that separates productive agentic coding from vibe-coding your way into a $40 session.