AI Coding Agent Token Costs

Current per-token prices for Claude, GPT-4.1, and Gemini 2.5 Flash, the output premium, prompt-caching discounts, and real-world session costs — all sourced from primary documentation and verified reporting.

Profile photo of Paul Irolla

By Paul Irolla

Founder · AI & developer tools · Tokenade

Ph.D. in AI · builds token-optimization tooling for AI coding agents

View author page
15 min readUpdated
Cite this page

Key figures

TL;DR
  • $3 / $15
    Claude Sonnet 4.6 input / output per million tokens — the most common coding default
    Anthropic, Claude API Pricing docs, platform.claude.com (2026)
  • output tokens cost roughly five times input tokens across Claude models
    Anthropic, Claude API Pricing docs, platform.claude.com (2026)
  • 10%
    cost of a cached input token vs. a fresh one under Anthropic's prompt caching
    Anthropic, Prompt Caching docs, platform.claude.com (2026)
  • $500–$2,000
    monthly Claude Code spend per engineer at Uber after 84% agentic adoption
    The Information / Fortune, reported May 2026
  • ≈85%
    share of agentic session cost driven by input tokens, not output
    Vantage, 'The Hidden Cost Driver in Agentic Coding Sessions' (2026)
  • $1.3M
    OpenAI token bill in 30 days running 100 Codex agents on OpenClaw
    Tom's Hardware / TheNextWeb, reporting on Peter Steinberger (2026)
  • 50%
    Anthropic Batch API discount on both input and output tokens (async, ≤24 h)
    Anthropic, Claude API Pricing docs, platform.claude.com (2026)
Running an AI coding agent in 2026 is a billing event driven by a small set of numbers: the per-million-token rates for the model you pick, the ratio between output and input prices, the caching discount on repeated context, and — crucially — how many tokens an agentic session actually burns. This page collects those numbers from primary sources so they are in one place. All prices are quoted as of June 2026. Model versions, context windows, and rates change frequently; re-verify at the primary source before quoting commercially.

Key Takeaways

Key numbers at a glance

  • Claude Sonnet 4.6 costs $3/MTok input and $15/MTok output — the most common coding default. [1]
  • Output tokens cost ≈5× more than input tokens across all Claude model tiers. [1]
  • Prompt-cache reads cost 10% of fresh input on Anthropic; cache writes carry a 1.25× premium for 5-min TTL or 2× for 1-hour TTL. [2]
  • Agentic coding tasks consume up to 1,000× more tokens than standard code-chat — a figure reported across multiple industry sources as real-world usage patterns have become clear. [6]
  • A typical 50-turn agentic session uses roughly 1 million input tokens and 40,000 output tokens — a 25:1 ratio — and costs $3–$6 on Sonnet. [7]
  • Uber burned its entire 2026 AI budget in four months on Claude Code: monthly spend ran $500–$2,000 per engineer at 84% agentic adoption. [10]
  • Re-sent context makes up ≈62% of the typical agent bill — the largest single lever is what the model reads, not what it writes. [8]

How much does a million tokens cost on Claude in 2026?

Claude's three production tiers — Opus 4.8, Sonnet 4.6, and Haiku 4.5 — span a 5× price range from the top to the bottom of the stack. Claude Opus 4.8 costs $5.00 per million input tokens and $25.00 per million output tokens. [1] Claude Sonnet 4.6 costs $3.00 per million input tokens and $15.00 per million output tokens. It is the most commonly used coding-agent model, balancing capability and cost. [1] Claude Haiku 4.5 costs $1.00 per million input tokens and $5.00 per million output tokens — used for grunt-work steps where lower latency and cost matter more than peak reasoning. [1]
ModelInput ($/MTok)Output ($/MTok)Output/Input ratioContext window
Claude Opus 4.8$5.00$25.00200K
Claude Sonnet 4.6$3.00$15.00200K
Claude Haiku 4.5$1.00$5.00200K

Source: Anthropic, Claude API Pricing docs (platform.claude.com, June 2026) [1]

All three models support a 200K-token context window as the billing unit. Anthropic removed a long-context surcharge that previously applied above ≈200K tokens when Claude Opus 4 and Sonnet 4 were launched in 2025; the flat-rate window has been maintained for the current generation.

How does Claude pricing compare to GPT-4.1 and Gemini 2.5 Flash?

Claude is not the only option for coding agents. GPT-4.1 from OpenAI and Gemini 2.5 Flash from Google are the most commonly benchmarked alternatives. GPT-4.1 costs $2.00 per million input tokens and $8.00 per million output tokens, with a 4× output-to-input ratio. [3] OpenAI applies automatic prompt caching on GPT-4.1 at a 75% discount on cached input tokens — higher than Anthropic's 90% savings but different in structure: OpenAI caches automatically with no explicit API call, while Anthropic requires a cache_control breakpoint. [4] Gemini 2.5 Flash costs $0.30 per million input tokens and $2.50 per million output tokens for non-thinking output. [5] Thinking tokens on Gemini 2.5 Flash are billed separately at $3.50 per million, meaning reasoning-heavy tasks cost substantially more than the headline rate suggests. [5]
ModelInput ($/MTok)Output ($/MTok)Cache read discountContext window
Claude Sonnet 4.6$3.00$15.00−90% (10% of input)200K
Claude Opus 4.8$5.00$25.00−90% (10% of input)200K
GPT-4.1$2.00$8.00−75% (automatic)1M
Gemini 2.5 Flash$0.30$2.50 (non-thinking)1M

Sources: Anthropic pricing [1], OpenAI pricing [3], Google AI pricing [5] (June 2026)

The headline rates for GPT-4.1 and Gemini 2.5 Flash are lower than Claude Sonnet 4.6 on input. Whether that translates to a lower total bill depends on session structure, because the input-to-output token ratio and caching behaviour interact with price in ways that can reverse the ranking.

Why does the output/input price ratio matter so much for agents?

Output tokens cost 5× more than input tokens across the entire Claude model line — $25 vs $5 per million on Opus 4.8, $15 vs $3 on Sonnet 4.6, $5 vs $1 on Haiku 4.5. [1] At face value, this makes output the expensive side of the ledger. The counter-intuitive reality for coding agents is that input is the higher-cost line item in practice. A 2026 analysis of production agentic sessions found that input tokens drive approximately 85% of total session cost. [7] This is because the agent re-reads its full context — system prompt, tool definitions, conversation history, file contents, command output — on every single tool call. A 50-turn debugging session accumulates that context turn by turn, so by turn 30 the agent is paying to re-read 25,000–35,000 tokens of accumulated history on every request. [7] The output premium (5×) is real, but it applies to a much smaller volume. A typical 50-turn session generates roughly 40,000 output tokens and 1,000,000 input tokens — a 25:1 ratio. [7]
25:1
input-to-output token ratio in a typical 50-turn agentic session
Source: Vantage, 2026 [7]
≈85%
of total agent cost driven by input tokens, not output
Source: Vantage, 2026 [7]
The practical implication: optimising what the agent reads on each turn is more impactful than making it write shorter responses. Retrieve only relevant context, prune conversation history aggressively, and route simpler steps to lighter models. For further background on the context window mechanics that drive this accumulation pattern, see also the prompt caching glossary entry.

How much does prompt caching save?

Anthropic's prompt caching bills cache-read tokens at 10% of the base input price — a 90% saving on repeated context. [2] The cache-write cost is 1.25× the base input price for a 5-minute TTL, or 2× for a 1-hour TTL. [2] The breakeven is favourable: with the 5-minute cache, a single re-read pays off the write premium; with the 1-hour cache, two re-reads are enough. For a 50-turn session on Sonnet 4.6 where the system prompt and tool definitions (10,000 tokens) are stable across all turns:
  • Without caching: 10,000 × 50 × $3 / 1,000,000 = $1.50 just for that static context
  • With caching (1 write + 49 reads): (10,000 × 1.25 × $3 / 1,000,000) + (10,000 × 49 × $0.30 / 1,000,000) = $0.038 + $0.147 = $0.185 — a reduction of roughly 88%
The requirement is that the cached prefix must be byte-identical across requests. Any change to the cached block — including reordering instructions — invalidates the cache. One documented failure mode: a developer left Claude Code running overnight in an automation loop; Anthropic had silently reduced the default cache TTL from one hour to five minutes, so an 800,000-token context was rebuilt from scratch on every 30-minute cycle — 48 times overnight — resulting in a $6,000 bill by morning. [12] GitHub measured the impact of systematic token optimisation in their own production agentic workflows in 2026, achieving up to a 62% reduction in token spend by pruning unused MCP tool registrations, maintaining stable tool definitions across turns, and running daily auditor agents. [9] For the underlying mechanics of how tokens are counted, the tokenizer and token glossary entries provide reference definitions.

How many tokens does a real coding session consume?

Production usage data from 2026 makes clear that agentic coding costs are not theoretical. At the session level, Vantage's analysis of production teams found a typical 50-turn coding session at approximately 1 million input + 40,000 output tokens, with indicative costs ranging from $0.60 (lighter models) to $6.00 (Opus-class), and average monthly spend of $400–$1,500 for developers running coding agents full-time. [7] At the organisation level, Uber burned its entire 2026 AI budget in four months after rolling out Claude Code to approximately 5,000 engineers. Monthly costs ran $500–$2,000 per engineer for power users; overall adoption reached 84% agentic and 95% of engineers using AI tools monthly. [10] Microsoft's response was to cancel Claude Code licences across its Experiences and Devices division and redirect engineers to GitHub Copilot CLI ($39/seat flat vs Claude Code's $20 seat + usage), with a deadline of 30 June 2026. [11] At the extreme end, Peter Steinberger (creator of OpenClaw) ran 100 Codex agents simultaneously for a full month, accumulating 603 billion tokens across 7.6 million requests — a total bill of $1,305,088.81 covered by OpenAI. [13] A research paper on agent context management (arXiv:2508.21433, presented at the NeurIPS 2025 DL4Code workshop) confirmed that re-read context is the dominant cost driver and found that simple observation masking — replacing old tool outputs with placeholders — halves agent cost while matching the solve rate of more expensive LLM-summarisation approaches. [6] LeanOps audited 30 engineering teams running agentic AI in production between March and May 2026, finding that re-sent context accounts for ≈62% of the total bill at those organisations. [8]
Session typeApprox. input tokensApprox. output tokensIndicative cost (Sonnet 4.6, no cache)
Typical 50-turn agentic session≈1,000,000≈40,000≈$3.60
Complex single debugging session500,000+variable$1.50+
Multi-agent runaway loop (11 days)billionsvariable$47,000 total [14]

Sources: Vantage [7], dev.to postmortem [14]. Indicative costs use $3/MTok input + $15/MTok output; no caching applied.

What discounts are available beyond prompt caching?

Three additional cost-reduction mechanisms are available on Anthropic's platform as of 2026. Batch API (50% discount). The Anthropic Message Batches API processes requests asynchronously within 24 hours at exactly 50% off both input and output tokens. Claude Sonnet 4.6 drops from $3/$15 to $1.50/$7.50 per million; Haiku 4.5 from $1/$5 to $0.50/$2.50. [1] There is no quality difference between batch and synchronous responses — only timing. This is suited to offline code-analysis pipelines, nightly evaluation runs, and bulk refactoring jobs that can tolerate hours of latency. Combining caching and Batch API. The two discounts stack. A batch request that also hits the cache on a stable 10,000-token system prompt pays 50% off the write premium (1.25× × 0.5 = 0.625×) and 50% off cache-read tokens (0.1× × 0.5 = 0.05× — 5% of the base input price). [2] Model tier routing. Because Haiku 4.5 costs one-third of Sonnet 4.6 at every tier, routing simpler agent steps — file listing, output formatting, search-equivalent reads — to the lighter model is a compound saving. The LeanOps audit found that teams achieving 50–70% cost reductions within two weeks consistently combined per-user budget caps, prompt caching, model-tier routing, and context-window pruning. [8]

What the numbers mean in aggregate

Cost levers ranked by typical impact

Reduce re-sent context (pruning, retrieval)≈62% of bill addressed
Prompt caching on stable context (−90% on cached reads)high leverage
Model-tier routing (Haiku = 1/3 of Sonnet price)medium leverage
Batch API for non-real-time workloads (−50%)medium leverage
Reducing output length ("be concise")low leverage

Sources: LeanOps audit [8], Vantage analysis [7], Anthropic pricing [1] (2026). Bar widths are qualitative rankings, not a single computed metric.

The figures above point to a consistent finding across multiple 2026 sources: what the model reads dominates costs, and reducing output length is the lowest-leverage optimisation available. The full practical playbook for reducing AI coding agent token usage covers these techniques in detail. Tokenade applies them automatically — context compression, cache-stable prefixes, and model-tier routing — without requiring manual instrumentation.

Methodology note

Per-token prices are from Anthropic's official pricing page (primary source, verified June 2026). Prompt-caching multipliers are from Anthropic's prompt caching documentation (primary source). OpenAI and Google prices are from their respective developer pricing pages. Session-volume figures are from the Vantage production analysis and the LeanOps industry audit of 30 production teams. The Uber and Microsoft figures are from The Information and Windows Central respectively. The OpenClaw figures are from Tom's Hardware and TheNextWeb, reporting on a public disclosure by Peter Steinberger. The $47,000 agent-loop and $6,000 overnight figures are from published postmortems on dev.to and MakeUseOf. The arXiv 2508.21433 findings are from a peer-reviewed paper presented at the NeurIPS 2025 DL4Code workshop. All figures should be re-verified at primary sources before citing in commercial contexts.

Sources and references

  1. [1]Anthropic. "Pricing — Claude API Docs". platform.claude.com, 2026. Link ↗
  2. [2]Anthropic. "Prompt caching — Claude API Docs". platform.claude.com, 2026. Link ↗
  3. [3]OpenAI. "OpenAI API Pricing". openai.com, 2026. Link ↗
  4. [4]OpenAI. "Prompt caching — OpenAI Platform Docs". platform.openai.com, 2026. Link ↗
  5. [5]Google. "Gemini Developer API pricing". ai.google.dev, 2026. Link ↗
  6. [6]Lindenbauer et al. (JetBrains Research). "The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management". arXiv:2508.21433, presented at NeurIPS 2025 DL4Code workshop. Link ↗
  7. [7]Vantage. "The Hidden Cost Driver in Agentic Coding Sessions in 2026". vantage.sh, 2026. Link ↗
  8. [8]LeanOps. "AI Agents Burn 50x More Tokens Than Chats". leanopstech.com, 2026. Link ↗
  9. [9]GitHub. "Improving token efficiency in GitHub Agentic Workflows". github.blog, 2026. Link ↗
  10. [10]The Information. "Uber CTO Shows How Claude Code Can Blow Up AI Budgets". theinformation.com, May 2026. Link ↗
  11. [11]Windows Central. "Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI". windowscentral.com, 2026. Link ↗
  12. [12]MakeUseOf. "Someone left Claude Code running overnight, and it cost $6,000". makeuseof.com, 2026. Link ↗
  13. [13]Tom's Hardware. "OpenClaw creator burns through $1.3 million in OpenAI API tokens in a single month". tomshardware.com, 2026. Link ↗
  14. [14]Anhaia, Gabriel. "The Agent That Spent $47K on Itself: An Autonomous-Loop Postmortem". dev.to, 2026. Link ↗

Up to 88% fewer tokens. Zero config.

Tokenade is the simplest way to cut what your coding agent sends to the model — set it up once, save on every prompt. Works with Claude Code, Cursor, Codex, Copilot & more.