Rate Limit

What is a rate limit?

A rate limit is a hard ceiling that an API provider places on the volume of requests or tokens a single client (identified by API key or organisation) can consume within a rolling time window, typically per minute or per day. Calls that exceed the limit receive a 429 Too Many Requests response and must be retried after a delay. For LLM APIs, rate limits come in two flavours that apply at the same time: requests per minute (RPM) and tokens per minute (TPM). Whichever runs out first blocks further calls, so a single large prompt can exhaust your TPM budget even if you've only sent one request.
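To make the interaction between the two limits concrete, here is a minimal sketch of a client-side model that tracks both over a 60-second sliding window. The class name `DualLimitWindow`, its parameters, and the limits in the example are illustrative assumptions; real providers may account for usage differently (for instance with token-bucket refill rather than a sliding window).

```python
from collections import deque


class DualLimitWindow:
    """Hypothetical client-side model of RPM and TPM limits over a sliding window."""

    def __init__(self, rpm: int, tpm: int, window_s: float = 60.0):
        self.rpm = rpm
        self.tpm = tpm
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens) pairs for accepted requests

    def _evict(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            self._events.popleft()

    def try_send(self, tokens: int, now: float) -> bool:
        """Return True and record the request if it fits both limits.

        A False result models the case where the provider would answer 429.
        """
        self._evict(now)
        used_tokens = sum(t for _, t in self._events)
        if len(self._events) + 1 > self.rpm:
            return False  # request-count limit is binding
        if used_tokens + tokens > self.tpm:
            return False  # token limit is binding
        self._events.append((now, tokens))
        return True


if __name__ == "__main__":
    window = DualLimitWindow(rpm=50, tpm=40_000)  # assumed limits
    print(window.try_send(tokens=38_000, now=0.0))  # True: one large prompt fits
    print(window.try_send(tokens=5_000, now=1.0))   # False: TPM exhausted after 1 request
    print(window.try_send(tokens=5_000, now=61.0))  # True: the large prompt has aged out
```

In this example the second request is rejected with 49 of 50 requests still unused, which is the TPM-before-RPM pattern described above.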

Why rate limits matter for AI coding agents in 2026

Coding agents are rate-limit-hostile by design. They issue many small, rapid calls — tool-use loops, parallel file reads, inline completions — and each call carries a prompt that may already be hundreds of tokens long before any code is attached. Under sustained agent load, TPM limits are typically the binding constraint, not RPM.

The compounding effect

A typical agent loop looks like: read file → build prompt (300–800 tokens of boilerplate, plus the file content) → call the model → parse result → repeat. If the boilerplate is verbose, every iteration burns tokens it didn't need to. Once the loop hits the TPM ceiling, it stalls while the client backs off exponentially, adding wall-clock latency that accumulates across a session. Token reduction directly expands your effective rate-limit headroom: fewer tokens per call means more calls per minute before the ceiling is reached. See reducing AI coding agent token usage for concrete techniques.
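The headroom argument can be put in numbers. The sketch below is a rough back-of-the-envelope calculation, not a provider formula: the function name, the 50 RPM / 40k TPM limits, and the 1,200-token file are assumptions chosen for illustration, and it counts only prompt tokens even though some providers also count output tokens against TPM.

```python
def max_calls_per_minute(rpm: int, tpm: int, tokens_per_call: int) -> int:
    """Calls per minute sustainable under both limits, whichever binds first."""
    if tokens_per_call <= 0:
        raise ValueError("tokens_per_call must be positive")
    return min(rpm, tpm // tokens_per_call)


if __name__ == "__main__":
    RPM, TPM = 50, 40_000   # assumed limits
    FILE_TOKENS = 1_200     # assumed file content per call

    verbose = max_calls_per_minute(RPM, TPM, 800 + FILE_TOKENS)
    lean = max_calls_per_minute(RPM, TPM, 300 + FILE_TOKENS)
    print(verbose)  # 20 calls per minute with 800 tokens of boilerplate
    print(lean)     # 26 calls per minute with 300 tokens of boilerplate
```

Under these assumptions, trimming boilerplate from 800 to 300 tokens buys six extra calls per minute, and in both cases TPM rather than RPM is the limit that binds.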

Tier progression

Most providers (Anthropic, OpenAI, Google) offer tiered rate limits that scale with cumulative spend. New API keys start at conservative limits, often 40k–100k TPM, that an active agent saturates quickly. Staying within the lower tiers is manageable only if each call is lean; bloated prompts are what push teams to pay for higher tiers earlier than necessary.

Multi-agent and parallel tool use

When an orchestrator spawns parallel sub-agents — common in RAG pipelines that fan out to multiple retrieval calls simultaneously — each sub-agent draws from the same TPM bucket. Five agents each sending 10k-token prompts in parallel is a 50k-token burst. Without token discipline, parallel agents hit the ceiling on the first coordinated action.
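One way an orchestrator might avoid the coordinated burst is to group sub-agent calls into waves that each fit the shared budget, sending one wave per rate-limit window. The following is a sketch of that idea only; `plan_waves` and the 40k budget are hypothetical, and a production scheduler would also need to account for tokens already spent in the current window.

```python
def plan_waves(prompt_tokens: list[int], budget_per_window: int) -> list[list[int]]:
    """Greedily group prompts (by index) into waves that each fit the shared budget."""
    waves: list[list[int]] = []
    current: list[int] = []
    used = 0
    for i, tokens in enumerate(prompt_tokens):
        if tokens > budget_per_window:
            raise ValueError(f"prompt {i} ({tokens} tokens) can never fit the budget")
        if used + tokens > budget_per_window:
            # Current wave is full; start a new one for the next window.
            waves.append(current)
            current, used = [], 0
        current.append(i)
        used += tokens
    if current:
        waves.append(current)
    return waves


if __name__ == "__main__":
    # Five sub-agents with 10k-token prompts against an assumed 40k TPM budget.
    print(plan_waves([10_000] * 5, budget_per_window=40_000))
    # [[0, 1, 2, 3], [4]]
```

With the five 10k-token prompts from the example above, four agents go out immediately and the fifth waits for the next window instead of triggering a 429 for the whole group.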

MCP and rate limits

Model Context Protocol servers that proxy tool calls to LLM APIs introduce an extra layer: when the server calls the model with your API key, its outbound requests count against your quota. A poorly implemented MCP tool that sends the full tool-call transcript on every invocation multiplies your token spend, and your rate-limit exposure, with each hop.
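The cost of resending the full transcript can be estimated with a simplified model. The sketch below assumes every hop adds the same number of new tokens to the transcript; the function name and the figures are illustrative, not measurements from any MCP implementation.

```python
def cumulative_tokens(hops: int, tokens_per_hop: int, resend_full_transcript: bool) -> int:
    """Total tokens sent across `hops` tool invocations.

    Simplifying assumption: each hop adds `tokens_per_hop` new tokens.
    """
    if resend_full_transcript:
        # Hop k resends everything accumulated so far: k * tokens_per_hop tokens.
        return sum(k * tokens_per_hop for k in range(1, hops + 1))
    # Otherwise each hop sends only its own new tokens.
    return hops * tokens_per_hop


if __name__ == "__main__":
    print(cumulative_tokens(10, 500, resend_full_transcript=True))   # 27500
    print(cumulative_tokens(10, 500, resend_full_transcript=False))  # 5000
```

Under this model, ten hops of 500 tokens each cost 5.5 times more when the whole transcript is resent, and the gap widens with every additional hop because the total grows quadratically rather than linearly.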

Diagnosing a rate-limit problem

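Check the response headers. Most providers report remaining quota on every response, for example in x-ratelimit-remaining-tokens or a provider-specific equivalent. Log these values so you can watch the budget shrink before the first 429 arrives.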
Distinguish 429 from 503. A 429 means you exceeded your own quota; a 503 is a capacity error on the provider's side. Retry logic for both looks similar, but the root cause differs: trimming prompts or slowing down helps with a 429 and does nothing for a 503.
Profile prompt size, not just request count. Teams that only monitor RPM miss TPM exhaustion entirely. Profile the token count of each prompt at the agent level.
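The first two checks can be sketched as a few small helpers. This is a minimal illustration under stated assumptions: the function names are hypothetical, the default header name is the one mentioned above and varies by provider, and the retry delay uses exponential back-off with full jitter unless the response carries the standard Retry-After header expressed in seconds.

```python
import random
from typing import Mapping, Optional


def classify(status: int) -> str:
    """Map an HTTP status code to a coarse root cause."""
    if status == 429:
        return "rate_limited"       # client exceeded its own quota
    if status == 503:
        return "provider_capacity"  # overload on the provider's side
    return "other"


def remaining_tokens(
    headers: Mapping[str, str],
    name: str = "x-ratelimit-remaining-tokens",  # provider-specific; adjust as needed
) -> Optional[int]:
    """Read the remaining-token header, matching the name case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get(name.lower())
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def retry_delay(
    attempt: int,
    headers: Mapping[str, str],
    base_s: float = 1.0,
    cap_s: float = 60.0,
) -> float:
    """Honour Retry-After (seconds) if present, else back off exponentially with jitter."""
    lowered = {k.lower(): v for k, v in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to back-off
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
```

Logging `classify(status)` alongside `remaining_tokens(headers)` on every response makes it possible to tell after the fact whether a stall came from your own quota or from provider-side capacity.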

When rate limits are NOT the problem

Single-user, low-frequency usage. Interactive chat with occasional code questions rarely approaches TPM ceilings. Rate limits become a concern at agent-loop scale, not conversational scale.
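Batch / async workloads. Most providers offer a batch API with higher throughput limits and lower per-token cost. Batch jobs are typically metered against their own quota and can take anywhere from minutes to hours to complete, so if your use case tolerates that latency, the batch path sidesteps your real-time rate limits.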
Self-hosted models. If you run Llama, Mistral, or a quantised open-weight model locally, there is no external rate limit — just GPU memory and inference throughput. Rate limits are a cloud-API constraint.