Last updated: May 2026
Multi-step AI agents don't cost 10× one step — they cost ~55× because context accumulates. See the real math, per-step breakdown, and what prompt caching saves.
Why agentic loops are expensive — and non-obvious
Think of it like a meeting transcript. Every step, the AI re-reads everything that happened before. Step 8 reads 7 steps of history before responding. This means input tokens grow with every turn — and total cost grows roughly quadratically with the number of steps.
A 10-step agent: Step 1 processes ~1,400 tokens of input. Step 10 processes ~7,600 tokens of input. Sum it up and the loop consumes N×(N+1)/2 times the per-step input — about 55× for 10 steps. This calculator shows that in real dollars.
Model & Pricing
Context Configuration
Tokens that are constant every step vs. tokens that grow with the loop
Volume
Results
| Step | Input Tokens | Output Tokens | Step Cost | Cumulative Cost |
|---|
Prompt Caching Savings
Assumes 90% cache hit rate; cached tokens cost 10% of normal input price
| Scenario | Cost per Loop | Cost per Day | Cost per Month | Savings vs. No Cache |
|---|
Model Comparison — Same Loop Configuration
| Model | Input Price | Output Price | Cost / Loop | Cost / Day | Cost / Month |
|---|
Pricing based on publicly available API rates as of May 2026 and may change. Prompt caching availability and exact discount rates vary by provider. Use for estimation purposes only.
At each step N of an agentic loop, the model receives the entire conversation history as input context:
Output tokens are the same every step. So total input across the full loop is the sum from N=1 to N=steps:
The quadratic term (steps² / 2) is what makes long agentic loops expensive. Doubling the number of steps roughly quadruples the input token cost — not doubles it.
Prompt caching lets you cache the KV state of the system prompt and tool definitions (the constant prefix). With a 90% cache hit rate and 10% cached price, savings at each step = 0.9 × 0.9 × (sys_prompt + tool_defs) × input_price_per_token.
At step 1, the model reads only the system prompt + tool definitions + first user message. At step 10, it reads all of that plus 9 prior assistant responses and 9 additional user/tool messages. Summing input tokens from step 1 to step 10, you get roughly 1 + 2 + 3 + ... + 10 = 55 times the "unit" of per-step context. Output tokens are constant (one response per step), but input dominates cost for most models, especially at long context. This is called the N×(N+1)/2 accumulation pattern.
Prompt caching saves the most when your system prompt and tool definitions are large (1,000+ tokens) and your loop has many steps. For a loop with a 1,200-token constant prefix (system prompt + tools), 10 steps, and Claude 3.5 Sonnet pricing: uncached cost for those constant tokens = 10 × 1,200 × $3.00/1M = $0.000036. With 90% cache hit at 10% price: cached cost = 10 × 1,200 × $3.00/1M × (0.1 × 0.9 + 0.1) = $0.0000072. That's an 80% reduction on the constant prefix portion of your loop cost.
Agent verbosity directly controls future input cost — every token the model outputs at step N becomes part of the input at step N+1. Strategies: (1) Add "be concise" instructions to your system prompt. (2) Use structured output (JSON) instead of prose — structured formats are typically 30–50% shorter. (3) Summarize completed sub-tasks rather than keeping full trace in context. (4) Use a "working memory" pattern: the agent maintains a compact state object rather than full conversational history.
Almost always cheaper to run shorter loops. A single 20-step loop has total input proportional to 20×21/2 = 210 "steps of context." Two 10-step loops have total input proportional to 2 × 10×11/2 = 110. So splitting a 20-step loop into two 10-step loops cuts input token cost by roughly 48%. The tradeoff is that splitting requires careful state handoff between loops, which adds engineering complexity.
As of mid-2026: Anthropic Claude 3.x and Claude 4.x series support prompt caching with explicit cache control headers — you mark the prefix to cache, and subsequent calls hitting that prefix are billed at 10% of normal input price. OpenAI GPT-4o models have automatic prompt caching that triggers for prompts over 1,024 tokens with a 50% discount on cached tokens. Google Gemini 1.5 and 2.0 models support "context caching" for prefixes over 32k tokens (minimum cache size). For short system prompts under 1k tokens, Anthropic's explicit caching is the most accessible option.
Most developers are surprised by agentic loop costs because they reason about them linearly: "10 steps = 10 API calls, so 10× the cost of one call." But this misses how context windows work. Each call in a loop carries forward all prior messages, so the input size — and therefore input cost — grows with every step.
The classic example: you build a research agent that calls 8 tools in sequence. You benchmark the cost of a single tool call ($0.002) and estimate your 8-step loop at $0.016. The real cost is closer to $0.072 — because step 8 processes 7 prior assistant responses plus 7 prior tool results as input context, not just one tool call. Steps 1–7 also accumulated context, making the total much higher than expected.
The most impactful lever is reducing output tokens per step, because these accumulate as future input. A 50% reduction in output verbosity cuts total input token cost by roughly 25% on a 10-step loop — a bigger effect than switching from GPT-4o to GPT-4o mini on the output side alone.
Prompt caching is the second most impactful optimization when your constant prefix (system prompt + tool definitions) is large. At 1,200 tokens of constant prefix across 10 steps with Claude 3.5 Sonnet, caching saves roughly $0.000032 per loop — not huge for one loop, but at 20 loops/day and 30 days/month that's ~$0.19/month per agent instance. At enterprise scale (10,000 loops/day), that's $960/month from just enabling caching.
What is the fastest way to cut agentic loop costs by 50%?
Switch to a cheaper model for non-critical steps. If your agent uses Claude 3.5 Sonnet for all 10 steps, using Claude 3.5 Haiku for steps 1–7 and Sonnet only for the final synthesis step cuts costs by roughly 70%. The key insight: most intermediate steps in a research or coding agent don't require the full capabilities of a frontier model — they're doing simple tool dispatching or formatting. Reserve the expensive model for synthesis, judgment, and final output.
How do token counts change if my agent uses function calling?
Function calling (tool use) typically adds 200–600 tokens per call to your context: the tool call JSON, the tool result, and any parsing overhead. These count as part of the accumulated context and grow with each step. If your agent makes one function call per step with a 300-token tool result, that adds to the input_per_step baseline — your effective input accumulation is higher than the raw "input per step" figure alone.