Last updated: May 2026
Estimate and compare costs across major AI model providers before committing to a model.
Prices last updated: April 2026. Pricing may change — always verify current rates at anthropic.com/pricing, openai.com/pricing, and ai.google.dev/pricing. Prices shown are standard API rates and do not include batch discounts, enterprise agreements, or free-tier credits.
Token-based pricing: All major AI APIs charge separately for input tokens (your prompt) and output tokens (the model's response). Output tokens are typically 3–5× more expensive than input tokens.
Formula: Cost per request = (input_tokens ÷ 1,000,000 × input_price) + (output_tokens ÷ 1,000,000 × output_price). Then multiply by daily requests and days per month for totals.
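The formula can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual implementation: the names `Pricing`, `cost_per_request`, and `projected_costs` are hypothetical, and the 30-day month is an assumption. The example figures are the Claude Sonnet 4 rates quoted on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per one million tokens (hypothetical helper type)."""
    input_per_million: float
    output_per_million: float


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Apply the per-request formula: tokens / 1,000,000 * price, summed over input and output."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


def projected_costs(
    input_tokens: int,
    output_tokens: int,
    pricing: Pricing,
    requests_per_day: int,
    days_per_month: int = 30,  # assumption: a 30-day month
) -> tuple[float, float, float]:
    """Return (per-request, daily, monthly) cost in USD."""
    per_request = cost_per_request(input_tokens, output_tokens, pricing)
    daily = per_request * requests_per_day
    monthly = daily * days_per_month
    return per_request, daily, monthly


if __name__ == "__main__":
    sonnet = Pricing(input_per_million=3.00, output_per_million=15.00)
    per_request, daily, monthly = projected_costs(1_000, 500, sonnet, 10_000)
    print(f"per request: ${per_request:.4f}  daily: ${daily:.2f}  monthly: ${monthly:.2f}")
```

Keeping prices in a small immutable record makes it easy to swap in a different model's rates without touching the arithmetic.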
What counts as a token? Roughly 4 characters of English text equals 1 token. A 500-word essay is approximately 650–700 tokens. System prompts, conversation history, and tool/function definitions all count toward input tokens.
Hidden costs to consider: Context caching (some models charge for cached tokens), fine-tuning, embeddings, image/audio inputs, and rate-limit overages are not included in this calculator. Always review your provider's full pricing page.
AI APIs charge per token - roughly 0.75 words, or about 4 characters of text. Every request has input tokens (your prompt and context) and output tokens (the model's response). These are priced separately, and input is almost always cheaper than output.
Worked example: 10,000 requests per day with 1,000 input tokens and 500 output tokens each, using Claude Sonnet 4 at $3/M input and $15/M output. Daily input cost = 10M tokens × $3/M = $30. Daily output cost = 5M tokens × $15/M = $75. Total: $105/day, or approximately $3,150 over a 30-day month.
Prompt caching (available on Claude and GPT-4o) can sharply reduce input costs for repeated context. Cached tokens are discounted by roughly 50% on GPT-4o and 90% on Claude. For RAG applications where a long, repeated system prompt dominates the input, enabling caching can cut your monthly bill by 60% or more.
Model prices shown reflect current published rates as of May 2026. Prices change frequently - verify against each provider's official pricing page before budgeting.
A token is the basic unit of text that AI models process. One token is roughly 0.75 words in English - "calculator" is one token, "calculatorapp.io" is two or three. A typical paragraph of 100 words is about 130-140 tokens. The exact tokenization varies by model: OpenAI uses tiktoken, Anthropic uses its own tokenizer, and Google uses SentencePiece.
Output tokens require the model to generate text sequentially, which is computationally intensive and cannot be parallelized. Input tokens are processed in parallel, making them faster and cheaper to compute. For most models, output tokens cost 3-5x more than input tokens per million. This is why concise prompts that minimize output length can meaningfully reduce costs.
Prompt caching stores frequently repeated content such as a long system prompt or document so it does not need to be reprocessed on every request. Claude's cache hits cost 90% less than standard input tokens. If your app sends a 10,000-token system prompt with every request, enabling caching turns that from your biggest cost driver into a minor line item.
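The effect of caching on input cost can be estimated with a sketch like the one below. It is a simplification under stated assumptions: every request is treated as a cache hit, and cache-write surcharges and cache expiry are ignored, so real bills will differ. The function name `daily_input_cost` and the $3/M input price in the example are illustrative, not taken from any provider SDK.

```python
def daily_input_cost(
    prefix_tokens: int,
    dynamic_tokens: int,
    requests_per_day: int,
    input_price_per_million: float,
    cache_discount: float = 0.0,
) -> float:
    """Estimate daily input cost in USD.

    prefix_tokens:  repeated content (e.g. the system prompt) eligible for caching
    dynamic_tokens: per-request content that is never cached
    cache_discount: fraction taken off the input price for cached tokens (0.9 = 90% off)

    Assumption: every request hits the cache; write surcharges and expiry are ignored.
    """
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_price = input_price_per_million * (1.0 - cache_discount)
    prefix_cost = prefix_tokens / 1_000_000 * cached_price
    dynamic_cost = dynamic_tokens / 1_000_000 * input_price_per_million
    return (prefix_cost + dynamic_cost) * requests_per_day


if __name__ == "__main__":
    # 10,000-token system prompt plus a 200-token user message, 1,000 requests/day.
    uncached = daily_input_cost(10_000, 200, 1_000, 3.00)
    cached = daily_input_cost(10_000, 200, 1_000, 3.00, cache_discount=0.9)
    print(f"uncached: ${uncached:.2f}/day  cached: ${cached:.2f}/day")
```

Separating the repeated prefix from the per-request content shows why caching matters most when the system prompt dominates the input.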
For simple high-volume tasks like classification, summarization, or extraction: use smaller models like Claude Haiku, GPT-4o mini, or Gemini Flash. They are 10-20x cheaper than frontier models with comparable quality on simpler tasks. For complex reasoning, coding, or nuanced generation: frontier models justify the higher cost. Use this calculator to compare costs at your specific request volume.
LLM API pricing is denominated per million tokens, where one token is roughly 0.75 words in English. The key variables in any cost estimate are the ratio of input to output tokens in your workload (output is almost always more expensive), the model tier you select, and whether your architecture benefits from prompt caching. Prices below reflect published rates as of May 2025 and change frequently, so always verify against provider pricing pages.
Context window size matters for workloads with long documents or extended conversations. Models with 1M token context windows (Gemini) can process entire codebases in a single request; the tradeoff is that longer contexts cost proportionally more in input token fees.
| Model | Input Cost (per 1M) | Output Cost (per 1M) | Context Window | Best For |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128k tokens | General purpose |
| GPT-4o mini | $0.15 | $0.60 | 128k tokens | Cost-efficient tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200k tokens | Long docs/coding |
| Claude 3 Haiku | $0.25 | $1.25 | 200k tokens | Fast/cheap tasks |
| Gemini 1.5 Pro | $1.25 | $5.00 | 1M tokens | Huge context needs |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M tokens | High-volume workloads |
| Llama 3 70B (self-hosted) | ~$0.50–$1.00 | ~$0.50–$1.00 | 8k tokens | Privacy/cost control |
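To compare the listed models for one workload, the table's prices can be fed through the per-request formula. The sketch below copies the prices from the table (the self-hosted Llama row is left out because it is a range, not a fixed rate). The names `PRICES`, `monthly_cost`, and `rank_models` are hypothetical, and the 30-day month is an assumption.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES: dict[str, tuple[float, float]] = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    input_price: float,
    output_price: float,
    days_per_month: int = 30,  # assumption: a 30-day month
) -> float:
    """Monthly cost in USD for a fixed per-request token profile."""
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days_per_month


def rank_models(input_tokens: int, output_tokens: int, requests_per_day: int) -> list[tuple[str, float]]:
    """Return (model, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = [
        (name, monthly_cost(input_tokens, output_tokens, requests_per_day, inp, out))
        for name, (inp, out) in PRICES.items()
    ]
    return sorted(costs, key=lambda item: item[1])


if __name__ == "__main__":
    for name, cost in rank_models(1_000, 500, 10_000):
        print(f"{name:<20} ${cost:,.2f}/month")
```

The ranking reflects price only. It says nothing about whether a cheaper model is good enough for the task.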
What is a token in LLM APIs?
A token is the smallest unit of text an LLM processes. In English, one token is roughly 0.75 words, so "calculator" is one token, but "extraordinary" might be two. Punctuation, whitespace, and special characters also contribute to the token count. Most providers offer free token-counting tools: OpenAI's online Tokenizer (built on the tiktoken library) and Anthropic's token-counting API show exactly how many tokens a piece of text will use before you send a request. For cost estimation, a rule of thumb of 1.3 tokens per word (or 750 words per 1,000 tokens) is accurate enough for budgeting.
How many tokens is a typical message?
A short question like "What is the capital of France?" is about 10 tokens. A typical email is 150–250 tokens. A one-page document is roughly 500–700 tokens, so a 10-page PDF of similar density is approximately 5,000–7,000 tokens. A full novel is 100,000–150,000 tokens, which can exceed a 128k-token context window and fits comfortably only in larger ones. For API cost estimation, model your average input as: system prompt tokens + conversation history tokens + current message tokens. Output is usually shorter than input for most task types except creative writing.
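The input model above (system prompt + conversation history + current message) implies that input grows with every turn of a chat. A minimal sketch, assuming every turn has the same size and the full history is resent on each request. The function name and the example sizes are hypothetical.

```python
def input_tokens_per_turn(
    system_tokens: int,
    user_tokens: int,
    assistant_tokens: int,
    turns: int,
) -> list[int]:
    """Return the input tokens billed on each turn of a conversation.

    Turn i (0-indexed) sends: system prompt + i prior exchanges + the new user message.
    Assumption: fixed message sizes and no trimming or summarizing of history.
    """
    if turns < 0:
        raise ValueError("turns must be non-negative")
    per_exchange = user_tokens + assistant_tokens
    return [system_tokens + i * per_exchange + user_tokens for i in range(turns)]


if __name__ == "__main__":
    per_turn = input_tokens_per_turn(system_tokens=500, user_tokens=50, assistant_tokens=150, turns=5)
    print(per_turn, "total input tokens:", sum(per_turn))
```

Because the history is resent each time, total input over a conversation grows roughly quadratically with the number of turns, which is why long chats cost more than their message count suggests.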
How do I reduce LLM API costs?
Six high-impact strategies: (1) Use a cheaper model: switching from GPT-4o to GPT-4o mini saves 94% at the listed rates. (2) Enable prompt caching: 90% discount on repeated system prompt tokens in Claude, 50% in GPT-4o. (3) Shorten system prompts: every token in your system prompt is billed on every request. (4) Set output token limits: use max_tokens to cap response length. (5) Cache responses for repeated queries. (6) Batch non-urgent requests: OpenAI's Batch API offers a 50% cost reduction for async workloads. Combined, these strategies can reduce costs by 80–95%.
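Two of the savings figures above can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical, the prices are the GPT-4o and GPT-4o mini rates from the table on this page, and the batch discount is modeled as a flat multiplier, which is an assumption about how it applies.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
    discount: float = 0.0,
) -> float:
    """Per-request cost in USD; prices are per 1M tokens, discount is a fraction (0.5 = 50% off)."""
    base = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return base * (1.0 - discount)


def savings_percent(baseline: float, alternative: float) -> float:
    """Percentage saved by moving from the baseline cost to the alternative cost."""
    return (1.0 - alternative / baseline) * 100.0


if __name__ == "__main__":
    gpt4o = request_cost(1_000, 500, 2.50, 10.00)
    mini = request_cost(1_000, 500, 0.15, 0.60)
    batched = request_cost(1_000, 500, 2.50, 10.00, discount=0.5)
    print(f"model swap saves {savings_percent(gpt4o, mini):.0f}%")
    print(f"batching saves {savings_percent(gpt4o, batched):.0f}%")
```

Because GPT-4o mini's input and output prices are both 6% of GPT-4o's, the model-swap saving at these rates is the same for any mix of input and output tokens.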
What is the difference between input and output token pricing?
Output tokens cost more than input tokens because generating text is computationally sequential: the model must produce one token at a time, which cannot be parallelized. Input tokens are processed in parallel, requiring less compute time per token. Most models price output at 3–5× the input rate. This means applications that generate long responses (creative writing, detailed analysis, code generation) have higher per-request costs than those that classify, extract, or give short answers. Designing prompts to elicit concise responses is one of the most effective cost reduction techniques.
How does context window size affect cost?
Every token in the context window, including the system prompt, conversation history, and retrieved documents, is billed as input tokens on each request. A 10,000-token system prompt sent with 1,000 daily requests costs $25/day on GPT-4o without caching, vs. about $12.50/day with GPT-4o's 50% cached-token discount. A 90% discount, as on Claude, would cut the same prompt's cost to a tenth. Larger context windows enable more powerful applications (whole-document analysis, long conversations) but drive up costs if not managed carefully. Best practices: use the minimum context needed for the task, summarize older conversation history, and enable prompt caching for any repeated context long enough to meet your provider's minimum cacheable length (often around 1,000 tokens).
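One simple way to keep context to the minimum needed is to send only the most recent messages that fit a token budget. Note that this is plain trimming, a cruder stand-in for the summarization recommended above. The sketch below uses hypothetical names and a rough 4-characters-per-token estimate. A real application would plug in the provider's own token counter.

```python
import math
from typing import Callable


def rough_token_count(text: str) -> int:
    """Approximate tokens as characters / 4, rounded up (heuristic, not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def trim_history(
    messages: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Keep the newest messages whose combined token estimate fits within the budget.

    Walks backwards from the latest message and stops at the first one that
    would exceed the budget, so the kept messages are always contiguous.
    """
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
    print(len(trim_history(history, budget_tokens=25)), "messages kept")
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a conversation with gaps in the middle.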