Last updated: May 2026
Estimate the full monthly cost of your Retrieval-Augmented Generation pipeline — embedding, vector storage, retrieval, reranking, and LLM inference — broken down per query and per month.
Knowledge Base & Embeddings
One-time and ongoing ingestion costs
Vector Database
Query Volume & Retrieval
RAG Pipeline Cost Results
Cost breakdown by pipeline stage
| Stage | Monthly Cost | % of Total | Cost/Query |
|---|---|---|---|
Embedding cost = Total chunks × tokens/chunk × embedding price/token. Monthly re-ingestion adds a fraction of this based on your update frequency.
Vector DB cost = Based on number of vectors stored (Serverless) or compute hours (Pod/dedicated). Self-hosted eliminates API cost but adds server infrastructure cost.
Retrieval query cost = Embedding the user query (query tokens × embedding price). For rerankers, add the reranker API cost per search.
LLM inference cost = (input tokens × input price + output tokens × output price) × queries/month. This is typically the largest cost driver at scale.
Cost per query = Total monthly cost ÷ number of queries. Useful for pricing decisions if building a product on top of this RAG system.
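To make the formulas above concrete, here is a minimal Python sketch of the same cost model. Every default below (corpus size, prices, volumes) is an illustrative assumption, not a vendor quote; substitute current rates for your providers before relying on the output.

```python
def rag_monthly_cost(
    # Knowledge base & embeddings
    total_chunks: int = 100_000,
    tokens_per_chunk: int = 400,
    embed_price_per_token: float = 0.02 / 1_000_000,  # text-embedding-3-small rate
    monthly_reingest_fraction: float = 0.10,          # share of corpus re-embedded/month
    # Vector DB (serverless-style, per-query pricing)
    vector_db_price_per_query: float = 0.10 / 1_000_000,
    vector_db_storage_monthly: float = 5.0,           # assumed flat storage fee
    # Query volume & retrieval
    queries_per_month: int = 30_000,
    query_tokens: int = 100,
    reranker_price_per_search: float = 0.0,           # set > 0 if you rerank
    # LLM inference (Claude Haiku rates from the worked example below)
    context_tokens: int = 3_000,
    output_tokens: int = 500,
    llm_input_price_per_token: float = 0.80 / 1_000_000,
    llm_output_price_per_token: float = 4.00 / 1_000_000,
) -> dict:
    # Embedding cost = total chunks x tokens/chunk x price, scaled by re-ingestion.
    monthly_embed = (
        total_chunks * tokens_per_chunk * embed_price_per_token
        * monthly_reingest_fraction
    )

    # Vector DB cost = per-query fee x volume, plus storage.
    vector_db = queries_per_month * vector_db_price_per_query + vector_db_storage_monthly

    # Retrieval query cost = embedding the user query (+ optional reranker fee).
    retrieval = queries_per_month * (
        query_tokens * embed_price_per_token + reranker_price_per_search
    )

    # LLM inference = (input tokens x input price + output tokens x output price) x queries.
    llm = queries_per_month * (
        (context_tokens + query_tokens) * llm_input_price_per_token
        + output_tokens * llm_output_price_per_token
    )

    total = monthly_embed + vector_db + retrieval + llm
    return {
        "embedding": monthly_embed,
        "vector_db": vector_db,
        "retrieval": retrieval,
        "llm_inference": llm,
        "total_monthly": total,
        "cost_per_query": total / queries_per_month,
    }

if __name__ == "__main__":
    for stage, cost in rag_monthly_cost().items():
        print(f"{stage:>15}: ${cost:,.4f}")
```

With these defaults the LLM stage dominates (roughly $134 of about $140/month), matching the "largest cost driver" point above.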
⚠️ Pricing is based on publicly available API rates as of April 2026 and may change. Self-hosting costs (GPU, bandwidth, ops) are not fully modeled. Use for estimation only.
Retrieval-Augmented Generation (RAG) systems have five cost components that add up per query. This calculator breaks down each layer to find your true cost per answer.
Component breakdown:
Embedding: Converting query text to vectors (e.g., OpenAI text-embedding-3-small at $0.02/M tokens — negligible per query at 100 tokens = $0.000002).
Vector DB: Hosting cost amortized per query (Pinecone Serverless: ~$0.10/M queries; Qdrant Cloud: ~$0.08/M).
Reranking (optional): if you add a reranker on top of retrieval, it is billed per search; see the reranker note in the formulas above.
LLM inference: Largest cost — retrieved context (1,000-5,000 tokens) + query + response at your chosen model's rate.
Worked example: 1,000 daily queries, 3,000-token context per query, Claude Haiku ($0.80/$4 per M tokens in/out), 500 output tokens: daily LLM cost = 1,000 × ((3,000 × $0.00000080) + (500 × $0.000004)) = $2.40 + $2.00 = $4.40/day ($132/month). Vector DB + embedding adds roughly $0.10-0.50/day.
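The same arithmetic in a few lines of Python, reproducing the worked example:

```python
# Reproduce the worked example: 1,000 daily queries on Claude Haiku.
queries_per_day = 1_000
context_tokens = 3_000
output_tokens = 500
input_price = 0.80 / 1_000_000   # $/token in
output_price = 4.00 / 1_000_000  # $/token out

daily_llm = queries_per_day * (
    context_tokens * input_price + output_tokens * output_price
)
print(f"Daily LLM cost: ${daily_llm:.2f}")          # $4.40
print(f"Monthly (30 days): ${daily_llm * 30:.2f}")  # $132.00
```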
For most RAG applications, OpenAI's text-embedding-3-small ($0.02/M tokens) provides an excellent quality-to-cost ratio. For higher accuracy on technical or domain-specific content, text-embedding-3-large ($0.13/M) improves retrieval quality at 6.5x the cost. Cohere's embed-v3 and Voyage AI are strong alternatives with competitive pricing. The embedding model choice matters most for retrieval recall — poor embeddings mean the right context is not retrieved, degrading answer quality regardless of how good your LLM is.
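As a quick sanity check on that trade-off, here is the one-time embedding cost at both quoted rates for a hypothetical 50M-token corpus (the corpus size is an assumption for illustration):

```python
# One-time embedding cost at each rate for an assumed 50M-token corpus.
corpus_tokens = 50_000_000  # illustrative corpus size

for model, price_per_m_tokens in [
    ("text-embedding-3-small", 0.02),
    ("text-embedding-3-large", 0.13),
]:
    cost = corpus_tokens / 1_000_000 * price_per_m_tokens
    print(f"{model}: ${cost:.2f}")
# small: $1.00, large: $6.50. Even at 6.5x, the absolute cost is often
# small next to monthly LLM inference spend, so quality can win here.
```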
For most use cases: Pinecone Serverless (easiest to start, pay-per-use), Qdrant Cloud (open-source, good at scale), Weaviate (strong hybrid search), or pgvector (if you already use PostgreSQL). For high-query-volume production: Qdrant or Weaviate self-hosted reduces per-query cost to near zero once infrastructure is paid. For prototyping: Chroma (local, free) or Supabase pgvector (free tier available). Cost differences become significant only above ~100,000 queries/month.
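A rough break-even sketch for the serverless-versus-self-hosted decision; both cost figures below are placeholders to replace with your own quotes, since effective serverless per-query cost grows with index size:

```python
# Break-even sketch: fixed-cost self-hosting vs per-query serverless pricing.
self_hosted_monthly = 100.0        # assumed server/infrastructure cost
serverless_cost_per_query = 0.001  # assumed effective $/query at your index size

break_even_queries = self_hosted_monthly / serverless_cost_per_query
print(f"Self-hosting pays off above {break_even_queries:,.0f} queries/month")
# -> 100,000 queries/month with these placeholder numbers
```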
Five high-impact optimizations: (1) Use a smaller, cheaper LLM for the generation step — Claude Haiku or GPT-4o mini is 10-20x cheaper than frontier models for most RAG tasks. (2) Reduce retrieved context chunks — 3-5 chunks of 300-500 tokens is often better than 10 chunks. (3) Enable prompt caching for repeated system prompts (90% discount on cached tokens). (4) Cache frequent queries — many RAG questions repeat; cache top answers for 24-48 hours. (5) Use smaller embedding models — quality difference is minor for most use cases.
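A minimal sketch of optimization (4), assuming an in-process dict cache with exact-match keys; production systems typically use Redis and often match queries semantically rather than exactly. The `run_rag_pipeline` callable is a hypothetical stand-in for your own pipeline:

```python
import hashlib
import time

# In-process answer cache with a TTL, keyed on the normalized query.
# A 24h TTL follows the 24-48h suggestion above.
_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 24 * 3600

def _cache_key(query: str) -> str:
    # Normalize so trivial variants ("What is X?" vs "what is x?") collide.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def answer_with_cache(query: str, run_rag_pipeline) -> str:
    key = _cache_key(query)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                    # cache hit: zero marginal cost
    answer = run_rag_pipeline(query)     # cache miss: pay full pipeline cost
    _cache[key] = (time.time(), answer)
    return answer
```

Since many RAG workloads see heavily repeated questions, even a modest hit rate removes the full embedding + retrieval + inference cost for those queries.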
RAG retrieves relevant information at query time and injects it into the context — it requires no model modification and stays current with updated knowledge. Fine-tuning trains the model on your specific data, encoding knowledge into weights — it improves style and domain-specific reasoning but cannot be updated without retraining. RAG is better for current, factual, or frequently updated information. Fine-tuning is better for specific formatting requirements, tone, or behavior patterns. Most production AI applications use RAG; fine-tuning adds cost and complexity best reserved for specific performance gaps.