Last updated: May 2026
Estimate the full monthly cost of your Retrieval-Augmented Generation pipeline — embedding, vector storage, retrieval, reranking, and LLM inference — broken down per query and per month.
RAG Pipeline Cost Results
Cost breakdown by pipeline stage
| Stage | Monthly Cost | % of Total | Cost/Query |
|---|---|---|---|
Embedding cost = Total chunks × tokens/chunk × embedding price/token. Monthly re-ingestion adds a fraction of this based on your update frequency.
Vector DB cost = Charges driven by the number of vectors stored (serverless) or by compute hours (pod/dedicated). Self-hosting eliminates the API cost but adds server infrastructure cost.
Retrieval query cost = Query tokens × embedding price/token, charged once per query for embedding the user's question. If you use a reranker, add its API cost per search.
LLM inference cost = (input tokens × input price + output tokens × output price) × queries/month. This is typically the largest cost driver at scale.
Cost per query = Total monthly cost ÷ number of queries. Useful for pricing decisions if building a product on top of this RAG system.
⚠️ Pricing is based on publicly available API rates as of April 2026 and may change. Self-hosting costs (GPU, bandwidth, ops) are not fully modeled. Use for estimation only.
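The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the `RagInputs` field names, the flat monthly vector DB fee, and every number in `example` are assumptions chosen for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class RagInputs:
    """Illustrative inputs; all field names are assumptions, prices in USD."""
    total_chunks: int                 # chunks in the knowledge base
    tokens_per_chunk: int
    embed_price_per_m: float          # per 1M embedding tokens
    monthly_reingest_fraction: float  # share of the corpus re-embedded monthly
    vector_db_monthly: float          # modeled here as a flat monthly fee
    queries_per_month: int
    query_tokens: int                 # tokens in the user question
    rerank_price_per_search: float    # 0.0 when no reranker is used
    llm_input_tokens: int             # retrieved context + question + prompt
    llm_output_tokens: int
    llm_in_price_per_m: float         # per 1M input tokens
    llm_out_price_per_m: float        # per 1M output tokens


def monthly_costs(x: RagInputs) -> dict:
    """Recurring monthly cost of each pipeline stage."""
    full_ingest = x.total_chunks * x.tokens_per_chunk * x.embed_price_per_m / 1e6
    return {
        "embedding (re-ingestion)": full_ingest * x.monthly_reingest_fraction,
        "vector db": x.vector_db_monthly,
        "retrieval": x.queries_per_month
        * (x.query_tokens * x.embed_price_per_m / 1e6 + x.rerank_price_per_search),
        "llm inference": x.queries_per_month
        * (x.llm_input_tokens * x.llm_in_price_per_m
           + x.llm_output_tokens * x.llm_out_price_per_m) / 1e6,
    }


def breakdown(x: RagInputs) -> list:
    """Rows of (stage, monthly cost, share of total, cost per query).

    Assumes at least one query per month and a non-zero total.
    """
    stages = monthly_costs(x)
    total = sum(stages.values())
    return [
        (name, cost, cost / total, cost / x.queries_per_month)
        for name, cost in stages.items()
    ]


# Hypothetical workload: 100k chunks, 10k queries/month, a $70/month vector DB.
example = RagInputs(
    total_chunks=100_000, tokens_per_chunk=400, embed_price_per_m=0.02,
    monthly_reingest_fraction=0.10, vector_db_monthly=70.0,
    queries_per_month=10_000, query_tokens=100, rerank_price_per_search=0.001,
    llm_input_tokens=3_000, llm_output_tokens=500,
    llm_in_price_per_m=0.15, llm_out_price_per_m=0.60,
)

for name, cost, share, per_query in breakdown(example):
    print(f"{name:<26} ${cost:>8.2f} {share:>6.1%} ${per_query:.6f}")
```

The one-time ingestion cost (`full_ingest`) is deliberately left out of the monthly total; only the re-ingested fraction recurs.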
Retrieval-Augmented Generation (RAG) systems have five cost components that add up per query. This calculator breaks down each layer to find your true cost per answer.
Component breakdown:
Embedding: Converting query text to vectors (e.g., OpenAI text-embedding-3-small at $0.02/M tokens — negligible per query at 100 tokens = $0.000002).
Vector DB: Hosting cost amortized per query (Pinecone Serverless: ~$0.10/M queries; Qdrant Cloud: ~$0.08/M).
LLM inference: Largest cost — retrieved context (1,000-5,000 tokens) + query + response at your chosen model's rate.
Worked example: 1,000 daily queries, 3,000 token context per query, Claude Haiku ($0.80/$4 per M in/out), 500 output tokens. Daily LLM cost = 1,000 × ((3,000 × $0.00000080) + (500 × $0.000004)) = $2.40 + $2.00 = $4.40/day ($132 over a 30-day month). Vector DB + embedding adds roughly $0.10-0.50/day.
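The worked example can be checked with a short script. The function name `daily_llm_cost` is just an illustrative helper; the inputs are the ones quoted above, and the 30-day month is an assumption.

```python
def daily_llm_cost(queries_per_day, context_tokens, output_tokens,
                   in_price_per_m, out_price_per_m):
    """Return (input cost, output cost) in USD per day; prices are per 1M tokens."""
    input_cost = queries_per_day * context_tokens * in_price_per_m / 1e6
    output_cost = queries_per_day * output_tokens * out_price_per_m / 1e6
    return input_cost, output_cost


# 1,000 queries/day, 3,000 context tokens, 500 output tokens, $0.80/$4 per M.
input_cost, output_cost = daily_llm_cost(1_000, 3_000, 500, 0.80, 4.00)
daily = input_cost + output_cost

print(f"input ${input_cost:.2f} + output ${output_cost:.2f} = ${daily:.2f}/day")
print(f"${daily * 30:.2f}/month")
# prints:
# input $2.40 + output $2.00 = $4.40/day
# $132.00/month
```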
For most RAG applications, OpenAI's text-embedding-3-small ($0.02/M tokens) provides an excellent quality-to-cost ratio. For higher accuracy on technical or domain-specific content, text-embedding-3-large ($0.13/M) improves retrieval quality at 6.5x the cost. Cohere's embed-v3 and Voyage AI are strong alternatives with competitive pricing. The embedding model choice matters most for retrieval recall — poor embeddings mean the right context is not retrieved, degrading answer quality regardless of how good your LLM is.
For most use cases: Pinecone Serverless (easiest to start, pay-per-use), Qdrant Cloud (open-source, good at scale), Weaviate (strong hybrid search), or pgvector (if you already use PostgreSQL). For high-query-volume production: Qdrant or Weaviate self-hosted reduces per-query cost to near zero once infrastructure is paid. For prototyping: Chroma (local, free) or Supabase pgvector (free tier available). Cost differences become significant only above ~100,000 queries/month.
Five high-impact optimizations: (1) Use a smaller, cheaper LLM for the generation step: Claude Haiku or GPT-4o mini is 10-20x cheaper than frontier models for most RAG tasks. (2) Reduce retrieved context chunks: 3-5 chunks of 300-500 tokens is often better than 10 chunks. (3) Enable prompt caching for repeated system prompts (up to a 90% discount on cached input tokens, depending on the provider). (4) Cache frequent queries: many RAG questions repeat, so cache top answers for 24-48 hours. (5) Use smaller embedding models: the quality difference is minor for most use cases.
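Optimization (4) can be as simple as an in-memory cache with a time-to-live. The sketch below is one possible approach, not a prescribed design: `TTLAnswerCache` and `answer_fn` are hypothetical names, matching is exact after basic normalization (it does not catch paraphrases), and a production system would more likely use a shared store than a per-process dict.

```python
import time
from typing import Callable


class TTLAnswerCache:
    """Cache answers by normalized question text for a fixed time window."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (timestamp, answer)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share an entry.
        return " ".join(question.lower().split())

    def get_or_compute(self, question: str,
                       answer_fn: Callable[[str], str]) -> str:
        key = self._key(question)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        answer = answer_fn(question)  # the paid retrieval + LLM call
        self._store[key] = (now, answer)
        return answer
```

Each cache hit avoids one full retrieval and inference call, so the saving scales with the hit rate (`hits / (hits + misses)`).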
RAG retrieves relevant information at query time and injects it into the context — it requires no model modification and stays current with updated knowledge. Fine-tuning trains the model on your specific data, encoding knowledge into weights — it improves style and domain-specific reasoning but cannot be updated without retraining. RAG is better for current, factual, or frequently updated information. Fine-tuning is better for specific formatting requirements, tone, or behavior patterns. Most production AI applications use RAG; fine-tuning adds cost and complexity best reserved for specific performance gaps.
A RAG (Retrieval-Augmented Generation) pipeline combines several distinct services, each with its own cost model. Unlike a simple API call, RAG has both one-time costs (document ingestion and embedding) and recurring per-query costs (vector search + LLM inference). Understanding each component helps you optimize spending and identify where the bulk of your bill originates.
For most RAG applications at moderate query volumes (under 50,000/month), LLM inference dominates total cost by a wide margin. Vector database and embedding costs are typically secondary. At very high volumes (500,000+ queries/month), infrastructure and vector DB costs become more significant.
| Component | Tool Example | Cost Model | Typical Monthly Cost (10k queries) |
|---|---|---|---|
| Embedding model | text-embedding-3-small | $0.02/1M tokens | ~$2–$5 |
| Vector database | Pinecone Starter | Free–$70/mo | $0–$70 |
| LLM inference | GPT-4o mini | Per token | $5–$20 |
| Document ingestion | One-time embed cost | $0.02/1M tokens | $1–$10 (one-time) |
| Reranking model | Cohere Rerank | $1/1,000 searches | $10 |
| Hosting/infrastructure | AWS/GCP compute | Per compute hour | $20–$100 |
What is RAG and why does it cost money?
RAG (Retrieval-Augmented Generation) is a technique that lets an LLM answer questions about documents it was never trained on, by first retrieving the relevant text and injecting it into the prompt. It costs money because it involves multiple paid services: an embedding model to convert documents into numerical vectors, a vector database to store and search those vectors, and an LLM to generate the final answer using retrieved context. Each component is billed separately, though at moderate scale the total cost is typically very low relative to the value delivered.
What is the most expensive part of a RAG pipeline?
For most RAG applications, LLM inference is the dominant cost — often 70–90% of the total monthly bill. This is because every query requires sending the retrieved context (often 1,000–5,000 tokens) plus the question to the LLM, and output tokens are expensive. Vector database costs are typically modest (free tiers cover most early-stage apps; paid tiers run $70–$200/month for moderate scale). Embedding ingestion is a small one-time cost. The ratio shifts at very high document volumes (millions of docs) where vector storage becomes meaningful.
How do embedding costs compare to inference costs?
Embedding costs are almost always negligible compared to inference. text-embedding-3-small costs $0.02/1M tokens, so embedding a 1,000-page document (~500,000 words, ~667,000 tokens) costs about $0.013. Sending the same number of tokens to GPT-4o mini as input on each of 1,000 queries would cost $100+. The ratio is on the order of 1:10,000 for typical RAG workloads. This means optimizing your embedding strategy (model choice, re-embedding frequency) has almost no impact on your total cost; optimizing LLM model selection and context length has an enormous impact.
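The comparison can be reproduced with a few lines of arithmetic. The embedding price comes from the text; the GPT-4o mini input rate of $0.15 per 1M tokens is an assumption not stated above, so check current pricing before relying on it.

```python
EMBED_PRICE_PER_M = 0.02      # text-embedding-3-small, USD per 1M tokens
LLM_INPUT_PRICE_PER_M = 0.15  # assumed GPT-4o mini input rate, USD per 1M tokens

doc_tokens = 667_000          # ~1,000 pages, ~500,000 words
queries = 1_000

# One-time cost to embed the document.
embed_once = doc_tokens * EMBED_PRICE_PER_M / 1e6

# Cost of sending the same token count as LLM input on every query.
as_context = queries * doc_tokens * LLM_INPUT_PRICE_PER_M / 1e6

ratio = as_context / embed_once

print(f"embed once: ${embed_once:.4f}")
print(f"as LLM input x {queries}: ${as_context:.2f}")
print(f"ratio: 1:{ratio:,.0f}")
```

Under these assumed prices the ratio comes out near 1:7,500, the same order of magnitude as the figure quoted above.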
How can I reduce RAG pipeline costs?
Five proven optimizations: (1) Downgrade the LLM: GPT-4o mini is 16× cheaper than GPT-4o and performs comparably on most RAG question-answering tasks. (2) Reduce retrieved chunk count: 3 chunks of 400 tokens often give better answers than 10 chunks of 400 tokens, and use 70% fewer context tokens. (3) Enable prompt caching for system prompts. (4) Cache responses for repeated or near-identical queries (a knowledge base often sees the same questions repeatedly). (5) Compress or summarize retrieved chunks before injecting them into the prompt. Combined, these changes can reduce costs by 80% or more with little quality degradation.
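To see what the chunk-count change in (2) does to the bill, the sketch below compares 10 and 3 chunks of 400 tokens. The 200-token prompt overhead, 300 output tokens, and the $0.15/$0.60 per 1M token prices are illustrative assumptions, and `query_cost` is a hypothetical helper.

```python
def query_cost(chunks, tokens_per_chunk, overhead_tokens, output_tokens,
               in_price_per_m, out_price_per_m):
    """USD cost of one query; overhead covers the question and system prompt."""
    input_tokens = chunks * tokens_per_chunk + overhead_tokens
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1e6


# Assumed fixed parts of each request and assumed per-1M-token prices.
common = dict(tokens_per_chunk=400, overhead_tokens=200, output_tokens=300,
              in_price_per_m=0.15, out_price_per_m=0.60)

before = query_cost(chunks=10, **common)
after = query_cost(chunks=3, **common)

context_saving = 1 - (3 * 400) / (10 * 400)  # retrieved context tokens only
total_saving = 1 - after / before            # whole per-query LLM cost

print(f"context tokens saved: {context_saving:.0%}")
print(f"per-query LLM cost saved: {total_saving:.0%}")
```

The retrieved context shrinks by 70%, but the per-query saving is smaller (about 52% under these assumptions) because output tokens and prompt overhead do not change.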
What is the difference between RAG and fine-tuning cost-wise?
RAG has low upfront cost (ingestion is cheap) and recurring per-query costs (LLM inference per request). Fine-tuning has high upfront cost (training runs cost $100–$10,000+ depending on model size and dataset) and lower per-query inference costs (fine-tuned models run faster with shorter prompts since knowledge is baked in). For most use cases under 1 million queries/month, RAG is cheaper overall. Fine-tuning becomes cost-competitive at very high query volumes where the shorter prompts (no retrieved context) significantly reduce per-query token costs. RAG is also re-indexable; fine-tuning requires a new training run to incorporate new information.
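The trade-off above reduces to a break-even calculation: the upfront training cost divided by the per-query saving. The sketch below is a simplification under stated assumptions. It ignores retraining, hosting fees for the fine-tuned model, and RAG's fixed infrastructure costs, and the function name and example figures are hypothetical.

```python
def break_even_queries(training_cost, rag_cost_per_query, ft_cost_per_query):
    """Queries needed before fine-tuning's upfront cost is recovered.

    Returns None when fine-tuning is not cheaper per query, in which case
    there is no break-even point.
    """
    saving = rag_cost_per_query - ft_cost_per_query
    if saving <= 0:
        return None
    return training_cost / saving


# Hypothetical figures: $1,000 training run, $0.004 per RAG query,
# $0.002 per fine-tuned query (shorter prompt, no retrieved context).
queries = break_even_queries(1_000, 0.004, 0.002)
months = queries / 100_000  # assuming 100,000 queries per month

print(f"break-even after about {queries:,.0f} queries ({months:.1f} months)")
```

If the fine-tuned model's per-token price is high enough to cancel the shorter prompt, the function returns `None` and RAG stays cheaper at any volume.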