Last updated: May 2026
Calculate the true cost of giving an AI agent persistent memory — from one-time document embedding to monthly vector database storage and per-query retrieval costs. Compare providers and see how costs scale.
① Embedding
One-time ingestion + ongoing re-embedding costs
② Vector Database Storage
Monthly storage cost based on document count and vector dimensions
③ Runtime Queries
Daily query volume and retrieval configuration
Results
Cost breakdown
| Component | Monthly Cost | % of Total | Notes |
|---|---|---|---|
Scale Calculator
Monthly ongoing cost at different document volumes (same query volume & provider)
| Documents | Storage (GB) | Storage Cost/mo | Query Cost/mo | Re-embed Cost/mo | Total/mo |
|---|---|---|---|---|---|
One-time embedding cost = Number of documents × avg tokens/doc × embedding price per token. This is paid once on initial ingestion.
Monthly re-embedding cost = One-time cost × re-embedding multiplier (Daily = ×30, Weekly = ×4.33, Monthly = ×1, Never = ×0).
Storage size estimate = Dimensions × 4 bytes × number of documents ÷ 1,073,741,824 (bytes per GB). This is the raw vector data; actual storage including metadata and index overhead is typically 1.5–3× higher.
Monthly storage cost = Storage GB × provider's per-GB monthly rate. Free tiers cap out at specific limits; above the limit, paid rates apply.
Monthly query cost = Queries/day × 30 × provider's per-query rate. For providers with free query tiers, cost is $0 until the tier is exceeded.
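The five formulas above translate directly into a few small functions. This is a minimal sketch, not the calculator's actual source: the function names are illustrative, and the free-tier handling assumes that only usage above the free allowance is billed, which may not match every provider.

```python
BYTES_PER_GB = 1_073_741_824  # 2**30, the divisor used in the storage formula
REEMBED_MULTIPLIER = {"daily": 30, "weekly": 4.33, "monthly": 1, "never": 0}


def one_time_embedding_cost(num_docs, avg_tokens_per_doc, price_per_million_tokens):
    """Documents x avg tokens/doc x embedding price per token."""
    return num_docs * avg_tokens_per_doc * price_per_million_tokens / 1_000_000


def monthly_reembedding_cost(one_time_cost, frequency):
    """One-time cost x re-embedding multiplier."""
    return one_time_cost * REEMBED_MULTIPLIER[frequency]


def raw_storage_gb(num_docs, dimensions):
    """Raw float32 vector data only (4 bytes per dimension), no metadata or index."""
    return dimensions * 4 * num_docs / BYTES_PER_GB


def monthly_storage_cost(storage_gb, rate_per_gb, free_gb=0.0):
    # Assumption: only storage above the free allowance is billed.
    return max(0.0, storage_gb - free_gb) * rate_per_gb


def monthly_query_cost(queries_per_day, rate_per_query, free_queries_per_month=0):
    # Assumption: only queries above the free allowance are billed.
    billable = max(0, queries_per_day * 30 - free_queries_per_month)
    return billable * rate_per_query


embed = one_time_embedding_cost(1_000, 500, 0.02)
print(f"one-time embedding: ${embed:.2f}")
print(f"raw storage: {raw_storage_gb(1_000, 1536) * 1024:.2f} MB")
```

Prices are passed in as arguments rather than hard-coded, so the same functions work for any provider rate you plug in.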
⚠️ Pricing based on publicly available rates as of May 2026. Free tier limits, pricing tiers, and minimum charges may vary. Storage estimates reflect raw vector data only — metadata, indexes, and overhead typically add 50–200% to actual storage. Use for estimation only.
An AI agent with persistent memory uses a three-stage pipeline: embed documents into vectors, store those vectors in a vector database, and retrieve relevant context at query time. Each stage has its own cost model.
Stage 1 (Embedding): Each document is converted to a high-dimensional vector using an embedding model. OpenAI text-embedding-3-small costs $0.02/M tokens, so embedding 1,000 documents of 500 tokens each (500,000 tokens) costs $0.01 one-time. Gemini text-embedding-004 is free up to 1M tokens/day.
Stage 2 (Vector storage): Vectors are stored in a vector database. At 1,536 dimensions, each vector takes about 6KB, so 1,000 documents require ~6MB of raw storage. Most free tiers cover up to 1–2GB (roughly 175K–350K documents of raw vectors, fewer once index overhead is counted). Above that, expect $0.033–$0.095/GB/month.
Stage 3 (Runtime retrieval): When the agent processes a query, it embeds the query and runs a similarity search to retrieve the top-k most relevant documents. Pinecone Serverless charges $2/1M queries; Weaviate Standard charges $0.0025/1K queries. At 100 queries/day (about 3,000 queries/month), either rate comes to under $0.01/month.
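To make the retrieval stage concrete, here is a brute-force top-k search over a handful of toy vectors. It is only an illustration of the idea: the three-dimensional vectors and document IDs are made up, and managed vector databases typically use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    # Query and stored vectors must come from the same embedding model.
    if len(a) != len(b):
        raise ValueError("query and document vectors must have the same dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
memory = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], memory, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Each call to `top_k` corresponds to one billed query in the per-query pricing above; the query embedding itself is a separate, much smaller embedding charge.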
Worked example: 5,000 documents, 500 tokens each, text-embedding-3-small, Pinecone Serverless, 100 queries/day, re-embed monthly. One-time embed: $0.05. Storage: ~30MB = ~$0.001/mo. Query cost: 3,000 queries/mo × $0.000002 = $0.006/mo. Re-embed: $0.05/mo. Total monthly ongoing: ~$0.06/month. Agent memory is very cheap at small scale.
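The worked example can be reproduced in a few lines, which also yields the percentage split shown in the cost breakdown table. This sketch uses only the figures quoted above; the variable names are illustrative.

```python
docs, tokens_per_doc = 5_000, 500
embed_price_per_million = 0.02        # text-embedding-3-small, $ per 1M tokens
storage_rate_per_gb = 0.033           # Pinecone Serverless, $ per GB per month
query_rate = 2.00 / 1_000_000         # Pinecone Serverless, $ per query
queries_per_day = 100
dimensions = 1536

one_time_embed = docs * tokens_per_doc * embed_price_per_million / 1_000_000
storage_gb = dimensions * 4 * docs / 1_073_741_824  # raw vectors only

components = {
    "Storage": storage_gb * storage_rate_per_gb,
    "Queries": queries_per_day * 30 * query_rate,
    "Re-embedding": one_time_embed * 1,  # monthly re-embed multiplier = 1
}
total = sum(components.values())

print(f"One-time embed: ${one_time_embed:.2f}")
for name, cost in components.items():
    print(f"{name:<13} ${cost:.4f}  {cost / total:6.1%}")
print(f"{'Total':<13} ${total:.4f}")
```

Running it shows that monthly re-embedding accounts for most of the ongoing total in this scenario, while storage and queries together stay under one cent.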
Embedding costs are extremely low. OpenAI text-embedding-3-small costs $0.02 per million tokens — embedding 1,000 documents of 500 tokens each costs just $0.01 as a one-time cost. text-embedding-3-large is $0.13/M tokens for higher accuracy on domain-specific content. Gemini text-embedding-004 is free up to 1 million tokens per day, making it ideal for prototypes and smaller deployments. For most agent memory setups under 100,000 documents, total embedding cost is under $5 one-time. The real ongoing costs come from vector database storage and runtime queries.
For small-scale agents, free tiers are available on: Pinecone Starter (2GB / 100K queries/month), Weaviate Cloud Sandbox (90-day trial), and Qdrant Cloud Free (up to 1GB). All three cover typical development and small production workloads. For production scale, Qdrant Cloud (~$0.09/GB/month) and Weaviate Standard (~$0.095/GB/month + $0.0025/1K queries) are cost-competitive with Pinecone Serverless ($0.033/GB + $2.00/1M queries). Self-hosting (Chroma, Qdrant OSS) eliminates API costs but requires server management. The best choice depends on your query volume: when storage dominates, Pinecone Serverless has the lowest per-GB rate, while at high query volume Qdrant Cloud, which includes queries in its storage price, often wins.
Storage depends on vector dimensions and document count. Each vector requires dimensions × 4 bytes: at 1,536 dimensions (OpenAI embeddings), that's ~6KB per document. So 1,000 documents = ~6MB; 100,000 documents = ~600MB; 1,000,000 documents = ~6GB. In practice, databases add metadata and index overhead, making real storage 1.5–3× the raw vector size. Most small-to-medium agent deployments (under 100K documents) fit within free tiers. Only high-document-count enterprise deployments (1M+ chunks) require significant paid storage, typically $30–$150/month.
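The scaling figures above can be checked with a short script. This is a sketch under the same assumptions as the text: float32 vectors (4 bytes per dimension), 1,536 dimensions by default, and a 1.5–3× overhead range for metadata and indexes. The function name is illustrative.

```python
BYTES_PER_GB = 1_073_741_824  # 2**30


def storage_estimate_gb(num_docs, dimensions=1536, overhead=(1.5, 3.0)):
    """Return (raw, low, high) storage in GB, where low/high apply the overhead range."""
    raw = num_docs * dimensions * 4 / BYTES_PER_GB
    return raw, raw * overhead[0], raw * overhead[1]


for n in (1_000, 100_000, 1_000_000):
    raw, low, high = storage_estimate_gb(n)
    print(f"{n:>9,} docs: raw {raw:.3f} GB, with overhead {low:.3f}-{high:.3f} GB")
```

Passing a different `dimensions` value shows how smaller embeddings shrink the footprint, since storage scales linearly with dimension count.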
For static knowledge bases (company FAQs, documentation), never re-embed: the initial embedding is sufficient. For dynamic content (news feeds, live databases, frequently updated wikis), monthly re-embedding balances freshness against cost. Daily re-embedding is only necessary for real-time or rapidly changing data and multiplies your monthly embedding cost by 30. A better pattern for frequently updated content is delta indexing: only embed new or changed documents, not the full corpus. This keeps re-embedding costs proportional to how much content actually changes rather than to the size of your corpus.
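One common way to implement delta indexing is to store a content hash next to each indexed document and compare hashes on every sync. The sketch below assumes you keep a mapping of document ID to hash from the previous run; `plan_delta` and the sample documents are hypothetical, and the actual embedding and upsert calls are left out.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(documents, indexed_hashes):
    """Compare current documents against previously indexed hashes.

    Returns (ids to embed, ids to delete from the vector store).
    """
    to_embed = [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete


# Hashes recorded during the previous sync (hypothetical data).
previous = {
    "faq": content_hash("Old FAQ text"),
    "pricing": content_hash("Plans start at $10"),
    "legacy": content_hash("Retired page"),
}
current = {
    "faq": "New FAQ text",
    "pricing": "Plans start at $10",
    "blog": "Launch post",
}
to_embed, to_delete = plan_delta(current, previous)
print("embed:", to_embed)
print("delete:", to_delete)
```

Only the documents in `to_embed` incur embedding charges on this sync; unchanged documents such as `pricing` are skipped entirely.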
Self-hosting (Chroma, Qdrant OSS, Weaviate OSS, or pgvector) makes sense when: (1) query volume exceeds ~500K/month where managed query costs become significant, (2) data privacy requirements prevent sending documents to third-party services, or (3) you have existing server infrastructure. The break-even vs. Pinecone Serverless is typically around 500K–1M queries/month or 20GB+ of vectors stored. Below that, managed services are usually cheaper once you factor in engineering time for self-hosted setup, maintenance, scaling, and uptime. For most agents with under 1M monthly queries, start managed and self-host only when costs justify it.
Vector databases use two primary pricing models: storage-based (charged per GB of vector data, regardless of query volume) and compute-based (charged per query or per vector operation). Most managed services combine both. Understanding which dominates your cost profile helps you choose the right provider.
| Provider | Storage Rate | Query Rate | Free Tier | Best For |
|---|---|---|---|---|
| Pinecone Serverless | $0.033/GB/mo | $2.00/1M queries | No | Variable query load |
| Pinecone Starter | Free | Free (100K/mo limit) | Yes — 2GB/100K q | Development, prototypes |
| Weaviate Cloud Sandbox | Free | Free | Yes — 90 days | Evaluation, POCs |
| Weaviate Cloud Standard | $0.095/GB/mo | $0.0025/1K queries | No | Steady query volume |
| Qdrant Cloud Free | Free | Free | Yes — 1GB | Small agents |
| Qdrant Cloud | $0.09/GB/mo | Included | No | Large corpora |
| Self-hosted | ~$0 (infra) | ~$0 (infra) | N/A | Privacy, high volume |
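To see which pricing component dominates for a given workload, the paid rows of the table can be plugged into a small comparison. This sketch uses only the rates listed above, treats Qdrant Cloud's "Included" queries as a zero per-query rate, and ignores free tiers, minimum charges, and index overhead.

```python
PROVIDERS = {
    "Pinecone Serverless": {"storage_per_gb": 0.033, "per_query": 2.00 / 1_000_000},
    "Weaviate Cloud Standard": {"storage_per_gb": 0.095, "per_query": 0.0025 / 1_000},
    "Qdrant Cloud": {"storage_per_gb": 0.09, "per_query": 0.0},  # queries "Included"
}


def monthly_cost(provider, storage_gb, queries_per_month):
    rates = PROVIDERS[provider]
    return storage_gb * rates["storage_per_gb"] + queries_per_month * rates["per_query"]


def rank_providers(storage_gb, queries_per_month):
    """Return (provider, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = {name: monthly_cost(name, storage_gb, queries_per_month) for name in PROVIDERS}
    return sorted(costs.items(), key=lambda item: item[1])


for queries in (10_000, 1_000_000):
    print(f"10 GB, {queries:,} queries/month:")
    for name, cost in rank_providers(10, queries):
        print(f"  {name:<24} ${cost:.2f}")
```

At 10 GB the ranking flips between the two query volumes: the lowest per-GB rate wins when queries are few, and the flat-rate provider wins once per-query charges add up.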
What is agent memory and why does it cost money?
Agent memory refers to giving an AI agent access to a persistent knowledge store it can query at runtime — enabling it to remember past interactions, access company documents, or retrieve relevant context beyond its context window. It costs money because it requires three paid services: an embedding model (to convert text to searchable vectors), a vector database (to store and search those vectors), and compute for retrieval queries. The good news: at small to medium scale, agent memory is very affordable — often under $5/month or even free using available free tiers.
Is agent memory the same as RAG?
Agent memory and RAG (Retrieval-Augmented Generation) share the same underlying infrastructure — both use embedding models and vector databases — but differ in purpose. RAG typically retrieves context to answer a specific question within a single conversation turn. Agent memory is broader: it can include episodic memory (past interactions), semantic memory (facts and knowledge), and procedural memory (how to perform tasks). An agent memory system may retrieve context across many turns, maintain a dynamic knowledge base that updates over time, and use more sophisticated retrieval strategies than a simple RAG pipeline.
What embedding dimensions should I use for agent memory?
For most agent memory applications, 1,536-dimensional embeddings (OpenAI text-embedding-3-small or ada-002) provide an excellent quality-to-cost ratio. Higher dimensions (3,072 from text-embedding-3-large) improve retrieval accuracy for complex, technical, or domain-specific content but double your storage requirements. Smaller dimensions (384–768 from open-source models) are sufficient for general-purpose memory and can dramatically reduce storage costs at scale. Match dimensions to your chosen embedding model — mismatched dimensions will prevent proper similarity search.
How do I reduce AI agent memory costs at scale?
Five cost-reduction strategies: (1) Use a smaller embedding model — text-embedding-3-small is 6.5× cheaper than 3-large with minimal quality difference for most tasks. (2) Use delta indexing — only re-embed new or changed documents, never the full corpus. (3) Take advantage of free tiers — Qdrant Cloud Free (1GB), Pinecone Starter (2GB/100K q), or Weaviate Sandbox cover most development and small production workloads. (4) Reduce top-k retrieval — returning 3 results instead of 10 cuts vector DB compute and context length. (5) Cache frequent query results — if the same queries repeat, cache retrieved context to avoid redundant vector DB calls.
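Strategy (5) can be as simple as an in-process cache in front of the retrieval call. The sketch below uses Python's standard `functools.lru_cache`; `search_vector_db` is a hypothetical stand-in for whatever client call your provider bills per query, and the call counter exists only to show the effect.

```python
from functools import lru_cache

CALLS = {"vector_db": 0}


def search_vector_db(query, top_k):
    """Hypothetical stand-in for a billed vector database query."""
    CALLS["vector_db"] += 1
    return tuple(f"doc-{i} for {query!r}" for i in range(top_k))


@lru_cache(maxsize=1024)
def _cached_search(normalized_query, top_k):
    return search_vector_db(normalized_query, top_k)


def retrieve(query, top_k=3):
    # Normalize case and whitespace so trivially different phrasings share a cache entry.
    return _cached_search(" ".join(query.lower().split()), top_k)


first = retrieve("What is the refund policy?")
second = retrieve("  what is the REFUND policy? ")
print("billed queries:", CALLS["vector_db"])
```

Cached results can go stale when the underlying memory is updated, so in practice you would clear the cache (for example with `_cached_search.cache_clear()`) after re-indexing, or use a cache with a time-to-live.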
Do I need a separate vector database or can I use my existing database?
Several options exist for teams that want to avoid a separate vector database: pgvector (PostgreSQL extension) enables vector similarity search in your existing Postgres instance at near-zero additional cost. Redis Stack and MongoDB Atlas also offer native vector search capabilities. SQLite with sqlite-vss works for small local agent deployments. These options reduce architectural complexity and can be cheaper at small scale. Dedicated vector databases (Pinecone, Qdrant, Weaviate) offer better performance at scale, more sophisticated indexing algorithms, and purpose-built features like multi-tenancy and hybrid search — worth the added service if you need them.