Do I need a dedicated vector database or can I use PostgreSQL?

PostgreSQL with pgvector works well for sub-1M document datasets with infrequent updates and modest query volume. Dedicated vector databases scale better for high-throughput retrieval, expose more index tuning and compression options than the HNSW and IVFFlat indexes pgvector provides, and offer features like namespacing and metadata filtering at scale. Choose pgvector for prototypes and small applications; move to Pinecone or self-hosted Qdrant above 10M vectors or 100+ queries per second.

What is hybrid search and why does it matter for RAG?

Hybrid search combines semantic similarity (vector distance) with lexical matching (keyword BM25 scores). It catches queries where exact terminology matters alongside conceptual relevance. Most modern RAG systems benefit from hybrid retrieval because embedding models sometimes miss dense named entities or precise jargon. Pinecone and Weaviate support native hybrid search; pgvector requires manual implementation alongside a full-text search index.

How do I choose between HNSW and IVF indexing?

HNSW (hierarchical navigable small world) offers faster queries with higher recall on static or slowly changing datasets and is the default in Pinecone and Chroma. IVF (inverted file) is more memory efficient and supports fast updates but has lower recall at scale. For most RAG workloads, where recall matters more than ingestion speed, prefer HNSW. For very large datasets (100M+ vectors), heavy continuous ingestion, or cost-sensitive deployments, evaluate IVF or quantization.

Can I run a vector database on my own infrastructure?

Yes. Self-hosted options include Qdrant, Weaviate, and Milvus. These offer better data privacy, no egress costs, and full control over indexing parameters. Tradeoff: you manage operational overhead, backups, scaling, and monitoring. Most teams moving to self-hosted do so after hitting Pinecone cost limits (typically 10M+ vectors with significant query volume) or facing strict data residency requirements.

What metadata filtering overhead should I expect?

Metadata filtering during vector search adds 5 to 50 percent latency depending on filter selectivity and cardinality. A filter that eliminates 99 percent of candidates before distance calculation is cheap; one that excludes only 1 percent is expensive. Design your metadata schema to support common filtering patterns. Avoid unbounded string matching on high-cardinality fields without indexed partitions.

How much does a vector database cost?

Pinecone starts around $100/month for the starter tier (up to 100K vectors) and scales to thousands monthly for production. Self-hosted options have zero per-vector fees but require infrastructure spend ($50 to 500/month for a single instance). pgvector is free within Postgres but adds compute and storage costs. Estimate total cost by adding storage cost (vector count times the per-vector rate) to query cost (query volume times the per-query rate) at your cloud region's pricing, then add 30 percent for operational overhead.

Should my RAG system always use a vector database?

No. Skip a dedicated vector database if your document corpus is small (under 10K documents), queries are infrequent, or latency is not critical. A simple in-memory cosine similarity search against all embeddings works fine for proof-of-concept and can run on a single application server. Move to a vector database when you need sub-200-millisecond latency, must handle concurrent query traffic, or exceed a few hundred thousand vectors.

Vector Databases for RAG: Pinecone, pgvector, Chroma, and How to Choose

A vector database is a purpose-built system for storing, indexing, and querying high-dimensional embeddings at scale. In a retrieval-augmented generation pipeline, it serves a single critical function: given a user query encoded as a vector, return the nearest relevant documents from your corpus in milliseconds. This piece of infrastructure sits between your embedding model and your language model, and choosing the right one shapes whether your RAG system is production-ready or a prototype that crumbles under load.

Why this matters now

By 2026, RAG is no longer experimental. Teams are moving retrieval systems into production, where latency, cost, and reliability matter. The vector database market has consolidated around a handful of proven players, each with distinct trade-offs. Pinecone dominates in ease of use and managed infrastructure. pgvector (the PostgreSQL extension) has become viable for teams already running Postgres and seeking to avoid new infrastructure. Chroma and Weaviate dominate open-source deployments. The question is no longer "should I use a vector database" but "which one fits my constraints and doesn't blow my budget."

The practical reality: most teams building RAG systems in 2026 choose between a managed service (Pinecone, Supabase with pgvector) and self-hosted open source (Qdrant, Weaviate, Milvus). The decision hinges on three variables that rarely conflict. First, dataset size (vector count and metadata cardinality). Second, query latency requirements (is 100ms acceptable or do you need sub-20ms?). Third, data residency and cost constraints. Get these three inputs right and the choice becomes obvious.

How vector databases solve the indexing problem

Without an index, vector search means computing the distance from a query embedding to every document embedding in your corpus. With 1 million vectors, that is 1 million distance calculations per query. With reasonable embedding dimensionality (768 or 1536 for modern models) and commodity hardware, a naive search takes roughly 300 to 500 milliseconds. That is unacceptable for production RAG.
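To make the cost of the unindexed case concrete, here is a minimal sketch of brute-force cosine search using NumPy. The function name, corpus size, and random data are illustrative assumptions, not part of any particular database's API; the point is that every query touches every stored vector.

```python
import numpy as np

def brute_force_top_k(query, corpus, k=10):
    """Return indices and cosine scores of the k corpus rows most similar to query."""
    # Normalize both sides so that a dot product equals cosine similarity.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm      # one similarity computation per stored vector
    top = np.argsort(-scores)[:k]          # highest similarity first
    return top, scores[top]

# Illustrative data: 5,000 random vectors, with a query placed very close to row 42.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5_000, 128)).astype(np.float32)
query = corpus[42] + 0.01 * rng.normal(size=128).astype(np.float32)

ids, sims = brute_force_top_k(query, corpus, k=5)
print(ids[0])  # row 42 is the nearest neighbor
```

The work grows linearly with corpus size, which is exactly what an index is meant to avoid.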

Vector databases solve this through indexing algorithms that organize vectors so the search can skip most candidates. The two dominant approaches are HNSW (hierarchical navigable small world) and IVF (inverted file). HNSW builds a proximity graph where each vector connects to its nearest neighbors at multiple levels, so a query reaches its neighborhood in roughly O(log N) hops. IVF partitions vectors into clusters, searches only the clusters nearest the query, then computes exact distances within those clusters. Pinecone and Chroma default to HNSW, which offers high recall and consistent performance for most RAG workloads. Milvus and Weaviate support both, with tunable trade-offs between memory, speed, and recall.
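The IVF idea can be sketched in a few lines of NumPy. This is a toy illustration under simplifying assumptions (a handful of plain k-means steps, squared L2 distance, hypothetical function names), not how any production engine implements it. The `nprobe` parameter controls how many clusters are searched, which is the memory, speed, and recall dial the paragraph above describes.

```python
import numpy as np

def _nearest(vectors, centroids):
    # Squared L2 distance from every vector to every centroid; pick the closest.
    d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def build_ivf(vectors, n_clusters=16, n_iters=10, seed=0):
    """Cluster vectors with a few k-means steps; return centroids and per-cluster id lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = _nearest(vectors, centroids)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = _nearest(vectors, centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=4):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    centroid_dist = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_dist)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    dist = ((vectors[candidates] - query) ** 2).sum(axis=1)  # exact distance inside clusters
    return candidates[np.argsort(dist)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(2_000, 32))
centroids, lists = build_ivf(vectors)
query = vectors[7]

approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=4)
exact = ivf_search(query, vectors, centroids, lists, k=5, nprobe=len(lists))
```

Probing every cluster reproduces the exact result; probing fewer trades some recall for less work per query.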

The practical impact: a well-tuned HNSW index returns the top 10 most similar documents from 10 million vectors in 5 to 15 milliseconds. The same operation with IVF might hit 8 to 12 milliseconds with lower memory overhead. Neither approach is inherently superior. HNSW suits static or slowly changing corpora where recall is paramount. IVF suits massive datasets where memory is constrained and you can tolerate a small recall trade-off in exchange for faster ingestion of new documents.

Comparing the major players

Pinecone is the most complete managed offering and is purpose-built for RAG. It handles HNSW indexing, automatic scaling, multi-region replication, and namespace isolation out of the box. Pinecone also supports hybrid search (combining vector similarity with BM25 lexical search) natively, which is essential for many RAG systems where exact terminology matters alongside semantic relevance. The cost is higher than self-hosted alternatives: roughly $0.10 per 1 million vectors per month for storage, plus query costs that can range from $0.001 to $0.01 per thousand queries depending on tier. For a production system with 5 million vectors and 10 million queries per month, expect $100 to 300 base cost plus overages.

pgvector is the PostgreSQL extension that brought vector search to a database ecosystem already running on millions of servers. Setup is trivial if you have a Postgres instance: create a vector column, add an HNSW or IVFFlat index, and write SQL. Recall is respectable for small to medium corpora (under 5 million vectors) and you inherit all of Postgres's transaction semantics, backup infrastructure, and monitoring. The catch: without an index, pgvector computes the exact distance to every row at query time (trading latency for guaranteed correctness), and its approximate indexes offer fewer tuning and compression options than dedicated engines. For datasets above 10 million vectors, query latency can creep past 500 milliseconds even on powerful hardware. pgvector shines for teams already managing Postgres, for whom the marginal cost of vector search is almost zero.
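As a rough sketch of what that setup looks like, the statements below use pgvector's documented `vector` column type, `ivfflat` index method, and `<=>` cosine-distance operator. The table name `documents`, its columns, the 1536 dimension, and the `lists = 100` setting are assumptions chosen for illustration. The SQL is held in Python strings so it can be passed to whichever Postgres driver you already use.

```python
# SQL for a hypothetical "documents" table; adjust names and dimensions to your schema.
SETUP_SQL = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    "CREATE TABLE documents (id bigserial PRIMARY KEY, content text, embedding vector(1536));",
    # IVFFlat index for cosine distance; pgvector also offers USING hnsw.
    "CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);",
]

# <=> is pgvector's cosine distance operator; smaller means more similar.
QUERY_SQL = "SELECT id, content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;"

def to_pgvector_literal(values):
    """Format a sequence of numbers as the text form pgvector accepts, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Usage with a driver such as psycopg would look roughly like this (not executed here):
#   cur.execute(QUERY_SQL, (to_pgvector_literal(query_embedding),))
print(to_pgvector_literal([0.1, 2, -3.5]))  # [0.1,2.0,-3.5]
```

An IVFFlat index is best created after the table has data, since its clusters are derived from the rows present at build time.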

Chroma is lightweight and embeds easily into Python applications. It runs in-memory or can persist to SQLite or DuckDB, making it ideal for prototypes, local development, and small deployments (under 100K vectors). Chroma uses HNSW under the hood and has minimal operational overhead. The limitation is scale: there is no clustering, no multi-replica failover, and no managed hosting option. Chroma is not a replacement for production systems but a stepping stone toward them.

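Weaviate is an open-source vector database that can be self-hosted and supports native hybrid search and metadata filtering. It also offers modules (integrations with language models for re-ranking or generation) baked into the query pipeline, which can save latency in end-to-end RAG systems.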

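Qdrant is a self-hosted open-source option that runs as a single instance or a cluster, supports quantization to reduce storage, and handles evolving metadata schemas flexibly.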

Hybrid search and metadata filtering in practice

Pure vector search has a blind spot: it relies entirely on the quality of your embedding model and the semantic similarity between query and documents. A query like "What is GDPR Article 5?" can fail when the embedding captures the general topic but not the exact acronym and article number. A vector database that only optimizes for cosine similarity may rank a general document about privacy principles higher than one containing the exact phrase "GDPR Article 5", even though the latter is what the user wanted.

Hybrid search solves this by combining semantic search with keyword (BM25) scores. The query is encoded as a vector and also tokenized for lexical matching. Results are ranked by a weighted blend of both signals. Pinecone's hybrid search implementation assigns adjustable weights to vector and lexical components. The practical effect: adding hybrid search boosts recall for named-entity-heavy domains (legal, medical, finance) by 15 to 30 percent, at the cost of marginal latency (add 5 to 10 milliseconds per query).
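The weighted blend can be sketched as follows. This is one common way to fuse the two signals (min-max normalization followed by a linear combination), offered as an assumption rather than a description of how Pinecone or any other product computes its ranking. The document names and scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Rank documents by alpha * semantic score + (1 - alpha) * lexical score."""
    v = min_max(vector_scores)
    kw = min_max(keyword_scores)
    docs = set(v) | set(kw)
    blended = {d: alpha * v.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical scores for the query "What is GDPR Article 5?"
vector_scores = {"privacy_principles": 0.82, "gdpr_article_5": 0.74, "cookie_policy": 0.40}
keyword_scores = {"gdpr_article_5": 12.0, "cookie_policy": 3.0, "privacy_principles": 1.0}

print(hybrid_rank(vector_scores, keyword_scores, alpha=1.0)[0])  # privacy_principles
print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5)[0])  # gdpr_article_5
```

With `alpha=1.0` the ranking is purely semantic and the general document wins; giving the lexical signal equal weight promotes the document containing the exact term.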

Metadata filtering is equally important. In a RAG system, documents often have associated metadata: creation date, author, source URL, topic tags. A user query might include a filter: "Show me results from the last 30 days" or "Restrict to documents tagged 'urgent'". Efficient metadata filtering pushes the filter down into the index itself so that distance calculations only run on candidate vectors, not the entire corpus. Pinecone and Weaviate handle this natively. With pgvector, you combine vector search with SQL predicates, which works but can be slow, or return fewer rows than requested, when the filter is highly selective, because an approximate index is scanned before the predicate is applied.
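A minimal sketch of the pre-filtering idea, assuming metadata is held as a list of dictionaries alongside a NumPy array of embeddings (both hypothetical structures for illustration): the predicate is applied first, and similarity is computed only for the rows that survive.

```python
import numpy as np

def filtered_top_k(query, vectors, metadata, predicate, k=3):
    """Apply the metadata predicate first, then rank only the surviving vectors."""
    keep = np.array([i for i, m in enumerate(metadata) if predicate(m)], dtype=int)
    if keep.size == 0:
        return []
    subset = vectors[keep]
    sims = (subset @ query) / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return keep[order].tolist()  # map back to original row ids

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
metadata = [{"tag": "urgent"}, {"tag": "normal"}, {"tag": "urgent"}, {"tag": "urgent"}]
query = np.array([1.0, 0.0])

print(filtered_top_k(query, vectors, metadata, lambda m: m["tag"] == "urgent", k=2))  # [0, 3]
```

Row 1 is the second-closest vector overall, but it never enters the distance calculation because its tag fails the filter.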

The engineering decision: invest in metadata filtering from day one. Identify the fields users will filter on (date ranges, categorical tags, source, ownership) and index them. Avoid unbounded string matching. A filter like "source contains 'customer'" is expensive; "source = 'customer-feedback'" is fast. If your schema evolves to add new metadata frequently, Weaviate and Qdrant are more flexible than pgvector, which requires schema migrations.

Sizing and cost projections

The cost of a vector database is a function of three inputs: corpus size (total vector count), query volume, and metadata complexity. A rough framework:

Under 1 million vectors, 1000 queries per day: pgvector or Chroma. Cost is infrastructure only, roughly $20 to 100 per month if co-hosted with your app database.
1 to 10 million vectors, 100K queries per day: Pinecone starter or mid-tier, or self-hosted Qdrant on a single instance. Pinecone runs $200 to 1000 per month. Self-hosted costs $100 to 300 per month in compute.
10 to 100 million vectors, 1 million+ queries per day: Pinecone professional tier with multi-region replication, or self-hosted Qdrant cluster. Pinecone can exceed $5000 per month; self-hosted infrastructure scales with your machines but operational overhead grows.
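
The framework above can be written down as a small lookup, which is handy for sanity-checking a sizing estimate. The function name and the exact boundary handling are assumptions; the thresholds are the ones listed in the three tiers.

```python
def suggest_tier(vector_count, queries_per_day):
    """Map corpus size and daily query volume onto the three tiers described above."""
    if vector_count < 1_000_000 and queries_per_day <= 1_000:
        return "pgvector or Chroma"
    if vector_count <= 10_000_000 and queries_per_day <= 100_000:
        return "Pinecone starter or mid-tier, or single-instance Qdrant"
    return "Pinecone professional tier, or self-hosted Qdrant cluster"

print(suggest_tier(200_000, 500))           # pgvector or Chroma
print(suggest_tier(5_000_000, 50_000))      # Pinecone starter or mid-tier, or single-instance Qdrant
print(suggest_tier(40_000_000, 2_000_000))  # Pinecone professional tier, or self-hosted Qdrant cluster
```

Whichever input is larger drives the result: a small corpus with heavy query traffic still lands in a higher tier.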

A critical hidden cost: embedding generation. Encoding all documents in your corpus (and re-encoding them when you refresh the dataset) consumes API calls or compute time. A corpus of 1 million documents at 500 tokens each is 500 million tokens; embedded via OpenAI's text-embedding-3-small API at roughly $0.02 per million tokens, that costs about $10. If you refresh monthly, that is about $120 per year just for embeddings, and larger embedding models cost several times more per token. Self-hosted embedding models (like Sentence Transformers) cost nothing in API fees but require GPU compute, which is not free either. Factor this into your decision. Teams often underestimate embedding costs relative to vector database costs.
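The arithmetic is simple enough to keep in a helper so it can be rerun whenever prices or corpus size change. This is a sketch; the per-token rates in the example loop are illustrative placeholders, not quoted prices, and should be replaced with the current rate for the model you use.

```python
def embedding_cost_usd(num_docs, tokens_per_doc, usd_per_million_tokens, refreshes_per_year=1):
    """Return (cost of one full embedding pass, cost per year at the given refresh rate)."""
    total_tokens = num_docs * tokens_per_doc
    one_pass = total_tokens / 1_000_000 * usd_per_million_tokens
    return one_pass, one_pass * refreshes_per_year

# 1 million documents at 500 tokens each, refreshed monthly, at two placeholder rates.
for rate in (0.02, 0.10):
    one_pass, yearly = embedding_cost_usd(1_000_000, 500, rate, refreshes_per_year=12)
    print(f"${rate:.2f} per 1M tokens: ${one_pass:.2f} per pass, ${yearly:.2f} per year")
```

Because cost is linear in every input, doubling document length or refresh frequency doubles the bill.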

When to skip a dedicated vector database

Not every RAG system needs a vector database. If your corpus is small (fewer than 10,000 documents), query latency is not critical, and you control the execution environment (a single server or local application), in-memory vector search is simpler and cheaper. Load all embeddings into memory, compute cosine similarity on each query, and return the top K results. This works for document sets under 1 gigabyte of embeddings and scales to roughly 100K vectors on a modest instance.

The in-memory approach is common in prototypes, internal tools, and systems where users accept 500 millisecond or higher query latency. Once you require sub-200-millisecond latency, need to handle concurrent query traffic, or exceed a few hundred thousand vectors, a dedicated database becomes necessary.

Another valid pattern: use Postgres full-text search exclusively, without vectors at all. This works for corpora where keyword search is sufficient (support documentation, knowledge bases with consistent terminology). Vector search adds power but also complexity and cost. Evaluate whether the additional retrieval quality justifies that cost before defaulting to embeddings.

Common pitfalls and limitations

Stale embeddings are the silent failure mode of RAG systems. Teams generate embeddings once, load them into a vector database, and forget that the embedding model can change or the documents themselves may have drifted. If you swap embedding models (say, from text-embedding-3-small to a larger open-source model), old vectors are incompatible. If documents are updated but embeddings are not, retrieval becomes stale. Plan for periodic re-embedding and version-control your embedding model choice.
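One lightweight way to make staleness detectable is to store the embedding model name and a hash of the embedded text next to each vector. The record layout and function names below are hypothetical; the same two fields could live in whatever metadata your vector database supports.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class StoredEmbedding:
    doc_id: str
    model: str         # embedding model name recorded at ingestion time
    content_hash: str  # hash of the exact text that was embedded

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(record, current_text, current_model):
    """True if the model changed or the document text drifted since ingestion."""
    return record.model != current_model or record.content_hash != content_hash(current_text)

record = StoredEmbedding("doc-1", "text-embedding-3-small", content_hash("Refund policy v1"))

print(needs_reembedding(record, "Refund policy v1", "text-embedding-3-small"))  # False
print(needs_reembedding(record, "Refund policy v2", "text-embedding-3-small"))  # True
print(needs_reembedding(record, "Refund policy v1", "some-other-model"))        # True
```

A periodic job can scan for records where this check returns true and queue only those documents for re-embedding.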

Storage cost grows quickly with dimensionality and scale. Embedding models produce 768 to 1536 dimensional vectors. At 100 million vectors, the raw float32 data alone consumes roughly 300 to 600 gigabytes, before index overhead and replication. Quantization reduces precision: going from float32 to int8 saves 75 percent of storage, and binary quantization saves more, but both lose some recall. Product quantization and other compression techniques help but require careful tuning. If storage cost becomes your primary constraint, evaluate Qdrant's quantization or switch to sparse embeddings (which are higher dimensional but sparse, reducing actual storage).
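The storage arithmetic and the int8 trade-off can be sketched with NumPy. This is a simple symmetric scalar quantizer with a single global scale, chosen for clarity; it is an assumption for illustration and not the scheme any particular database uses. The sizes are for the raw vector payload only, before index overhead and replication.

```python
import numpy as np

def raw_storage_gb(n_vectors, dims, bytes_per_value=4):
    """Size of the raw vector payload in gigabytes (float32 by default)."""
    return n_vectors * dims * bytes_per_value / 1e9

def quantize_int8(vectors):
    """Map float32 values onto the int8 range using one global scale factor."""
    scale = float(np.abs(vectors).max()) / 127.0 or 1.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

print(raw_storage_gb(100_000_000, 1536))     # raw float32 payload at 1536 dims
print(raw_storage_gb(100_000_000, 1536, 1))  # same corpus stored as int8

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 64)).astype(np.float32)
q, scale = quantize_int8(vectors)
max_error = float(np.max(np.abs(dequantize(q, scale) - vectors)))
```

Each value is off by at most half a quantization step, which is the source of the recall loss: near-ties between neighbors can flip after rounding.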

Metadata cardinality explosion can tank performance. If every document has unique metadata (e.g., a fine-grained user ID for every result), metadata filtering becomes slow because there is no selectivity benefit. Design metadata schemas conservatively. Use categorical tags over free-text fields. Partition high-cardinality dimensions (like user IDs) outside the vector database if possible.

Embedding model limitations directly limit retrieval quality. No vector database can overcome a poor embedding model. If your embedding model is task-specific (trained on legal documents, for example), it will excel for legal RAG but fail for general-domain queries. Evaluate embedding quality empirically with a small test set before committing to a large ingestion. The vector database is not the bottleneck; the embeddings are.
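A small labeled test set and a recall-at-k measurement are usually enough for this empirical check. The sketch below assumes you have, for each test query, the ranked ids your retriever returned and a hand-labeled set of relevant ids; the data shown is made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Mean fraction of each query's relevant documents found in its top-k results.

    retrieved: {query: ranked list of doc ids}
    relevant:  {query: set of doc ids judged relevant}
    """
    scores = []
    for query, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labeled answers
        top = set(retrieved.get(query, [])[:k])
        scores.append(len(top & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0

retrieved = {"q1": ["d3", "d1", "d9"], "q2": ["d4", "d5"]}
relevant = {"q1": {"d1", "d2"}, "q2": {"d4"}}

print(recall_at_k(retrieved, relevant, k=3))  # 0.75
print(recall_at_k(retrieved, relevant, k=1))  # 0.5
```

Running the same test set against two candidate embedding models gives a direct comparison before any large ingestion.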

The path forward

Start with a clear assessment of your corpus size, latency requirements, and data residency constraints. For prototypes under 1 million vectors, use Chroma or pgvector and avoid infrastructure overhead. For production systems at scale, choose Pinecone if budget is available and operational simplicity is paramount. Choose self-hosted (Qdrant, Weaviate) if cost control or data privacy is non-negotiable and your team can manage Kubernetes or similar. Implement hybrid search from the start if your domain has significant named-entity density. Design metadata filtering into your document schema early; retrofitting it is painful. Most importantly, build a pipeline for periodic re-embedding and monitor retrieval quality empirically, not by assumption. The vector database is one component of RAG. The architecture around it, the embedding model, and the retrieval evaluation loop determine success.