Vector Databases: A Comprehensive Guide to Pinecone, Weaviate, Qdrant, Milvus & Chroma
Deep dive into vector database architecture, indexing algorithms, and production considerations. Comprehensive comparison of Pinecone vs Weaviate vs Qdrant vs Milvus vs Chroma with benchmarks, pricing, and use case recommendations for 2025.
Vector Databases: A Comprehensive Guide
Vector databases have become the backbone of modern AI applications. Every RAG system, semantic search engine, and recommendation system relies on efficiently storing and querying high-dimensional embeddings. Yet choosing the right vector database remains confusing—each option makes different tradeoffs between performance, scalability, features, and operational complexity.
This guide provides a comprehensive comparison of the major vector databases in 2025: Pinecone, Weaviate, Qdrant, Milvus, and Chroma. We'll cover architecture fundamentals, indexing algorithms, performance characteristics, and practical guidance for selecting the right database for your use case.
Why Vector Databases Matter
Traditional databases excel at exact matching—finding rows where user_id = 123 or price < 100. But AI applications need similarity matching: finding items that are "like" a query, even if they share no exact attributes. This is what vector databases provide.
The Embedding Revolution
Modern AI represents everything as vectors. Text becomes 1536-dimensional embeddings via models like text-embedding-3. Images become 512-dimensional vectors via CLIP. Audio, code, molecules—everything can be embedded into a continuous vector space where similar items cluster together.
The magic happens when you query this space. Instead of asking "find documents containing 'machine learning'," you ask "find documents semantically similar to this query about ML optimization techniques." The vector database returns results ranked by cosine similarity or Euclidean distance, capturing meaning rather than keywords.
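To make this concrete, here is a minimal pure-Python sketch of cosine-similarity ranking over toy 3-dimensional "embeddings." The document names and vector values are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]  # embedding of "ML optimization techniques"
docs = {
    "ml_optimization": [0.9, 0.1, 0.8],   # semantically close to the query
    "cooking_recipes": [0.1, 1.0, 0.0],   # unrelated topic
}

# Rank documents by similarity to the query, best first
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

A vector database does exactly this ranking, but over millions of vectors with an index that avoids scoring every candidate.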
Why Not Just Use PostgreSQL?
PostgreSQL with pgvector can store and query vectors. For small datasets (under a few hundred thousand vectors), this works fine. But vector search has fundamentally different performance characteristics than traditional queries:
Scale: Production RAG systems often have millions or billions of vectors. Naive brute-force search (comparing the query to every vector) becomes prohibitively slow.
Latency requirements: Users expect sub-100ms search latency. At scale, this requires specialized indexing algorithms that trade perfect accuracy for speed.
Filtering complexity: Real queries combine vector similarity with metadata filters ("find similar documents from the last month"). This intersection of vector and traditional search requires careful optimization.
Purpose-built vector databases address these challenges with specialized storage engines, query planners, and index structures optimized for high-dimensional similarity search.
Indexing Algorithms: The Foundation
Understanding indexing algorithms is essential for making informed database choices. Each algorithm makes different tradeoffs between build time, query speed, memory usage, and recall accuracy.
HNSW: The Industry Standard
Hierarchical Navigable Small World (HNSW) has become the dominant indexing algorithm for vector search. It constructs a multi-layer graph where each layer provides increasingly fine-grained navigation toward similar vectors.
How it works: HNSW builds a hierarchy of graphs. The top layer contains a sparse subset of vectors, enabling quick navigation to the approximate region of interest. Lower layers add more vectors, refining the search. The algorithm "hops" through layers, using each layer's graph structure to navigate toward the query vector efficiently.
Strengths: HNSW provides excellent query performance with high recall (typically 95%+ with proper tuning). It handles high-dimensional vectors well and supports incremental updates without full index rebuilds.
Weaknesses: Memory intensive—the graph structure requires significant RAM. Build time can be slow for very large datasets. Performance degrades when combined with restrictive metadata filters.
Used by: Pinecone, Weaviate, Qdrant, Milvus, and most modern vector databases.
IVF: The Memory-Efficient Alternative
Inverted File (IVF) indexes partition the vector space into clusters, then search only the clusters most likely to contain relevant results.
How it works: During indexing, vectors are assigned to clusters based on their proximity to cluster centroids. During search, the query is compared to centroids to identify promising clusters, then only vectors within those clusters are searched.
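The two-phase search described above can be sketched in a few lines of pure Python. The centroids and cluster contents here are hand-picked toy data; a real system learns centroids with k-means and stores each cluster as an inverted list.

```python
import math

def ivf_search(query, centroids, clusters, nprobe=1, k=1):
    # Phase 1: compare the query to every centroid; keep the nprobe closest clusters.
    probed = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    # Phase 2: brute-force search only inside the probed clusters.
    candidates = [(math.dist(query, vec), vid)
                  for i in probed for vid, vec in clusters[i]]
    return [vid for _, vid in sorted(candidates)[:k]]

# Two toy clusters, centered near (0, 0) and (10, 10)
centroids = [[0.0, 0.0], [10.0, 10.0]]
clusters = {
    0: [("a", [0.5, 0.2]), ("b", [1.0, 1.0])],
    1: [("c", [10.0, 9.0]), ("d", [11.0, 11.0])],
}
```

Raising `nprobe` searches more clusters, trading speed for recall; this is the tuning knob mentioned under weaknesses below.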
Strengths: More memory-efficient than HNSW. Build time scales better for very large datasets. Works well with quantization techniques for further memory reduction.
Weaknesses: Lower recall than HNSW at equivalent query speeds. Requires tuning the number of clusters and probes. Less effective for datasets that don't cluster naturally.
Variants: IVF_FLAT (no compression within clusters), IVF_PQ (product quantization for compression), IVF_SQ (scalar quantization).
DiskANN: For Billion-Scale Datasets
When datasets exceed available RAM, disk-based indexes become necessary. DiskANN (developed by Microsoft) enables graph-based search with most data on SSD.
How it works: DiskANN builds a graph similar to HNSW but optimized for disk access patterns. It minimizes random I/O by clustering graph neighbors and prefetching data strategically.
Strengths: Enables billion-vector search on commodity hardware. Query performance remains reasonable despite disk access. Dramatically reduces infrastructure costs for large datasets.
Weaknesses: Higher latency than in-memory indexes. Requires SSD (not HDD). More complex deployment and tuning.
Used by: Milvus, which supports DiskANN natively.
Flat Index: The Baseline
Flat indexes perform brute-force search, comparing the query to every vector. This provides perfect recall but scales poorly.
When to use: Small datasets (under 10,000 vectors), accuracy-critical applications where you can accept higher latency, or as a baseline for benchmarking approximate algorithms.
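A flat index is simple enough to sketch in a few lines; this exhaustive scan is exactly what makes it both a perfect-recall baseline and O(N x d) per query.

```python
import math

def flat_search(query, vectors, k=3):
    # Compare the query against every stored vector -- perfect recall,
    # but cost grows linearly with the number of vectors.
    scored = sorted(vectors.items(), key=lambda item: math.dist(query, item[1]))
    return [vid for vid, _ in scored[:k]]
```

When benchmarking an approximate index, the output of `flat_search` serves as the ground truth against which recall@k is measured.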
GPU-Accelerated Indexes
GPU acceleration has transformed vector search performance, particularly for large-scale deployments where latency and throughput are critical.
NVIDIA cuVS Integration: Since FAISS v1.10.0, NVIDIA cuVS provides enhanced versions of IVF-PQ, IVF-Flat, Flat (brute-force), and CAGRA—a high-performance graph-based index built from the ground up for GPUs.
GPU_CAGRA: CUDA ANN Graph (CAGRA) outperforms CPU HNSW build times by up to 12.3x and reduces search latency by as much as 4.7x. CAGRA is ideal for scenarios demanding maximum performance, though it consumes approximately 1.8x the memory of the original vector data.
GPU_IVF_PQ: For inverted file indexing, cuVS outperforms classical GPU-accelerated IVF build times by up to 4.7x, with search latency reduced by as much as 8.1x. It utilizes a smaller memory footprint depending on compression settings—ideal for memory-constrained GPU environments.
GPU_IVF_FLAT: Serves as a balanced option, offering a compromise between performance and memory usage, requiring memory equal to the size of the original data.
Real-World Benchmarks: In large-scale tests, Milvus with 8 DGX H100 GPUs built an index for 635M 1024-dimensional vectors in approximately 56 minutes. The equivalent CPU task would take approximately 6.22 days—a 160x speedup.
GPU-accelerated Milvus offers a 21x speedup compared to its CPU counterpart, though GPU operational costs are higher. The tradeoff makes sense for latency-critical applications or batch processing where throughput matters more than per-query cost.
| GPU Index | Memory Usage | Speed | Best For |
|---|---|---|---|
| GPU_CAGRA | 1.8x original | Fastest | Maximum performance |
| GPU_IVF_FLAT | 1x original | Fast | Balanced workloads |
| GPU_IVF_PQ | Configurable (lowest) | Moderate | Memory-constrained |
Vector Quantization
Quantization compresses vectors to reduce memory footprint and improve search speed, trading some accuracy for dramatic efficiency gains.
Scalar Quantization
Scalar quantization compresses float values into narrower data types. This converts floating-point numbers (e.g., FP32) to simpler formats like INT8, shrinking data by mapping values to a smaller range.
Compression ratio: 4x reduction (32-bit → 8-bit)
Accuracy impact: Minimal—typically 1-2% recall loss with rescoring
Performance: Moderate speedup from reduced memory bandwidth
Best for: General-purpose compression where accuracy matters. Works well with most embedding models.
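A minimal sketch of min/max scalar quantization to INT8 in pure Python. Production systems vectorize this and often compute ranges per dimension or per segment; here one range covers the whole vector for brevity.

```python
def quantize_int8(vec):
    # Map each float into [-128, 127] using the vector's min/max range.
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0   # guard against a constant vector
    codes = [round((x - lo) / scale) - 128 for x in vec]
    return codes, lo, scale

def dequantize_int8(codes, lo, scale):
    # Invert the mapping; error is bounded by the quantization step (scale).
    return [(c + 128) * scale + lo for c in codes]
```

Each component shrinks from 4 bytes to 1, and the reconstruction error stays within one quantization step, which is why recall loss is typically small.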
Binary Quantization
Binary quantization represents each vector component as a single bit, providing the most aggressive compression available.
Compression ratio: 32x reduction (32-bit float → 1-bit)
Speed improvement: Up to 40x speedup compared to original vectors. Binary representation enables highly optimized CPU instructions (XOR and Popcount) for fast distance computations.
Accuracy impact: Higher than scalar quantization—requires rescoring with original vectors for acceptable recall.
Best for: High-dimensional embeddings (1024+ dimensions) where the representational loss is minimized. Recommended for models with at least 1024 dimensions.
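The XOR-and-popcount trick is easy to demonstrate in pure Python. This sketch binarizes by sign-thresholding at zero, a common but not universal rule; some systems threshold at the per-dimension mean instead.

```python
def binarize(vec):
    # One bit per component: 1 if positive, else 0, packed into an int.
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    # XOR flags differing bits; counting them is the popcount step.
    # (Python 3.10+ offers the faster int.bit_count().)
    return bin(a ^ b).count("1")
```

Hamming distance on packed bits stands in for Euclidean or cosine distance, which is why a single machine-word comparison can replace hundreds of float operations.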
Product Quantization (PQ)
Product quantization divides vectors into sub-vectors and quantizes each independently, achieving higher compression than scalar quantization.
Compression ratio: Up to 64x (configurable based on codebook size)
Accuracy impact: Significant—higher loss than scalar quantization
Performance: Slower than binary quantization but better compression than scalar
Best for: Scenarios where memory footprint is the top priority and some accuracy loss is acceptable.
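A stripped-down sketch of the PQ encoding step. The codebooks below are hand-picked toy values; in practice each codebook is learned with k-means over the sub-vectors of a training sample.

```python
import math

def pq_encode(vec, codebooks):
    # Split the vector into len(codebooks) contiguous sub-vectors and
    # replace each with the index of its nearest codebook centroid.
    m = len(codebooks)
    d_sub = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d_sub:(i + 1) * d_sub]
        codes.append(min(range(len(book)), key=lambda j: math.dist(sub, book[j])))
    return codes  # one small integer per subspace instead of d_sub floats
```

With 256 centroids per codebook, each sub-vector compresses to a single byte, which is where PQ's large compression ratios come from.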
Quantization Decision Framework
| Approach | Compression | Speed Gain | Accuracy Loss | Use Case |
|---|---|---|---|---|
| None | 1x | Baseline | None | Small datasets, accuracy-critical |
| Scalar (INT8) | 4x | 2-3x | Minimal | General purpose |
| Binary | 32x | Up to 40x | Moderate | High-dimensional, latency-critical |
| Product (PQ) | Up to 64x | Moderate | Higher | Memory-constrained at scale |
Rescoring for Accuracy
Oversampled candidates are rescored using uncompressed original vectors to recover accuracy lost during quantization. The typical pattern:
1. Search the quantized index for k × oversample_factor candidates
2. Load the original vectors for those top candidates
3. Rescore with full precision
4. Return the top k results
Published tests report 73-96% memory reductions, with scalar quantization largely preserving recall on its own and binary quantization's recall recovered through rescoring.
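The oversample-and-rescore pattern can be sketched as follows. `approx_search` stands in for any quantized index; the only real requirement is that it returns more candidates than you ultimately need.

```python
import math

def search_with_rescoring(query, approx_search, originals, k=5, oversample=4):
    # Step 1: pull k * oversample candidates from the quantized index.
    candidate_ids = approx_search(query, k * oversample)
    # Steps 2-3: rescore the candidates against the original FP32 vectors.
    rescored = sorted(candidate_ids, key=lambda i: math.dist(query, originals[i]))
    # Step 4: return the top k after full-precision reranking.
    return rescored[:k]
```

Because only `k * oversample` original vectors are touched per query, the full-precision data can live on slower storage without dominating latency.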
Database Comparison
Pinecone: Managed Simplicity
Pinecone pioneered the managed vector database category. It's designed for teams who want reliability and performance without operational complexity.
Architecture: Fully managed, serverless. You don't provision or manage infrastructure—Pinecone handles scaling, replication, and maintenance automatically. Data is organized into "indexes" (similar to tables) within "projects."
Strengths:
- Operational simplicity: No infrastructure to manage. Automatic scaling handles traffic spikes.
- Query performance: Optimized for low-latency search with excellent p95 latencies.
- Enterprise features: Multi-region deployment, SOC2 compliance, SSO integration.
- Hybrid search: Native support for combining dense vectors with sparse vectors (BM25-style).
Weaknesses:
- Vendor lock-in: Proprietary system with no self-hosting option.
- Cost at scale: Can become expensive for very large datasets or high query volumes.
- Limited customization: Fewer indexing options than open-source alternatives.
Best for: Teams prioritizing time-to-market and operational simplicity. Startups, enterprises without dedicated infrastructure teams, and production applications where reliability matters more than cost optimization.
Pricing: Serverless pricing based on storage and queries. Free tier available with limited capacity.
Weaviate: Knowledge Graph Integration
Weaviate combines vector search with knowledge graph capabilities, enabling queries that understand relationships between entities.
Architecture: Open-source with managed cloud option. Uses HNSW for vector indexing. Unique "schema" system defines object types and their relationships. GraphQL API for complex queries.
Strengths:
- Hybrid search: Excellent BM25 + vector fusion with configurable weights.
- Knowledge graph: Native support for entity relationships and graph traversal.
- Multimodal: Built-in support for text, images, and other modalities.
- GraphQL API: Powerful query language for complex retrieval patterns.
Weaknesses:
- Resource intensive: Teams report higher memory and compute requirements than alternatives at scale. It runs efficiently below roughly 50 million vectors; beyond that, careful capacity planning is required.
- Complexity: More concepts to learn than simpler alternatives.
- Performance: Graph features add overhead; not always top performer in pure vector benchmarks.
Best for: Applications needing semantic search with structural understanding. Knowledge management systems, complex RAG pipelines, and use cases where entity relationships matter.
Pricing: Open-source (self-hosted free). Cloud starts at $25/month after 14-day trial.
Qdrant: Performance and Flexibility
Qdrant is a Rust-based vector database emphasizing performance, filtering capabilities, and developer experience.
Architecture: Open-source with managed cloud. Written in Rust for performance and memory safety. Uses HNSW with custom optimizations. Strong emphasis on payload (metadata) filtering.
Strengths:
- Filtering performance: Payload-based filtering integrates directly into search rather than being applied as post-processing. This makes filtered queries much faster than alternatives that filter after search.
- Low overhead: Rust implementation provides excellent performance with minimal resource usage.
- Developer experience: Clean API, excellent documentation, active community.
- Free tier: 1GB of vector storage forever, no credit card required—best free tier among managed options.
Weaknesses:
- Smaller ecosystem: Fewer integrations and plugins than more established alternatives.
- Scale limitations: Less proven at billion-vector scale compared to Milvus.
Best for: Applications requiring vector similarity with complex metadata filtering. Recommendation systems with many filter dimensions, document search with date/category constraints, and cost-sensitive production deployments.
Pricing: Open-source (self-hosted free). Cloud: 1GB free forever, paid plans from $25/month.
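The difference between post-filtering and integrated pre-filtering is easy to see in a toy sketch. Real engines like Qdrant apply the filter during graph traversal rather than materializing the allowed set up front, but the failure mode of post-filtering shown here is the same.

```python
import math

def post_filter_search(query, items, pred, k=2):
    # Search first, filter after: restrictive filters can leave < k results.
    top = sorted(items, key=lambda it: math.dist(query, it["vec"]))[:k]
    return [it for it in top if pred(it)]

def pre_filter_search(query, items, pred, k=2):
    # Filter first, then search only the allowed subset.
    allowed = [it for it in items if pred(it)]
    return sorted(allowed, key=lambda it: math.dist(query, it["vec"]))[:k]

items = [
    {"id": 1, "vec": [0.0, 0.0], "cat": "a"},
    {"id": 2, "vec": [0.1, 0.0], "cat": "a"},
    {"id": 3, "vec": [5.0, 5.0], "cat": "b"},
]
only_b = lambda it: it["cat"] == "b"
```

With a restrictive filter, post-filtering returns nothing (both nearest neighbors get discarded), while pre-filtering still finds the best matching item in category "b".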
Milvus: Enterprise Scale
Milvus is designed for enterprise-scale deployments with billions of vectors and demanding performance requirements.
Architecture: Open-source with Zilliz Cloud managed option. Distributed architecture supporting horizontal scaling. Separates storage and compute for independent scaling. Supports the widest range of index types.
Strengths:
- Scale: Proven at billion-vector scale with major enterprises.
- Index variety: Supports HNSW, IVF variants, DiskANN, GPU indexes, and more—more indexing strategies than any competitor.
- GPU acceleration: Native GPU support for both indexing and search.
- Distributed: True distributed architecture for horizontal scaling.
Weaknesses:
- Operational complexity: Distributed deployment requires Kubernetes expertise and careful configuration.
- Resource requirements: Heavy infrastructure footprint even for moderate workloads.
- Latency variability: Not always on par with alternatives for high-dimensional embeddings or very large vector counts.
Best for: Organizations with data engineering capabilities handling massive scale. Financial services, large e-commerce platforms, and enterprises with existing Kubernetes infrastructure.
Pricing: Open-source (self-hosted free). Zilliz Cloud: free tier up to 5GB; serverless pricing runs around $89 per million reads/writes at 1536 dimensions.
Chroma: Developer-First Simplicity
Chroma prioritizes developer experience and ease of getting started, making it popular for prototyping and smaller deployments.
Architecture: Open-source, embeddable. Can run in-process (no separate server), as a standalone server, or in the cloud. SQLite-based persistence for simplicity.
Strengths:
- Ease of use: Minimal setup—can embed directly in Python applications.
- Local development: Perfect for prototyping and local testing without infrastructure.
- LangChain integration: First-class integration with popular LLM frameworks.
- Low resource usage: Lightweight footprint for small to medium datasets.
Weaknesses:
- Scale limitations: Not designed for production scale beyond millions of vectors.
- Feature gaps: Fewer enterprise features than purpose-built databases.
- Performance: Slower than optimized alternatives at scale.
Best for: Prototyping, local development, small production deployments, and applications where simplicity matters more than scale.
Pricing: Open-source (free). Hosted Chroma available for production deployments.
Performance Benchmarks
Performance varies dramatically based on dataset size, dimensionality, filter complexity, and hardware. Published benchmarks should be interpreted carefully—vendors naturally benchmark scenarios favorable to their products.
General Observations
Pure vector search at scale: Milvus and Pinecone edge ahead in benchmarks focused purely on vector similarity at massive scale.
Filtered search: Qdrant excels when queries combine vector similarity with metadata filters, thanks to its integrated filtering approach.
Hybrid search: Weaviate's BM25 + vector fusion provides excellent results for keyword-aware semantic search.
Memory efficiency: Qdrant has particularly low overhead. Milvus requires more resources but offers more indexing options.
Benchmarking Your Workload
Published benchmarks rarely match your specific use case. Before committing to production:
- Use your actual data: Synthetic benchmarks miss characteristics of real datasets.
- Test your query patterns: If 80% of your queries include date filters, benchmark filtered queries.
- Measure what matters: P95 latency often matters more than average latency. Recall@10 matters more than recall@100.
- Test at target scale: Performance characteristics change dramatically between 100K and 100M vectors.
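P95 latency is simple to compute from raw per-query timings; a stdlib-only sketch (the `statistics.quantiles` call splits the data into 20 slices, so the 19th cut point is the 95th percentile):

```python
import statistics

def p95_ms(latencies_ms):
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=20)[18]
```

Collect timings per query during your benchmark run and report P50/P95/P99 together; a healthy P50 with a ballooning P95 usually points at filter-heavy queries or cold index pages.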
Choosing the Right Database
Decision Framework
Start with constraints:
| Constraint | Recommendation |
|---|---|
| No infrastructure team | Pinecone or managed cloud options |
| Must self-host (security/compliance) | Qdrant, Weaviate, or Milvus |
| Under 1M vectors | Any option works; optimize for features/cost |
| 1M-100M vectors | Qdrant, Weaviate, or Pinecone |
| 100M+ vectors | Milvus or Pinecone enterprise |
| Heavy filtering requirements | Qdrant |
| Knowledge graph needs | Weaviate |
| Prototyping/local dev | Chroma |
Common Patterns
Startup building RAG MVP: Start with Chroma for prototyping. Migrate to Pinecone or Qdrant Cloud for production. This path minimizes initial infrastructure complexity while providing a clear scaling path.
Enterprise with existing Kubernetes: Evaluate Milvus if you have data engineering resources and need maximum control. Consider Qdrant or Weaviate for simpler self-hosted deployment.
SaaS product with vector search: Pinecone provides the reliability and multi-tenancy features needed for customer-facing applications. The higher cost is offset by reduced operational burden.
Research/experimentation: Chroma's simplicity makes it ideal for rapid experimentation. Graduate to production databases when you've validated your approach.
Production Considerations
Capacity Planning
Vector databases have predictable resource requirements:
Memory: FP32 vectors occupy 4 × dimensions × num_vectors bytes; the HNSW graph adds link storage on top, roughly 8 bytes per link per vector (a few hundred bytes per vector at common M settings), so the graph is usually small relative to the vectors themselves. A 1536-dimensional dataset with 10M vectors needs approximately 60GB for the vectors plus a few GB for the graph; plan RAM headroom of 1.5-2x the raw vector size.
Storage: Plan for 2-3x the raw vector size to account for indexes and metadata.
Compute: Query latency scales with index complexity and filter selectivity. Benchmark with your expected query mix.
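A back-of-envelope estimator for these figures. The constants are assumptions, not vendor guidance: FP32 vectors, an HNSW graph storing roughly 2 × M neighbor ids (4 bytes each) per node, and a 2x storage multiplier for indexes and metadata; actual graph overhead varies by implementation.

```python
def estimate_bytes(num_vectors, dims, hnsw_m=16):
    # FP32 vectors: 4 bytes per dimension per vector.
    vectors = num_vectors * dims * 4
    # HNSW links: ~2*M neighbor ids per node, 4 bytes per id (assumption).
    graph = num_vectors * hnsw_m * 2 * 4
    return {
        "vectors_gb": vectors / 1e9,
        "graph_gb": graph / 1e9,
        "storage_gb": 2 * vectors / 1e9,  # 2x rule of thumb from above
    }
```

For 10M vectors at 1536 dimensions this yields about 61GB of vector data, a graph in the low single-digit GB at M=16, and roughly 123GB of provisioned storage.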
Multi-Tenancy
SaaS applications need tenant isolation:
Namespace-based: Most databases support namespaces or partitions within an index. Simple but limited isolation—a "noisy neighbor" can affect other tenants.
Index-per-tenant: Stronger isolation but higher operational overhead. Works well for enterprise customers with dedicated resources.
Metadata filtering: Filter by tenant_id on every query. Simple to implement but requires careful index design for performance.
Backup and Recovery
Vector data is often derived (re-embeddable from source documents), but re-embedding is expensive:
Regular backups: Most managed services handle this automatically. For self-hosted, schedule regular snapshots.
Point-in-time recovery: Important for applications where vector data represents significant processing investment.
Disaster recovery: Multi-region replication for critical applications.
Monitoring
Key metrics to track:
Query latency: P50, P95, P99 latencies. Alert on degradation.
Recall: If you have ground truth, periodically measure recall to detect index degradation.
Index freshness: Time between document ingestion and searchability.
Resource utilization: Memory, CPU, and disk usage trends.