What is a vector database?

The short answer

A vector database is a database that stores data as high-dimensional numerical vectors (embeddings) and retrieves them by mathematical similarity rather than exact key match. It powers semantic search, retrieval-augmented generation (RAG), and recommendation systems by running approximate nearest neighbor (ANN) search over millions to billions of embeddings produced by models like OpenAI text-embedding-3 or sentence-transformers. Leading systems include Pinecone, Weaviate, Milvus, Qdrant, Chroma, and pgvector (Postgres).

The longer answer

A vector database stores objects as fixed-length arrays of floating-point numbers — typically 384, 768, 1024, 1536, or 3072 dimensions — and indexes them so that the nearest neighbors of a query vector can be returned in milliseconds. The “vector” comes from an embedding model: a neural network that maps text, images, audio, or code into a point in a high-dimensional space where semantically similar inputs land near one another. OpenAI’stext-embedding-3-largeoutputs 3072-dimensional vectors; Cohere’sembed-v3outputs 1024; Google’stext-embedding-004outputs 768; the open-sourceall-MiniLM-L6-v2(Reimers & Gurevych, “Sentence-BERT,” arXiv:1908.10084) outputs 384.

The defining operation is approximate nearest neighbor (ANN) search. Exact nearest neighbor over a billion 768-dimensional vectors is computationally infeasible at query time, so vector databases use indexes that trade a small amount of recall for orders-of-magnitude faster lookup. The two dominant families are HNSW (Hierarchical Navigable Small World graphs, Malkov & Yashunin, arXiv:1603.09320, 2016) and IVF (inverted file with product quantization, Jégou et al., “Product Quantization for Nearest Neighbor Search,” IEEE TPAMI 2011). Facebook AI Research’s FAISS library (Johnson, Douze & Jégou, arXiv:1702.08734) is the reference implementation that most vector databases either embed directly or borrow algorithms from.

The current generation of vector databases emerged around 2022–2023 with the rise of large language models. Pinecone (founded 2019) launched a fully managed serverless tier in 2024. Weaviate (open-source, Apache 2.0) added native hybrid search combining BM25 with vector similarity. Milvus (a CNCF graduated project as of June 2024) supports billions of vectors with GPU-accelerated indexing. Qdrant (written in Rust, Apache 2.0) emphasizes payload filtering. Chroma is the lightweight in-process choice favored by LangChain and LlamaIndex prototypes. pgvector (PostgreSQL extension, MIT license) reached version 0.7.0 in April 2024 with HNSW support and is the path for teams that want vector search inside their existing Postgres without operating a second system.

The dominant production use case is retrieval-augmented generation (RAG), introduced by Lewis et al. (“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” arXiv:2005.11401, NeurIPS 2020). In RAG, a user query is embedded, the vector database returns the top-k most similar document chunks, and those chunks are passed as context to an LLM. This grounds the LLM’s response in retrieved evidence and reduces hallucination on domain-specific or recent knowledge that the model was not trained on.

Distance metrics matter and are not interchangeable. Cosine similarity is the standard for normalized text embeddings; Euclidean (L2) distance is common for image embeddings; dot product is used when embeddings are not normalized and magnitude carries signal. Using the wrong metric for the embedding model you’re querying produces ranked results that look plausible but are subtly wrong — a class of bug that survived into production at multiple vendors before being caught.

Vector databases are not a replacement for relational or document databases. They are a complement: the embedding is a pointer-by-meaning, while the source-of-truth row still lives in Postgres, S3, or a content store. Most production architectures store the canonical document in a primary store and the embedding plus a foreign-key payload in the vector index.

Key facts

HNSW indexes scale to billion-vector workloads with sub-100ms p95 latency on commodity hardware (Malkov & Yashunin, arXiv:1603.09320).
OpenAI text-embedding-3-large outputs 3072 dimensions and supports Matryoshka truncation down to 256 dimensions with controlled quality loss (OpenAI embeddings docs, January 2024 release).
FAISS, released by Meta AI in 2017, remains the most-cited ANN library and underlies indexes in many commercial vector databases (Johnson et al., arXiv:1702.08734).
Milvus graduated as a CNCF top-level project in June 2024, joining Kubernetes and Prometheus at that tier (CNCF announcement, 2024-06-25).
pgvector 0.7.0 added HNSW support inside PostgreSQL and is the default vector backend for Supabase (pgvector GitHub release, April 2024).
The Massive Text Embedding Benchmark (MTEB, Muennighoff et al., arXiv:2210.07316) is the standard leaderboard for ranking embedding models across 58 datasets and 8 task types.
Lewis et al. 2020 defined the retrieval-augmented generation (RAG) architecture that drives most current production vector-database deployments (arXiv:2005.11401).
Product quantization, which underpins IVF-PQ indexes, compresses each vector to roughly 8–32 bytes with minimal recall loss (Jégou et al., IEEE TPAMI 2011, DOI:10.1109/TPAMI.2010.57).
Cosine similarity, dot product, and L2 distance are the three standard metrics; using the wrong one for a given embedding model degrades recall silently rather than throwing an error.
ANN-Benchmarks (Aumüller, Bernhardsson & Faithfull, arXiv:1807.05614) is the canonical reproducible benchmark suite for comparing vector index implementations across recall/QPS curves.

Sources

Malkov, Y. & Yashunin, D. “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.” arXiv:1603.09320. arxiv.org/abs/1603.09320
Johnson, J., Douze, M. & Jégou, H. “Billion-scale similarity search with GPUs.” arXiv:1702.08734 (FAISS). arxiv.org/abs/1702.08734
Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv:2005.11401, NeurIPS 2020. arxiv.org/abs/2005.11401
Jégou, H., Douze, M. & Schmid, C. “Product Quantization for Nearest Neighbor Search.” IEEE TPAMI 2011. doi.org/10.1109/TPAMI.2010.57
Muennighoff, N. et al. “MTEB: Massive Text Embedding Benchmark.” arXiv:2210.07316. arxiv.org/abs/2210.07316
Aumüller, M., Bernhardsson, E. & Faithfull, A. “ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms.” arXiv:1807.05614. arxiv.org/abs/1807.05614
pgvector GitHub. github.com/pgvector/pgvector
CNCF Milvus graduation announcement. cncf.io/announcements/2024/06/25

What is a vector database?

The short answer

The longer answer

Key facts

Related questions

Sources