
Embeddings
Geometry of meaning · semantic search · retrieval pipelines · honest limits
What an embedding actually is
Historical line: from Word2Vec to sentence transformers
2013
Word2Vec (Mikolov et al., Google)
Two architectures — skip-gram and CBOW — trained on the prediction task of "given a word, predict its neighbors" (or vice versa). Produced dense word vectors, typically 100–300 dimensions. The king/queen arithmetic demo made the technique famous. Paper: "Efficient Estimation of Word Representations in Vector Space" (arxiv.org/abs/1301.3781).
2014
GloVe (Pennington, Socher, Manning, Stanford)
Global Vectors for Word Representation. Reformulated word embedding training as a matrix factorization problem on global word-word co-occurrence counts, rather than local context windows. Often comparable to Word2Vec; sometimes better on analogy tasks. Paper: "GloVe: Global Vectors for Word Representation" (nlp.stanford.edu/pubs/glove.pdf).
2016
fastText (Bojanowski et al., Facebook AI)
Extended Word2Vec with subword (character n-gram) information. Big win on morphologically rich languages and out-of-vocabulary words. Paper: "Enriching Word Vectors with Subword Information" (arxiv.org/abs/1607.04606).
2018
ELMo and BERT — contextual embeddings
ELMo (Peters et al.) and then BERT (Devlin et al., Google) replaced static word vectors with contextual ones: the same word gets a different vector depending on its sentence. BERT's [CLS] token or mean-pooled hidden states became the de facto sentence representation, though an imperfect one.
2019
Sentence-BERT (Reimers and Gurevych)
The pivotal paper. Showed that raw BERT [CLS] embeddings are bad for semantic similarity, and that fine-tuning BERT with a siamese-network objective on NLI data produces sentence embeddings that work well with cosine similarity. The sentence-transformers library (SBERT.net) became the open-source standard. Paper: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (arxiv.org/abs/1908.10084).
2022
OpenAI text-embedding-ada-002
The first widely-adopted commercial embedding API at scale. 1536 dimensions, single model for everything, cheap. Became the default for most production RAG systems through 2023.
2023–2024
The encoder renaissance
BGE (BAAI), E5 (Microsoft), GTE (Alibaba), Cohere embed-v3, Voyage AI, mxbai-embed-large, NV-Embed. All push hard on MTEB. Many open-weight models match or beat proprietary ones on retrieval benchmarks. Matryoshka embeddings (Kusupati et al., NeurIPS 2022) and quantization-aware training enter the mainstream.
2024–2026
Long context vs RAG tension
Frontier models with 1M+ token context windows force a rethink of when RAG is the right architecture. Embeddings remain dominant for very large corpora, latency-sensitive lookups, and cost-bound workloads, but the boundary keeps moving. The discipline matures from "always RAG" to "measure and decide."
The current model landscape
Best-effort snapshot as of mid-2026. Model lineups, dimensions, and leaderboard positions shift monthly — check the MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) and each provider's docs before you commit. Dimensions listed are native; many of these models support Matryoshka-style truncation to smaller sizes with a quality tradeoff.
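A minimal sketch of what Matryoshka-style truncation amounts to on the client side. It assumes the model was trained with a Matryoshka objective, so the leading dimensions carry most of the signal; on a model without that training, the same operation can hurt quality badly. The function name and the toy vector are illustrative, not any provider's API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then L2-renormalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a real embedding.
full = [3.0, 4.0, 0.0, 12.0]
short = truncate_embedding(full, 2)
print(short)
```

Renormalizing matters because cosine and dot-product scores are only comparable across vectors of the same length. Measure the quality tradeoff on your own eval set before committing to a smaller size.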
Cosine, dot product, Euclidean — which one to use
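One fact that bears on the choice: when vectors are L2-normalized, cosine similarity equals the dot product, and squared Euclidean distance equals 2 minus twice the cosine, so all three produce the same ranking. The sketch below checks that numerically on toy vectors; the helper names are illustrative. Whether a given model emits normalized vectors is something to confirm in its documentation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

# Two toy vectors, scaled to unit length.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
dist_sq = euclidean(a, b) ** 2

# On unit vectors: cosine == dot, and dist^2 == 2 - 2 * cosine.
print(cos_ab, dot_ab, dist_sq, 2 - 2 * cos_ab)
```

For vectors that are not normalized, the dot product also rewards magnitude, so the three measures can rank results differently.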
MTEB: read it carefully
MTEB (Massive Text Embedding Benchmark, Muennighoff et al., 2022; arxiv.org/abs/2210.07316) is the standard leaderboard for embedding models, covering 56+ datasets across retrieval, reranking, clustering, classification, semantic textual similarity, and summarization. It is the right place to start. It is also routinely gamed and routinely misread. Three things to remember:
- An overall MTEB score averages across very different tasks. If you only care about retrieval, look at the retrieval subset, not the overall number.
- Many top-ranked models were trained on data that overlaps with MTEB evaluation sets. Treat headline scores as upper bounds, not predictions for your domain.
- Your corpus is not Wikipedia, MS MARCO, or BEIR. Always run your own evaluation on a held-out sample of your actual data before locking in a model.
The MTEB leaderboard lives at huggingface.co/spaces/mteb/leaderboard and is updated continuously. Treat any "best model" claim older than 90 days with suspicion.
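Running your own evaluation on a held-out sample does not need a framework. Here is a minimal sketch of Recall@k and nDCG@k for a single query, assuming you have a ranked list of retrieved document IDs (no duplicates) and hand-labeled relevance. The function names, IDs, and labels are made up for illustration.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded relevance as {doc_id: gain}; unlabeled IDs count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One labeled query: d1 and d2 are relevant, the system returned this order.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1, "d2": 1}

print(recall_at_k(retrieved, relevant, 3))
print(ndcg_at_k(retrieved, gains, 3))
```

Average both metrics over every query in the eval set, and compare models on those averages rather than on a handful of spot checks.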
Semantic search workflow, end to end
The minimum-effective-dose pipeline. Skip a step only after you've measured what it gives you.
- Chunk your documents. Typically 200–800 tokens per chunk with some overlap (50–100 tokens). Chunking quality dominates most retrieval outcomes; it's underrated.
- Embed each chunk with your chosen model. Store the vectors plus the source text plus metadata (document ID, position, timestamps).
- Build a vector index. HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2016, arxiv.org/abs/1603.09320) is the dominant approximate nearest neighbor (ANN) structure. IVF, ScaNN, and DiskANN are alternatives.
- At query time: embed the query with the same model (and the same input-type prefix if the model is asymmetric, like E5 or Cohere v3). Retrieve top-k candidates by cosine or inner product.
- Optionally apply hybrid search — combine dense vector retrieval with sparse keyword retrieval (BM25) using reciprocal rank fusion. Wins consistently on domain-specific terms, model numbers, and proper nouns that embeddings sometimes blur.
- Rerank the top-k with a cross-encoder. Usually the largest quality jump per dollar in the whole pipeline.
- Feed reranked context to your LLM (for RAG) or return directly to the user (for search).
- Measure end-to-end. Recall@k and nDCG on a labeled eval set are the standard metrics. If you don't have an eval set, you don't have a retrieval system — you have a vibe.
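The chunking and retrieval steps above can be sketched without any index library. This is a toy under stated assumptions: whitespace-split words stand in for the model's tokenizer, the vectors are hand-written rather than produced by an embedding model, and the search is exact brute force rather than HNSW. All function names are hypothetical.

```python
import math

def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """Exact search over a list of (chunk_id, vector) pairs, best score first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Chunking: 10 "tokens", windows of 4 with 1 token of overlap.
doc = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(doc, size=4, overlap=1)

# Retrieval: three toy chunk vectors and one toy query vector.
index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
hits = top_k([1.0, 0.1], index, k=2)
print(chunks)
print(hits)
```

Brute force like this is a useful baseline at small scale: it gives exact results, so you can measure how much recall an ANN index gives up in exchange for speed.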
RAG: retrieve, augment, generate
Vector databases
The vector database landscape is crowded and the differences matter less than vendors imply. Most production workloads at moderate scale (under ~10M vectors, sub-100ms p99 latency targets) can be served by any of these. Differences become real at scale, at multi-tenancy, or when filters and hybrid search complicate the query.
Pinecone
pinecone.io
Managed-only, serverless. Mature, well-instrumented, expensive at scale. Strong for teams that want zero ops and a clean API. Hybrid search via sparse-dense vectors. Serverless architecture announced in January 2024.
Weaviate
weaviate.io
Open-source plus managed cloud. GraphQL API. Strong multi-tenancy story. Built-in modules for embedding generation at write time. Native hybrid search (BM25 + dense).
Qdrant
qdrant.tech
Open-source, written in Rust. Strong filtering performance (filters applied during HNSW search, not after). Self-host or managed cloud. Quantization support is mature.
Milvus
milvus.io
Open-source, Apache 2.0. LF AI & Data Foundation graduated project. Designed for very large scale (billions of vectors). More operational complexity than Qdrant or Weaviate. Zilliz is the managed offering.
Chroma
trychroma.com
Open-source, lightweight, embedded-first. Pip-installable. Excellent for prototyping and small-to-mid corpora. Distributed Chroma exists for production.
pgvector
github.com/pgvector/pgvector
Postgres extension. The pragmatic choice when you already run Postgres. HNSW and IVFFlat indexes. Integrates with your existing transactional data and SQL filters. Performance has improved substantially since version 0.5.
LanceDB
lancedb.com
Embedded vector database built on the Lance columnar format. Disk-first, no separate server needed. Good for local-first apps and analytical workloads where vectors live alongside other columnar data.
FAISS
github.com/facebookresearch/faiss
Facebook AI Similarity Search. Not a database — a library. The reference implementation for ANN algorithms. Many of the systems above use FAISS internally or were inspired by it. Use directly when you want maximum control and minimum infrastructure.
Reranking with cross-encoders
When embeddings + RAG beats long-context — and when it doesn't
The honest decision matrix. Long-context frontier models (Claude with 200K, GPT-4 Turbo / GPT-4o, Gemini 1.5/2 with 1M+) made many RAG pipelines from 2022–2023 obsolete, but they did not replace RAG everywhere. Use this matrix as a starting point, not a verdict.
Honest gotchas
Things that bite production embedding systems that aren't on the leaderboard pages.
- Model drift — re-embedding millions of documents because you switched models is expensive. Pick a model you can live with for a year, or design for dual-indexing during migrations.
- Asymmetric models — E5, BGE, Cohere v3 and others expect different prefixes or input types for queries vs documents (e.g., 'query: ' vs 'passage: ' for E5). Forgetting this silently degrades retrieval by 5–15 points.
- Truncation — most embedding models have a max input length (often 512 tokens, sometimes 8192). Documents longer than that get silently truncated. Chunk first.
- Language mismatch — most leaderboards are English-heavy. For non-English or multilingual corpora, evaluate on multilingual benchmarks (MIRACL, MTEB multilingual) and prefer models explicitly trained multilingually (multilingual-e5-large, Cohere embed-multilingual-v3, BGE-M3).
- Domain mismatch — generic embedding models trained on web text underperform on legal, medical, financial, and code domains. Domain-specific fine-tuning or domain-trained models (Voyage's vertical variants, code-specific embedders) often beat a generic SOTA model on your specific task.
- Quantization — float32 vectors are wasteful. int8 quantization typically loses <1% recall and quarters the storage. Binary quantization with rescoring is even more aggressive and surprisingly competitive in recent benchmarks. Worth evaluating.
- Cosine traps — two vectors with similarity 0.85 sound close. They aren't necessarily. Calibrate against your own distribution; "high similarity" is relative to your corpus, not absolute.
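To make the quantization item concrete, here is a minimal sketch of symmetric int8 scalar quantization for a single vector. The per-vector scale is an assumption made for brevity; production systems often calibrate ranges across the whole corpus or per dimension, and the recall impact has to be measured on your own data. Function names are illustrative.

```python
from array import array

def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return array("b", codes), scale  # "b" = signed 8-bit integers

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

original = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)

# One byte per dimension instead of four for float32.
print(list(codes), codes.itemsize)
print(restored)
```

The rounding error per component is at most half the scale, which is why the loss is usually small when the values in a vector span a similar range.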
Minimum-effective-dose checklist
If you're starting a new RAG project today:
- Use OpenAI text-embedding-3-small or BGE-small as a baseline. Both are cheap, fast, and good enough to prove out the pipeline.
- Use pgvector if you already have Postgres, otherwise Qdrant or Chroma. Don't pick Pinecone or Milvus until you've outgrown the simpler option.
- Add hybrid search (BM25 + dense) on day one. It's a 50-line change that pays for itself.
- Add a reranker (Cohere Rerank or ms-marco-MiniLM-L-12-v2) before optimizing the embedding model.
- Build a 50-query labeled eval set before you tune anything else. Without it, you're guessing.
- Don't switch to a more expensive embedding model until your reranker is in place and your eval set tells you the embeddings are the bottleneck. Most of the time, they aren't.
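The fusion half of the hybrid-search item is small enough to show in full. Below is a minimal sketch of reciprocal rank fusion over two ranked ID lists, one from dense retrieval and one from BM25. The constant k=60 is the commonly used default, treated here as an assumption to tune; the function name and document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["d1", "d2", "d3"]   # order returned by vector search
sparse_hits = ["d3", "d1", "d4"]  # order returned by BM25

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because it works on ranks rather than raw scores, this avoids having to put cosine similarities and BM25 scores on a common scale. Documents found by both retrievers rise to the top; documents found by only one still make the list.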
Sources
- [01] Mikolov et al., 'Efficient Estimation of Word Representations in Vector Space.' Original Word2Vec paper introducing skip-gram and CBOW. arxiv.org/abs/1301.3781
- [02] Pennington, Socher, Manning, 'GloVe: Global Vectors for Word Representation,' EMNLP 2014. nlp.stanford.edu/pubs/glove.pdf
- [03] Bojanowski et al., 'Enriching Word Vectors with Subword Information.' The fastText paper. arxiv.org/abs/1607.04606
- [04] Reimers and Gurevych, 'Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,' EMNLP 2019. The canonical sentence-transformers paper. arxiv.org/abs/1908.10084
- [05] Sentence-Transformers library documentation and model hub (UKP Lab). sbert.net
- [06] Muennighoff et al., 'MTEB: Massive Text Embedding Benchmark.' Defines the standard evaluation suite for embedding models. arxiv.org/abs/2210.07316
- [07] Live MTEB leaderboard hosted on Hugging Face Spaces; updated continuously. huggingface.co/spaces/mteb/leaderboard
- [08] OpenAI announcement of text-embedding-3-small and text-embedding-3-large (January 2024), including Matryoshka dimension reduction. openai.com/blog/new-embedding-models-and-api-updates
- [09] Cohere embed v3 documentation, including input_type conditioning and multilingual variants. docs.cohere.com/docs/cohere-embed
- [10] Voyage AI embedding models documentation, including voyage-3 and domain-specific variants (voyage-code, voyage-finance, voyage-law). docs.voyageai.com
- [11] Lee et al., 'NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.' NVIDIA's NV-Embed paper. arxiv.org/abs/2405.17428
- [12] BAAI BGE model card and family documentation on Hugging Face. huggingface.co/BAAI/bge-large-en-v1.5
- [13] Chen et al., 'BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity.' BGE-M3 paper covering dense, sparse, and multi-vector retrieval. arxiv.org/abs/2402.03216
- [14] Wang et al., 'Text Embeddings by Weakly-Supervised Contrastive Pre-training.' Microsoft's E5 paper. arxiv.org/abs/2212.03533
- [15] Mixedbread AI mxbai-embed-large-v1 model card with Matryoshka training details. huggingface.co/mixedbread-ai/mxbai-embed-large-v1
- [16] Kusupati et al., 'Matryoshka Representation Learning,' NeurIPS 2022. The dimension-flexible embeddings approach. arxiv.org/abs/2205.13147
- [17] Malkov and Yashunin, 'Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.' The HNSW algorithm. arxiv.org/abs/1603.09320
- [18] Lewis et al., 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,' NeurIPS 2020. The foundational RAG paper. arxiv.org/abs/2005.11401
- [19] FAISS, the Facebook AI Similarity Search library; reference implementation for ANN. github.com/facebookresearch/faiss
- [20] pgvector, an open-source vector similarity extension for Postgres. github.com/pgvector/pgvector
- [21] Qdrant documentation covering HNSW with filtering, quantization, and self-hosting. qdrant.tech/documentation
- [22] Weaviate documentation including hybrid search and multi-tenancy. weaviate.io/developers/weaviate
- [23] Milvus open-source vector database documentation. milvus.io/docs
- [24] Cohere Rerank API documentation including rerank-english-v3 and rerank-multilingual-v3 models. docs.cohere.com/docs/rerank-overview
- [25] ms-marco-MiniLM cross-encoder model card; standard open-weight reranker baseline. huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2