Embeddings

Geometry of meaning · semantic search · retrieval pipelines · honest limits

An embedding is a learned map from a piece of content (a word, a sentence, a paragraph, an image) into a fixed-length vector of real numbers. The geometry of that vector space carries meaning. Points that are close together represent content with similar semantics; points far apart represent content that means different things. That single primitive, turning language into geometry, is what makes semantic search, retrieval-augmented generation, deduplication, classification, clustering, and recommendation systems work the way they do today.

This page walks the embeddings stack honestly. It covers:

The historical line from Word2Vec through GloVe and fastText to Sentence-BERT and the modern transformer-encoder families.
The current model landscape (OpenAI's text-embedding-3 series, Cohere embed v3, Voyage AI, NVIDIA NV-Embed, the BGE family from BAAI, Microsoft's E5, Mixedbread's mxbai) and how to read the MTEB leaderboard without getting fooled.
The distance metrics that matter and the ones that don't.
The full RAG pipeline from chunking through retrieval to reranking.
The vector database landscape (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector, LanceDB) and where each one fits.
Cross-encoder reranking with Cohere Rerank and ms-marco-MiniLM, which usually matters more than which embedding model you picked.
Most importantly, when embeddings plus retrieval beats just throwing the whole document into a long-context model, and when it doesn't.

The honest answer is that the calculus shifted in 2024–2025 and is still shifting. Long context kills RAG for many use cases. Retrieval beats long context for many others. The decision is empirical, not ideological.

Voice throughout: lab-grade, anti-hype, plain language. Where a fact is time-sensitive (model versions, pricing, leaderboard positions), I say so explicitly. Check provider docs before you ship.

What an embedding actually is

Pick a sentence: "the cat sat on the mat." An embedding model (say, a sentence transformer with 768 output dimensions) converts that sentence into a list of 768 floating-point numbers. "A feline rested on the rug" gets converted into a different list of 768 numbers. If the embedding model is any good, those two lists will be close to each other in 768-dimensional space, because the sentences mean nearly the same thing. "The stock market closed lower" gets converted into a list that's far away from both, because it's about something unrelated.

The geometric intuition is the whole game. Imagine plotting every sentence in English as a point in some high-dimensional space. A good embedding arranges those points so that semantic neighbors are spatial neighbors. Once you have that arrangement, every question about meaning becomes a question about geometry: find the nearest neighbors, find the centroid of a cluster, measure the angle between two vectors. Linear algebra replaces hand-coded rules.

The famous early demonstration was Mikolov's 2013 Word2Vec result that king − man + woman ≈ queen as vectors. That arithmetic-on-meaning property doesn't hold cleanly in modern sentence embeddings (the geometry is more entangled), but the broader claim, that meaning has structure that survives projection into a fixed-dimensional space, is the foundation everything else is built on.

The vector itself is opaque. You can't read it. You can only measure distances and angles against other vectors from the same model. Mixing vectors from different models is meaningless because they live in different spaces.
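To make "semantic neighbors are spatial neighbors" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-picked stand-ins, not output from any real model (a real model would return hundreds of dimensions), and the helper names `cosine_similarity` and `nearest` are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# ASSUMPTION: toy 3-d "embeddings" chosen by hand for illustration.
toy_vectors = {
    "the cat sat on the mat": [0.9, 0.1, 0.0],
    "a feline rested on the rug": [0.8, 0.2, 0.1],
    "the stock market closed lower": [0.0, 0.1, 0.9],
}

def nearest(query_text, vectors):
    """Return the other text whose vector has the highest cosine similarity."""
    query_vec = vectors[query_text]
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for text, vec in vectors.items()
        if text != query_text
    ]
    return max(scored)[1]

if __name__ == "__main__":
    print(nearest("the cat sat on the mat", toy_vectors))
```

With these toy values the cat sentence lands next to the feline sentence and nearly orthogonal to the stock-market one, which is the arrangement a good model is trained to produce.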

Historical line: from Word2Vec to sentence transformers

2013
Word2Vec (Mikolov et al., Google)
Two architectures — skip-gram and CBOW — trained on the prediction task of "given a word, predict its neighbors" (or vice versa). Produced dense word vectors, typically 100–300 dimensions. The king/queen arithmetic demo made the technique famous. Paper: "Efficient Estimation of Word Representations in Vector Space" (arxiv.org/abs/1301.3781).
2014
GloVe (Pennington, Socher, Manning, Stanford)
Global Vectors for Word Representation. Reformulated word embedding training as a matrix factorization problem on global word-word co-occurrence counts, rather than local context windows. Often comparable to Word2Vec; sometimes better on analogy tasks. Paper: "GloVe: Global Vectors for Word Representation" (nlp.stanford.edu/pubs/glove.pdf).
2016
fastText (Bojanowski et al., Facebook AI)
Extended Word2Vec with subword (character n-gram) information. Big win on morphologically rich languages and out-of-vocabulary words. Paper: "Enriching Word Vectors with Subword Information" (arxiv.org/abs/1607.04606).
2018
ELMo and BERT — contextual embeddings
ELMo (Peters et al.) and then BERT (Devlin et al., Google) replaced static word vectors with contextual ones — the same word gets a different vector depending on its sentence. BERT's [CLS] token or mean-pooled hidden states became the de facto sentence representation, though imperfectly.
2019
Sentence-BERT (Reimers and Gurevych)
The pivotal paper. Showed that raw BERT [CLS] embeddings are bad for semantic similarity, and that fine-tuning BERT with a siamese-network objective on NLI data produces sentence embeddings that work well with cosine similarity. The sentence-transformers library (SBERT.net) became the open-source standard. Paper: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (arxiv.org/abs/1908.10084).
2022
OpenAI text-embedding-ada-002
The first widely-adopted commercial embedding API at scale. 1536 dimensions, single model for everything, cheap. Became the default for most production RAG systems through 2023.
2023–2024
The encoder renaissance
BGE (BAAI), E5 (Microsoft), GTE (Alibaba), Cohere embed-v3, Voyage AI, mxbai-embed-large, NV-Embed. All push hard on MTEB. Many open-weight models match or beat proprietary ones on retrieval benchmarks. Matryoshka embeddings (Kusupati et al., NeurIPS 2022) and quantization-aware training enter the mainstream.
2024–2026
Long context vs RAG tension
Frontier models with 1M+ token context windows force a rethink of when RAG is the right architecture. Embeddings remain dominant for very large corpora, latency-sensitive lookups, and cost-bound workloads, but the boundary keeps moving. The discipline matures from "always RAG" to "measure and decide."

The current model landscape

Best-effort snapshot as of mid-2026. Model lineups, dimensions, and leaderboard positions shift monthly — check the MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) and each provider's docs before you commit. Dimensions listed are native; many of these models support Matryoshka-style truncation to smaller sizes with a quality tradeoff.

Model family	Provider	Native dim	Open weights?	Notes
text-embedding-3-small / -3-large	OpenAI	1536 / 3072	No	Released January 2024. Supports dimension reduction via the `dimensions` parameter (Matryoshka-style). -3-large is OpenAI's flagship; -3-small is the cost-optimized option. Replaced ada-002.
Cohere embed v3	Cohere	1024 (English) / 1024 (multilingual)	No	Strong on retrieval. Cohere also publishes embed-multilingual-v3 covering 100+ languages. Trained with input-type conditioning (search_document vs search_query vs classification).
Voyage AI (voyage-3, voyage-3-large, voyage-code)	Voyage AI	1024 / 2048 / varies	No	Voyage emphasizes domain-tuned variants (code, finance, legal). Frequently sits at or near the top of MTEB English retrieval; check current leaderboard for exact ranking.
NV-Embed-v1 / -v2	NVIDIA	4096	Yes (research license)	Built on Mistral-7B decoder. Topped MTEB English retrieval leaderboard in 2024. Paper: arxiv.org/abs/2405.17428.
BGE (bge-small, -base, -large, -m3)	BAAI (Beijing Academy of AI)	384 / 768 / 1024 / 1024	Yes (MIT)	Workhorse open-weight family. bge-m3 supports dense, sparse, and multi-vector retrieval in one model. Widely deployed.
E5 (e5-small, -base, -large, multilingual-e5)	Microsoft	384 / 768 / 1024	Yes (MIT)	Wang et al., 2022 (arxiv.org/abs/2212.03533). Trained with weakly supervised contrastive pretraining. multilingual-e5-large is a common default for non-English.
mxbai-embed-large-v1	Mixedbread AI	1024	Yes (Apache 2.0)	Trained with Matryoshka representation learning; truncates gracefully to smaller dimensions. Strong open-weight contender.
sentence-transformers/all-MiniLM-L6-v2	UKP Lab / community	384	Yes (Apache 2.0)	The classic small workhorse. Not state-of-the-art but tiny, fast, and a sane baseline. Still appropriate when latency or cost dominates.
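Matryoshka-style truncation, mentioned for several models above, amounts to keeping a prefix of the vector and re-normalizing it. The sketch below is a client-side illustration under that assumption; the function names are hypothetical, and it is only meaningful for models actually trained with Matryoshka representation learning. Truncating vectors from other models will likely cost far more quality.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize.

    ASSUMPTION: the source model was trained so that leading dimensions
    carry most of the signal (Matryoshka representation learning).
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    return l2_normalize(vec[:dims])

if __name__ == "__main__":
    full = l2_normalize([0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
    short = truncate_embedding(full, 3)
    print(len(short), sum(x * x for x in short))
```

The re-normalization step matters: a truncated prefix of a unit vector is no longer unit length, so cosine and inner product would stop agreeing without it.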

Cosine, dot product, Euclidean — which one to use

Three distance functions dominate vector search. Cosine similarity measures the angle between two vectors, ignoring magnitude; it ranges from −1 (opposite) to 1 (identical direction). Dot product (inner product) multiplies the vectors elementwise and sums; it depends on both angle and magnitude. Euclidean distance (L2) is the straight-line distance in n-dimensional space.

Many modern sentence embedding models L2-normalize their output to unit length. When all vectors have magnitude 1, cosine similarity equals the dot product, and squared Euclidean distance equals 2 − 2 × cosine, so all three produce identical rankings (highest similarity first, or smallest distance first). In practice, then, the choice rarely affects retrieval quality.

Where it matters: (1) When vectors are not normalized, dot product and cosine diverge. Dot product favors vectors with larger magnitudes, which is sometimes what you want (e.g., when magnitude encodes confidence) and usually isn't. (2) Vector database performance differs. An approximate nearest-neighbor index such as HNSW is built for one metric; FAISS and most databases let you pick. Inner product on normalized vectors is usually fastest.

Default recommendation: L2-normalize your embeddings at write time and query time, then use cosine or inner product interchangeably. Check the model card rather than assuming: OpenAI documents its embeddings as normalized to unit length, but other providers and open-weight models vary, and the sentence-transformers encode method only normalizes when you ask it to (normalize_embeddings=True) unless the model itself ends in a normalization layer.
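The claim that the three metrics agree on unit vectors, and disagree otherwise, is easy to check by hand. This is a small sketch with made-up two-dimensional vectors; the helper names are illustrative and nothing here depends on a particular vector database.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def ranking(score, query, docs, descending=True):
    """Document indices ordered best-first under the given score function."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]),
                  reverse=descending)

# ASSUMPTION: toy vectors with deliberately different magnitudes.
query = [1.0, 0.0]
docs = [[10.0, 10.0], [1.0, 0.1], [0.0, 2.0]]

raw_dot = ranking(dot, query, docs)       # magnitude dominates
raw_cos = ranking(cosine, query, docs)    # angle only

unit_query = normalize(query)
unit_docs = [normalize(d) for d in docs]
unit_dot = ranking(dot, unit_query, unit_docs)
unit_l2 = ranking(euclidean, unit_query, unit_docs, descending=False)

if __name__ == "__main__":
    print(raw_dot, raw_cos, unit_dot, unit_l2)
```

On the raw vectors the long document wins under dot product even though its angle to the query is worse; after normalization, dot product, cosine, and Euclidean distance all return the same order.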

MTEB: read it carefully

MTEB (Massive Text Embedding Benchmark, Muennighoff et al., 2022, arxiv.org/abs/2210.07316) is the standard leaderboard for embedding models. The original paper spans 58 datasets across 8 task types, including retrieval, reranking, clustering, classification, semantic textual similarity, and summarization. It is the right place to start. It is also routinely gamed and routinely misread. Three things to remember: (1) An overall MTEB score averages across very different tasks. If you only care about retrieval, look at the retrieval subset, not the overall number. (2) Many top-ranked models were trained on data that overlaps with MTEB evaluation sets. Treat headline scores as upper bounds, not predictions for your domain. (3) Your corpus is not Wikipedia, MS MARCO, or BEIR. Always run your own evaluation on a held-out sample of your actual data before locking in a model. The MTEB leaderboard lives at huggingface.co/spaces/mteb/leaderboard and is updated continuously. Treat any "best model" claim older than 90 days with suspicion.
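Running your own evaluation needs very little machinery. The sketch below computes recall@k and nDCG@k for a single query, assuming you have a ranked list of document IDs from your retriever and a set of hand-labeled relevant IDs. The function names and the linear-gain form of DCG are choices made for this example; some toolkits use an exponential gain instead.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with linear gain; `relevance` maps doc ID to a graded label."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(position + 2)
        for position, gain in enumerate(ideal_gains)
    )
    return dcg / idcg if idcg > 0 else 0.0

if __name__ == "__main__":
    # ASSUMPTION: hypothetical IDs and labels for one query.
    ranked = ["d3", "d1", "d7", "d2"]
    labels = {"d1": 2, "d2": 1}
    print(recall_at_k(ranked, labels.keys(), 3), ndcg_at_k(ranked, labels, 3))
```

Average these over every query in your labeled set, and compare candidate models on that average rather than on a leaderboard number.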

Semantic search workflow, end to end

The minimum-effective-dose pipeline. Skip a step only after you've measured what it gives you.

Chunk your documents. Typically 200–800 tokens per chunk with some overlap (50–100 tokens). Chunking quality dominates most retrieval outcomes; it's underrated.
Embed each chunk with your chosen model. Store the vectors plus the source text plus metadata (document ID, position, timestamps).
Build a vector index. HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2016, arxiv.org/abs/1603.09320) is the dominant ANN structure. IVF, ScaNN, DiskANN are alternatives.
At query time: embed the query with the same model (and the same input-type prefix if the model is asymmetric, like E5 or Cohere v3). Retrieve top-k candidates by cosine or inner product.
Optionally apply hybrid search — combine dense vector retrieval with sparse keyword retrieval (BM25) using reciprocal rank fusion. Wins consistently on domain-specific terms, model numbers, and proper nouns that embeddings sometimes blur.
Rerank the top-k with a cross-encoder. Usually the largest quality jump per dollar in the whole pipeline.
Feed reranked context to your LLM (for RAG) or return directly to the user (for search).
Measure end-to-end. Recall@k and nDCG on a labeled eval set are the standard metrics. If you don't have an eval set, you don't have a retrieval system — you have a vibe.
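Two of the steps above, overlapping chunking and reciprocal rank fusion, can be sketched in a few lines. This is an illustration rather than a production implementation: the token list stands in for whatever tokenizer your embedding model uses, the function names are hypothetical, and k=60 is the constant commonly used for reciprocal rank fusion, not a tuned value.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

if __name__ == "__main__":
    print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
    dense_hits = ["a", "b", "c"]    # ASSUMPTION: IDs from vector search
    sparse_hits = ["c", "a", "d"]   # ASSUMPTION: IDs from BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Fusion works on ranks rather than raw scores, which is why it can combine cosine similarities and BM25 scores without any calibration between the two.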

RAG: retrieve, augment, generate

Retrieval-augmented generation is the workflow Lewis et al. named in their 2020 paper ("Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," arxiv.org/abs/2005.11401). The pattern is now ubiquitous and the architecture has converged.

The pipeline: a user asks a question. The system embeds the question, retrieves the most relevant chunks from a vector index, optionally reranks them, then constructs a prompt that includes the retrieved chunks as context plus the original question. A generation model (typically a frontier LLM such as Claude, GPT, or Gemini, or an open-weight model) reads the context and produces an answer. The retrieval step grounds the model in source material it didn't memorize; the generation step turns retrieved text into a fluent, integrated answer.

The pipeline succeeds or fails on the retrieval step. If the retriever misses the relevant chunk, the generator hallucinates with confidence. If the retriever returns the right chunk in position 17 but the prompt only fits 10 chunks, the generator still misses it. This is why reranking matters and why honest evaluation matters.

Four failure modes show up in production over and over.

(1) The chunk boundary cuts across the answer: the relevant span is split between two chunks and neither one alone is sufficient.
(2) The query is phrased differently from the document: "how do I cancel my subscription" vs documentation that says "to terminate service." Hybrid search and query rewriting both help.
(3) The corpus has multiple plausible answers and the retriever returns one without flagging the others.
(4) The corpus contains stale information and the retriever has no notion of recency.

None of these are solved by switching embedding models; they're solved upstream of the embedding model.
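The "augment" step is mostly string assembly, and it is where the position-17 problem becomes visible. The sketch below assumes the chunks arrive already ranked best-first; the function name, the prompt wording, and the numbered-source format are all illustrative choices, not a prescribed template.

```python
def build_rag_prompt(question, ranked_chunks, max_chunks=10):
    """Assemble a prompt from the top-ranked chunks plus the user question.

    Anything ranked below `max_chunks` is dropped, so a relevant chunk at
    position 17 never reaches the generator unless reranking moves it up.
    """
    selected = ranked_chunks[:max_chunks]
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number. If the sources do not contain the answer, "
        "say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    # ASSUMPTION: placeholder chunk texts standing in for retrieved passages.
    chunks = [f"passage number {i}" for i in range(1, 21)]
    print(build_rag_prompt("How do I cancel my subscription?", chunks, max_chunks=3))
```

Numbering the sources is one simple way to make citations checkable: the answer can be traced back to the chunk IDs you stored at indexing time.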

Vector databases

The vector database landscape is crowded and the differences matter less than vendors imply. Most production workloads at moderate scale (under ~10M vectors, sub-100ms p99 latency targets) can be served by any of these. Differences become real at scale, at multi-tenancy, or when filters and hybrid search complicate the query.

Pinecone

pinecone.io

Managed-only, serverless. Mature, well-instrumented, expensive at scale. Strong for teams that want zero ops and a clean API. Hybrid search via sparse-dense vectors. Serverless tier launched in early 2024.

Weaviate

weaviate.io

Open-source plus managed cloud. GraphQL API. Strong multi-tenancy story. Built-in modules for embedding generation at write time. Native hybrid search (BM25 + dense).

Qdrant

qdrant.tech

Open-source, written in Rust. Strong filtering performance (filters applied during HNSW search, not after). Self-host or managed cloud. Quantization support is mature.

Milvus

milvus.io

Open-source, Apache 2.0. A graduated project of the LF AI & Data Foundation. Designed for very large scale (billions of vectors). More operational complexity than Qdrant or Weaviate. Zilliz is the managed offering.

Chroma

trychroma.com

Open-source, lightweight, embedded-first. Pip-installable. Excellent for prototyping and small-to-mid corpora. Distributed Chroma exists for production.

pgvector

github.com/pgvector/pgvector

Postgres extension. The pragmatic choice when you already run Postgres. HNSW and IVFFlat indexes. Integrates with your existing transactional data and SQL filters. Performance has improved substantially since version 0.5.

LanceDB

lancedb.com

Embedded vector database built on the Lance columnar format. Disk-first, no separate server needed. Good for local-first apps and analytical workloads where vectors live alongside other columnar data.

FAISS

github.com/facebookresearch/faiss

Facebook AI Similarity Search. Not a database — a library. The reference implementation for ANN algorithms. Many of the systems above use FAISS internally or were inspired by it. Use directly when you want maximum control and minimum infrastructure.

Reranking with cross-encoders

An embedding model is a bi-encoder: it encodes the query and the document independently, then compares vectors. That's what makes it fast: you can pre-compute all the document vectors offline. A cross-encoder, by contrast, takes the query and a document together as a single input and outputs a relevance score. It's slower (you can't pre-compute, so you have to run a forward pass per candidate), but it's substantially more accurate because it can attend to query-document interactions.

The production pattern: use a fast bi-encoder to retrieve top-100 (or top-50, or top-200) candidates, then use a cross-encoder to rerank those candidates down to the top-5 or top-10 that actually go to the LLM. The cross-encoder sees few enough candidates that latency stays reasonable, and the bi-encoder did the heavy lifting of narrowing from millions to hundreds.

Two widely-used options. (1) Cohere Rerank, a managed API. Cohere has shipped multiple versions; rerank-english-v3.0 and rerank-multilingual-v3.0 are the typical defaults as of 2024–2026. Easy to drop in, strong quality. (2) Open-weight cross-encoders from the sentence-transformers library. The ms-marco-MiniLM family (cross-encoder/ms-marco-MiniLM-L-6-v2 and -L-12-v2) is the standard baseline: tiny, fast on CPU, surprisingly competitive. The L-12 variant and larger models like cross-encoder/ms-marco-electra-base trade speed for quality.

In my experience and in published RAG-eval reports, adding a reranker to a competent retrieval pipeline typically produces a larger end-to-end quality improvement than switching from a mid-tier embedding model to a top-tier one. If you're optimizing a pipeline and haven't added a reranker yet, do that first.
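The retrieve-then-rerank pattern can be sketched independently of any particular model. In the example below both scorers are toy stand-ins written for this illustration: `toy_bi_score` plays the role of a vector-similarity lookup and `toy_cross_score` plays the role of a cross-encoder. In a real pipeline the second function would wrap a reranking model or API; the structure of the two stages is the point here, not the scoring logic.

```python
def retrieve_then_rerank(query, corpus, fast_score, slow_score,
                         n_candidates=100, top_k=5):
    """Stage 1: cheap scorer over the whole corpus. Stage 2: expensive
    scorer over the shortlist only."""
    shortlist = sorted(corpus, key=lambda doc: fast_score(query, doc),
                       reverse=True)[:n_candidates]
    reranked = sorted(shortlist, key=lambda doc: slow_score(query, doc),
                      reverse=True)
    return reranked[:top_k]

def toy_bi_score(query, doc):
    """ASSUMPTION: stand-in for bi-encoder similarity (shared-word count)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_cross_score(query, doc):
    """ASSUMPTION: stand-in for a cross-encoder (density of query words)."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)

if __name__ == "__main__":
    corpus = [
        "subscription pricing and how to cancel a trial",
        "cancel your subscription in account settings",
        "reset your password",
        "how do I upgrade",
    ]
    print(retrieve_then_rerank("cancel subscription", corpus,
                               toy_bi_score, toy_cross_score,
                               n_candidates=2, top_k=1))
```

The expensive scorer runs once per shortlisted document, never once per corpus document, which is what keeps latency bounded as the corpus grows.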

When embeddings + RAG beats long-context — and when it doesn't

The honest decision matrix. Long-context frontier models (Claude with 200K, GPT-4 Turbo / GPT-4o, Gemini 1.5/2 with 1M+) made many RAG pipelines from 2022–2023 obsolete. They also did not replace RAG everywhere. Use this matrix as a starting point, not a verdict.

Situation	Better choice	Why
Corpus larger than the model's context window (>1M tokens of source material)	RAG	You can't fit it in context. The decision is mechanical.
Corpus fits in context, single query, latency tolerant	Long context	Skip the infra. Stuff the docs in the prompt. Modern long-context models retrieve well across their windows (within limits).
Same corpus queried many times, cost-sensitive	RAG	Embedding lookup is far cheaper per query than re-processing the full corpus. Prompt caching narrows but doesn't eliminate the gap.
Latency budget under 1 second	RAG	Long-context prefill is slow. Retrieval + small context window is faster.
Corpus updates frequently, you need freshness	RAG	Incremental index updates are cheap. Re-uploading a million-token document on every change is not.
Need to cite specific source chunks in the answer	RAG	You already have the chunks. Long-context citation is possible but more error-prone.
Cross-document reasoning, synthesis across many sources	Mixed	Retrieval to surface candidates, then long-context to reason across them. The two approaches compose.
Highly structured queries with strong keyword overlap	Hybrid (dense + sparse)	Pure dense retrieval underperforms BM25 on product codes, error messages, proper nouns. Combine both.
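The "same corpus queried many times" row is easiest to appreciate as token arithmetic. The sketch below only counts tokens; every number in the usage example is an assumption chosen for illustration, it ignores prompt caching, and it treats embedding tokens and generation-model input tokens as if they cost the same, which they generally do not. Plug in your own corpus size and provider pricing before drawing conclusions.

```python
def tokens_processed(corpus_tokens, queries, chunks_per_query,
                     tokens_per_chunk, question_tokens=50):
    """Return (long_context_total, rag_total) input tokens over all queries.

    Long context re-reads the whole corpus on every query. RAG embeds the
    corpus once, then sends only the retrieved chunks per query.
    """
    long_context_total = queries * (corpus_tokens + question_tokens)
    rag_per_query = chunks_per_query * tokens_per_chunk + question_tokens
    rag_total = corpus_tokens + queries * rag_per_query
    return long_context_total, rag_total

if __name__ == "__main__":
    # ASSUMPTION: 500K-token corpus, 1,000 queries, 10 chunks of 500 tokens.
    long_total, rag_total = tokens_processed(500_000, 1_000, 10, 500)
    print(long_total, rag_total, round(long_total / rag_total, 1))
```

For a single query the one-time embedding pass makes RAG the more expensive option in raw tokens, which matches the "single query, latency tolerant" row above.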

Honest gotchas

Things that bite production embedding systems that aren't on the leaderboard pages.

Model drift — re-embedding millions of documents because you switched models is expensive. Pick a model you can live with for a year, or design for dual-indexing during migrations.
Asymmetric models — E5, BGE, Cohere v3 and others expect different prefixes or input types for queries vs documents (e.g., 'query: ' vs 'passage: ' for E5). Forgetting this silently degrades retrieval by 5–15 points.
Truncation — most embedding models have a max input length (often 512 tokens, sometimes 8192). Documents longer than that get silently truncated. Chunk first.
Language mismatch — most leaderboards are English-heavy. For non-English or multilingual corpora, evaluate on multilingual benchmarks (MIRACL, MTEB multilingual) and prefer models explicitly trained multilingually (multilingual-e5-large, Cohere embed-multilingual-v3, BGE-M3).
Domain mismatch — generic embedding models trained on web text underperform on legal, medical, financial, and code domains. Domain-specific fine-tuning or domain-trained models (Voyage's vertical variants, code-specific embedders) often beat a generic SOTA model on your specific task.
Quantization — float32 vectors are wasteful. int8 quantization typically loses <1% recall and quarters the storage. Binary quantization with rescoring is even more aggressive and surprisingly competitive in recent benchmarks. Worth evaluating.
Cosine traps — two vectors with similarity 0.85 sound close. They aren't necessarily. Calibrate against your own distribution; "high similarity" is relative to your corpus, not absolute.
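The quantization item above can be illustrated with a minimal symmetric int8 scheme. This is a sketch under simplifying assumptions: it scales each vector by its own largest absolute value, whereas production systems often calibrate ranges per dimension over the whole corpus, and the function names are hypothetical. It shows where the four-fold storage saving comes from and how large the per-component rounding error can be.

```python
from array import array

def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [code * scale for code in codes]

def storage_bytes(vec, codes):
    """Bytes needed as float32 versus int8 (the scale factor is ignored)."""
    as_float32 = array("f", vec)
    as_int8 = array("b", codes)
    return (len(as_float32) * as_float32.itemsize,
            len(as_int8) * as_int8.itemsize)

if __name__ == "__main__":
    vector = [0.12, -0.5, 0.33, 0.0]   # ASSUMPTION: toy values
    codes, scale = quantize_int8(vector)
    restored = dequantize(codes, scale)
    print(codes, storage_bytes(vector, codes))
    print(max(abs(a - b) for a, b in zip(vector, restored)))
```

Rounding error per component is bounded by half the scale, which is why recall usually survives; measure it on your own eval set before relying on that.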

Minimum-effective-dose checklist

If you're starting a new RAG project today:

(1) Use OpenAI text-embedding-3-small or BGE-small as a baseline. Both are cheap, fast, and good enough to prove out the pipeline.
(2) Use pgvector if you already have Postgres, otherwise Qdrant or Chroma. Don't pick Pinecone or Milvus until you've outgrown the simpler option.
(3) Add hybrid search (BM25 + dense) on day one. It's a 50-line change that pays for itself.
(4) Add a reranker (Cohere Rerank or ms-marco-MiniLM-L-12-v2) before optimizing the embedding model.
(5) Build a 50-query labeled eval set before you tune anything else. Without it, you're guessing.
(6) Don't switch to a more expensive embedding model until your reranker is in place and your eval set tells you the embeddings are the bottleneck. Most of the time, they aren't.

Sources

[01]
Mikolov et al., 'Efficient Estimation of Word Representations in Vector Space' — original Word2Vec paper introducing skip-gram and CBOW.
arxiv.org/abs/1301.3781
[02]
Pennington, Socher, Manning — 'GloVe: Global Vectors for Word Representation,' EMNLP 2014.
nlp.stanford.edu/pubs/glove.pdf
[03]
Bojanowski et al., 'Enriching Word Vectors with Subword Information' — fastText paper.
arxiv.org/abs/1607.04606
[04]
Reimers and Gurevych, 'Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,' EMNLP 2019 — the canonical sentence-transformers paper.
arxiv.org/abs/1908.10084
[05]
Sentence-Transformers library documentation and model hub (UKP Lab).
sbert.net
[06]
Muennighoff et al., 'MTEB: Massive Text Embedding Benchmark' — defines the standard evaluation suite for embedding models.
arxiv.org/abs/2210.07316
[07]
Live MTEB leaderboard hosted on Hugging Face Spaces; updated continuously.
huggingface.co/spaces/mteb/leaderboard
[08]
OpenAI announcement of text-embedding-3-small and text-embedding-3-large (January 2024), including Matryoshka dimension reduction.
openai.com/blog/new-embedding-models-and-api-updates
[09]
Cohere embed v3 documentation, including input_type conditioning and multilingual variants.
docs.cohere.com/docs/cohere-embed
[10]
Voyage AI embedding models documentation, including voyage-3 and domain-specific variants (voyage-code, voyage-finance, voyage-law).
docs.voyageai.com
[11]
Lee et al., 'NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models' — NVIDIA's NV-Embed paper.
arxiv.org/abs/2405.17428
[12]
BAAI BGE model card and family documentation on Hugging Face.
huggingface.co/BAAI/bge-large-en-v1.5
[13]
Chen et al., 'BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity' — BGE-M3 paper covering dense, sparse, and multi-vector retrieval.
arxiv.org/abs/2402.03216
[14]
Wang et al., 'Text Embeddings by Weakly-Supervised Contrastive Pre-training' — Microsoft's E5 paper.
arxiv.org/abs/2212.03533
[15]
Mixedbread AI mxbai-embed-large-v1 model card with Matryoshka training details.
huggingface.co/mixedbread-ai/mxbai-embed-large-v1
[16]
Kusupati et al., 'Matryoshka Representation Learning,' NeurIPS 2022 — the dimension-flexible embeddings approach.
arxiv.org/abs/2205.13147
[17]
Malkov and Yashunin, 'Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs' — the HNSW algorithm.
arxiv.org/abs/1603.09320
[18]
Lewis et al., 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,' NeurIPS 2020 — the foundational RAG paper.
arxiv.org/abs/2005.11401
[19]
FAISS — Facebook AI Similarity Search library, reference implementation for ANN.
github.com/facebookresearch/faiss
[20]
pgvector — open-source vector similarity extension for Postgres.
github.com/pgvector/pgvector
[21]
Qdrant documentation covering HNSW with filtering, quantization, and self-hosting.
qdrant.tech/documentation
[22]
Weaviate documentation including hybrid search and multi-tenancy.
weaviate.io/developers/weaviate
[23]
Milvus open-source vector database documentation.
milvus.io/docs
[24]
Cohere Rerank API documentation including rerank-english-v3.0 and rerank-multilingual-v3.0 models.
docs.cohere.com/docs/rerank-overview
[25]
ms-marco-MiniLM cross-encoder model card — standard open-weight reranker baseline.
huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
