Retrieval-augmented generation · the atlas

The most-deployed AI pattern of 2026, explained honestly.

Nearly every enterprise AI app touches RAG somewhere. Most production RAG systems are not the diagram in the blog post: they combine hybrid search, contextual chunking, reranking, query rewriting, and document-level permission boundaries. This page walks through the stack as it is actually built, including where it breaks.

Eight architecture patterns

From naive RAG to no-RAG.

Naive RAG

User query → embed → vector search top-k chunks → stuff into LLM context → generate. The textbook diagram every blog post draws. Works for narrow, well-curated corpora. Fails on anything resembling a real enterprise document set.

When to use: Demo apps, FAQ bots, small knowledge bases with high editorial care.

Where it breaks: Multi-hop questions. Questions that need reasoning across documents. Documents with poor chunking. Tables. Code. Most real corpora.
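The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the corpus is made up, and the final LLM call is left out so the example stays self-contained.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Vector search: rank every chunk by similarity to the query, keep top-k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt that would go to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Made-up corpus, for illustration only.
CHUNKS = [
    "refunds are issued within 14 days of purchase",
    "support is available on weekdays from 9 to 5",
    "shipping takes 3 to 5 business days",
]

question = "when are refunds issued"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
```

Everything that follows on this page replaces or wraps one of these three steps.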

Hybrid search (dense + sparse)

Combine vector similarity (semantic) with BM25 or a similar lexical match (keyword). Merge the two ranked lists with reciprocal rank fusion or a similar score-blending method. This catches both 'documents similar in meaning' and 'documents that contain the exact rare term the user typed.' Substantially better than pure vector search on real corpora.

When to use: Most production RAG should start here, not at naive RAG.

Where it breaks: Still has the chunking + reasoning + multi-hop problems. Just better at the retrieval step.
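Reciprocal rank fusion is simple enough to show in full. The sketch below assumes each retriever has already returned a ranked list of document ids; the ids are made up, and `k=60` is a commonly used default constant, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists. Each document scores sum(1 / (k + rank))
    over the lists it appears in, with rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense = ["d1", "d2", "d3"]
sparse = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main reason this method is a popular first choice.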

Contextual retrieval (Anthropic Sep 2024)

Before embedding each chunk, prepend a model-generated context blob that situates the chunk within its broader document (the section it's from, the document type, the company/topic). Anthropic reports that contextual embeddings combined with contextual BM25 cut the top-20 retrieval failure rate by 49% against a standard embedding baseline, and by 67% once reranking is added. Pairs with hybrid search and reranking.

When to use: When chunk-level context-loss is causing retrieval failures (i.e., always).

Where it breaks: Requires a per-chunk LLM call at indexing time, so both index-build time and index-build cost go up.
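The indexing-time step can be sketched as follows. `generate_context` is a hypothetical stand-in for the per-chunk LLM call (here it only formats metadata), and the `Chunk` fields and sample document are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str
    section: str
    text: str

def generate_context(chunk: Chunk) -> str:
    """Hypothetical stand-in for the LLM call that writes a short
    situating blurb for the chunk. A real system would prompt a model
    with the full document plus the chunk."""
    return f"From '{chunk.doc_title}', section '{chunk.section}'."

def contextualize(chunk: Chunk) -> str:
    """Text that gets embedded and indexed: context first, then the chunk."""
    return f"{generate_context(chunk)}\n{chunk.text}"

# Made-up example: on its own, the chunk text does not say which company
# or which quarter it refers to.
chunk = Chunk(
    doc_title="ACME Q2 2023 filing",
    section="Revenue",
    text="Revenue grew 3% over the previous quarter.",
)
indexed_text = contextualize(chunk)
```

The original chunk text is still what gets shown to the generator at query time; only the indexed representation changes.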

Reranking

After retrieval brings back the top 50 or top 100 candidates, a separate cross-encoder model re-scores each candidate against the query. Cohere Rerank, Voyage Rerank, and BGE-Reranker are among the public options. Reranking is often the single most impactful improvement a RAG system can make.

When to use: Any production system. Almost always net-positive.

Where it breaks: Adds latency (~50-200ms). Adds cost. A bad reranker can be worse than no reranker.
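The two-stage shape looks roughly like this. Both scoring functions are toy stand-ins: `first_stage` plays the role of the cheap recall-oriented retriever, and `cross_encoder_score` is a hypothetical placeholder for a real cross-encoder, which would score the query and chunk jointly with a model.

```python
def first_stage(query: str, corpus: list[str], n: int) -> list[str]:
    """Cheap, recall-oriented stage: keep any chunk sharing a term with the query."""
    q = set(query.lower().split())
    hits = [c for c in corpus if q & set(c.lower().split())]
    return hits[:n]

def cross_encoder_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: the fraction of query
    terms that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def rerank(query: str, candidates: list[str], top_k: int) -> list[str]:
    """Precision-oriented stage: re-score every candidate, keep the best top_k."""
    scored = sorted(candidates, key=lambda c: cross_encoder_score(query, c), reverse=True)
    return scored[:top_k]

# Made-up corpus, for illustration only.
CORPUS = [
    "reset the router by holding the button",
    "the password policy requires twelve characters",
    "reset your password from the account page",
]

query = "reset password"
candidates = first_stage(query, CORPUS, n=50)   # wide net
best = rerank(query, candidates, top_k=1)       # narrow to what the LLM sees
```

The reranker runs once per candidate, which is where the added latency and cost come from.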

Query rewriting + decomposition

Before retrieval, an LLM rewrites the user's question to optimize for retrieval (e.g., expand acronyms, split multi-part questions, generate sub-queries). Sub-queries each retrieve their own context, then merge for final generation.

When to use: Multi-hop questions, ambiguous queries, conversational follow-ups that reference earlier turns.

Where it breaks: Adds an LLM call before retrieval. Sometimes the rewrite is wrong and degrades performance.
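A sketch of the decompose, retrieve, and merge flow. `decompose` is a hypothetical stand-in for the LLM rewrite step (it only splits on " and "), and `search` is a toy keyword-overlap retriever; both are assumptions for illustration.

```python
def decompose(question: str) -> list[str]:
    """Hypothetical stand-in for the LLM rewrite: split a two-part
    question on ' and '. A real system would prompt a model instead."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p for p in parts if p]

def search(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by number of terms shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

def retrieve_merged(question: str, corpus: list[str]) -> list[str]:
    """Retrieve per sub-query, then merge with order-preserving de-duplication."""
    merged: list[str] = []
    for sub_query in decompose(question):
        for chunk in search(sub_query, corpus):
            if chunk not in merged:
                merged.append(chunk)
    return merged

# Made-up corpus, for illustration only.
CORPUS = [
    "the refund window is 14 days",
    "standard shipping takes 5 days",
    "gift cards never expire",
]

context = retrieve_merged(
    "what is the refund window and how long does shipping take?", CORPUS
)
```

A single search over the combined question would have to cover both topics in one top-k list; splitting first gives each sub-question its own retrieval budget.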

GraphRAG (Microsoft Research, 2024)

Build an entity-relationship knowledge graph from the corpus at indexing time. Use graph + community summaries for retrieval, not just chunk-level vectors. Strong on cross-document synthesis questions ('what does the corpus say about X?').

When to use: Corpora where entities + relationships matter (legal, medical, scientific literature).

Where it breaks: Expensive to build the graph. Complex pipeline. Microsoft's open-source implementation is a useful reference but heavyweight.

Agentic RAG

Instead of one retrieval + one generation, the LLM iteratively decides what to retrieve next based on what it has already seen. Multi-hop tool-use loop where 'search the knowledge base' is one of the agent's tools. Significantly better for hard questions; significantly more expensive per query.

When to use: Research-grade queries, complex troubleshooting, deep technical analysis.

Where it breaks: Cost: 5-30× more LLM calls per question. Latency: 10-60+ seconds. Failure modes of agent systems apply (infinite loops, off-topic drift).
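The loop structure, including a step budget as a guard against the infinite-loop failure mode, might look like this. `decide` is a hypothetical, hard-coded stand-in for the LLM policy, and the knowledge base is made up; only the control flow is the point.

```python
# Made-up knowledge base: the first hit points to a second lookup.
KB = {
    "error E42": "E42 means the license server is unreachable; see 'license server'.",
    "license server": "Restart the license server with the admin console.",
}

def search_kb(query: str) -> str:
    """The agent's single tool: an exact-match lookup in the toy knowledge base."""
    return KB.get(query, "no results")

def decide(question: str, notes: list[str]) -> tuple[str, str]:
    """Hypothetical stand-in for the LLM policy. Returns (action, argument),
    where action is 'search' or 'answer'."""
    if not notes:
        return ("search", "error E42")
    if "see 'license server'" in notes[-1]:
        return ("search", "license server")
    return ("answer", notes[-1])

def run_agent(question: str, max_steps: int = 5) -> tuple[str, int]:
    """Iterate decide -> act until the policy answers or the budget runs out."""
    notes: list[str] = []
    for step in range(1, max_steps + 1):
        action, arg = decide(question, notes)
        if action == "answer":
            return arg, step
        notes.append(search_kb(arg))   # what was retrieved shapes the next decision
    return "step budget exhausted", max_steps

answer, steps = run_agent("How do I fix error E42?")
```

Each pass through the loop is at least one LLM call in a real system, which is where the per-query cost multiplier comes from.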

Long-context (no RAG)

With 1M-2M token context windows (Gemini 1.5/2.5, Claude Sonnet 4 beta), some teams skip retrieval entirely and just dump the relevant document set into the prompt. Works for small-to-medium corpora; fails on cost + latency for anything larger than ~50-100 documents.

When to use: When the entire relevant corpus fits comfortably in context.

Where it breaks: Inference cost scales linearly with input tokens. Lost-in-the-middle effects remain (Liu et al. 2023). Information retrieval at scale is still cheaper than 'just put it all in the context.'
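A back-of-the-envelope comparison makes the cost argument concrete. Every number below is an assumption chosen for illustration (document count, tokens per document, chunk size, and especially the price per million input tokens); substitute your own figures.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input-side cost of one query, assuming cost is linear in input tokens."""
    return tokens * price_per_million / 1_000_000

# All figures are hypothetical.
PRICE_PER_MILLION = 3.0      # assumed dollars per 1M input tokens
DOCS = 100                   # documents in the corpus
TOKENS_PER_DOC = 8_000       # assumed average document length
TOP_K = 20                   # chunks retrieved per query in the RAG setup
TOKENS_PER_CHUNK = 400       # assumed chunk size

long_tokens = DOCS * TOKENS_PER_DOC        # whole corpus in every prompt
rag_tokens = TOP_K * TOKENS_PER_CHUNK      # only the retrieved chunks

long_cost = input_cost(long_tokens, PRICE_PER_MILLION)
rag_cost = input_cost(rag_tokens, PRICE_PER_MILLION)
ratio = long_tokens / rag_tokens           # per-query input-token multiplier
```

Under these assumptions the long-context prompt carries 100 times the input tokens of the retrieval prompt on every query. Prompt caching, where available, changes the arithmetic, so treat this as a starting point.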

Vector databases

Eight options, honest notes.

pgvector (Postgres extension)

The pragmatic choice for most production teams. Same operational footprint as the rest of your stack. Performs well up to roughly 1-10M vectors; beyond that, specialized vector DBs tend to win.

Pinecone

The original commercial vector DB. Hosted-only, opinionated API, strong performance. Default pick for many enterprises through 2023.

Weaviate

Open-source + hosted. Schema-aware, hybrid-search built in. Strong for teams wanting features beyond raw vector search.

Qdrant

Open-source + hosted. Rust-based. Strong performance per dollar. Filterable payloads are a deployment-friendly feature.

Turbopuffer

Newer (2024+). S3-backed, very cheap. Strong choice when cost-per-stored-vector matters more than millisecond-level latency.

LanceDB

Embedded vector DB built on the open-source Lance columnar format. Good for client-side / edge / on-device use cases. Strong with multimodal.

Chroma

Open-source. Strong developer-experience focus. Common for prototyping; some teams ship it to production.

Milvus / Zilliz

Open-source (Milvus) + commercial (Zilliz Cloud). Strong at very large scale (billions of vectors). Heavier ops footprint.

Where RAG breaks in production

Six failure modes.

01
Chunk boundary loss. Information that spans chunks (sentences split awkwardly, tables truncated, lists cut in half) is invisible to chunk-level retrieval. Mitigations: smarter chunking, overlap, hierarchical chunking, structure-aware ingestion.
02
Embedding model mismatch. Embedding models trained on English web text are weak on code, multilingual content, scientific notation, dates, and numeric reasoning. Use the right embedding model for your domain.
03
Top-k too low. Naive RAG often retrieves only the top 5 chunks. For complex questions, you need the top 20 to 50 plus reranking. The right k depends on the question type.
04
Stale corpus. RAG only works if the corpus is current. Every production RAG system needs an ingestion pipeline, not just a one-time index build.
05
Embedding-search drift. The embedding model that was state of the art when you built the index no longer is. Re-embedding is expensive but periodically necessary.
06
Permission boundaries. RAG can leak documents the user shouldn't see if your retrieval layer doesn't enforce the same auth/permission model your application does. Document-level ACLs in the retrieval layer are non-negotiable.
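For the last failure mode, the key design choice is to filter by permission before ranking, so restricted documents never enter the candidate set. A minimal sketch, assuming a simple group-based ACL; the `Doc` fields, group names, and toy keyword scorer are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]   # groups permitted to read this document

def search_with_acl(query: str, docs: list[Doc], user_groups: set[str], k: int = 5) -> list[Doc]:
    """Filter by ACL first, then rank only the documents the user may see."""
    visible = [d for d in docs if d.allowed_groups & user_groups]
    q = set(query.lower().split())
    ranked = sorted(visible, key=lambda d: len(q & set(d.text.lower().split())), reverse=True)
    return ranked[:k]

# Made-up documents and groups.
DOCS = [
    Doc("handbook", "vacation policy and salary review dates", frozenset({"all-staff"})),
    Doc("salary-bands", "salary bands by level", frozenset({"hr"})),
]

staff_hits = search_with_acl("salary bands", DOCS, user_groups={"all-staff"})
hr_hits = search_with_acl("salary bands", DOCS, user_groups={"hr", "all-staff"})
```

Filtering after top-k retrieval instead would both risk leaks and return fewer than k results to users with narrow permissions; most vector databases expose metadata filters that can carry this check.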

Sources

[01]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al. (Meta + UCL) · 2020
https://arxiv.org/abs/2005.11401
[02]
Lost in the Middle: How Language Models Use Long Contexts
Liu et al. (Stanford + multiple) · 2023
https://arxiv.org/abs/2307.03172
[03]
Anthropic · Introducing Contextual Retrieval
Anthropic · 2024
https://www.anthropic.com/news/contextual-retrieval
[04]
GraphRAG: A new approach for discovery using complex information
Microsoft Research · 2024
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
[05]
RAG vs Long Context: A Comparison (Anthropic)
Anthropic blog post · 2024
https://www.anthropic.com/news/contextual-retrieval
[06]
BGE-Reranker · BAAI open-source reranker
BAAI · 2023+
https://huggingface.co/BAAI/bge-reranker-v2-m3
