User query → embed → vector search top-k chunks → stuff into LLM context → generate. The textbook diagram every blog post draws. Works for narrow, well-curated corpora. Fails on anything resembling a real enterprise document set.
When to use: Demo apps, FAQ bots, small knowledge bases with high editorial care.
Where it breaks: Multi-hop questions. Questions that need reasoning across documents. Documents with poor chunking. Tables. Code. Most real corpora.
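The pipeline above fits in a few lines, which is part of why it shows up in every demo. A minimal sketch, assuming a toy bag-of-words `embed` in place of a real embedding model; `embed`, `cosine`, `retrieve`, and `build_prompt` are illustrative names, and the final LLM call is left out:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Vector search: rank every chunk by similarity to the query, keep top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # "Stuff into LLM context": concatenate retrieved chunks ahead of the question.
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
top = retrieve("how long do refunds take", chunks, k=1)
prompt = build_prompt("how long do refunds take", top)
```

Note that the shipping chunk gets no credit for "takes" versus "take": exact-token matching is a limitation of the toy embedding, not of the pattern.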
02
Hybrid search (dense + sparse)
Combine vector similarity (semantic) with BM25 or a similar lexical match (keyword). Merge the two ranked lists with reciprocal rank fusion (RRF) or a similar score-blending method. Catches both 'documents similar in meaning' and 'documents that contain the exact rare term the user typed.' Substantially better than pure vector search on real corpora.
When to use: Most production RAG should start here, not at naive RAG.
Where it breaks: Still has the chunking + reasoning + multi-hop problems. Just better at the retrieval step.
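Reciprocal rank fusion is simple enough to show in full. A sketch, assuming each retriever returns an ordered list of document IDs; `k = 60` is a commonly used default rather than a tuned value, and the document IDs are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) per document; ranks start at 1.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse = ["doc_d", "doc_b", "doc_a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF uses only ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.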
03
Contextual retrieval (Anthropic, Sep 2024)
Before embedding each chunk, prepend a model-generated context blob that situates the chunk within its broader document (the section it's from, the document type, the company/topic). Anthropic reports a 35% drop in top-20 retrieval failures from contextual embeddings alone, 49% when contextual BM25 is added, and 67% with reranking on top. Pairs with hybrid search and reranking.
When to use: When chunk-level context-loss is causing retrieval failures (i.e., always).
Where it breaks: Requires a per-chunk LLM call at indexing time, so index-build time and cost are the tradeoffs.
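The indexing-time step can be sketched as follows. This is an illustration under assumptions, not Anthropic's implementation: `generate_context` is a hypothetical stand-in that fills a template where a real system would make the per-chunk LLM call, and `IndexedChunk` is an invented container.

```python
from dataclasses import dataclass

@dataclass
class IndexedChunk:
    original: str        # text handed to the generator at answer time
    contextualized: str  # text that gets embedded and indexed

def generate_context(doc_title: str, section: str, chunk: str) -> str:
    # Hypothetical stand-in for the per-chunk LLM call. A real system would
    # prompt a model with the full document plus the chunk.
    return f"This chunk is from the '{section}' section of '{doc_title}'."

def contextualize(doc_title: str, sections: dict[str, list[str]]) -> list[IndexedChunk]:
    indexed = []
    for section, chunks in sections.items():
        for chunk in chunks:
            context = generate_context(doc_title, section, chunk)
            # Prepend the context so the embedding sees where the chunk belongs.
            indexed.append(IndexedChunk(chunk, f"{context}\n{chunk}"))
    return indexed

doc = {"Q2 Results": ["Revenue grew 3% over the previous quarter."]}
indexed = contextualize("ACME 10-Q, 2023", doc)
```

On its own, "Revenue grew 3%" says nothing about which company or quarter; the prepended line is what lets a query about ACME's Q2 find it.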
04
Reranking
After retrieval brings back the top 50 or top 100 candidates, a separate cross-encoder model re-scores each one against the query. Cohere Rerank, Voyage Rerank, and BGE-Reranker are widely used options. Reranking is often the single most impactful improvement a RAG system can make.
When to use: Any production system. Almost always net-positive.
Where it breaks: Adds latency (~50-200ms). Adds cost. A bad reranker can be worse than no reranker.
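The second stage is a re-sort by a pairwise (query, document) score. A sketch, assuming a toy `overlap_score` in place of a real cross-encoder; in production the scorer would be a model call, and the function names here are illustrative:

```python
from typing import Callable

def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query terms found in the doc.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    # Score every candidate against the query, then keep the best top_n.
    # Ties keep first-stage order, so the reranker never shuffles arbitrarily.
    scored = [(score(query, doc), i, doc) for i, doc in enumerate(candidates)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [doc for _, _, doc in scored[:top_n]]

candidates = [  # hypothetical first-stage results, in retrieval order
    "password reset requires email verification",
    "router firmware updates monthly",
    "to reset the router hold the button",
]
reranked = rerank("how to reset router", candidates, overlap_score, top_n=2)
```

The scorer sees query and document together, which is what separates a cross-encoder from the independent embeddings used in the first stage, and also why it is too slow to run over the whole corpus.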
05
Query rewriting + decomposition
Before retrieval, an LLM rewrites the user's question to optimize for retrieval (e.g., expand acronyms, split multi-part questions, generate sub-queries). Each sub-query retrieves its own context, and the results are merged for final generation.
When to use: Multi-hop questions, ambiguous queries, conversational follow-ups that reference earlier turns.
Where it breaks: Adds an LLM call before retrieval. Sometimes the rewrite is wrong and degrades performance.
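The decompose-retrieve-merge flow might look like this. A sketch under loud assumptions: `decompose` splits on " and " where a real system would call an LLM, and `toy_retrieve` with its keyword `index` is invented for the example.

```python
from typing import Callable

def decompose(question: str) -> list[str]:
    # Hypothetical stand-in for the LLM rewrite step: split on " and ".
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]

def retrieve_for_all(sub_queries: list[str],
                     retrieve: Callable[[str], list[str]]) -> list[str]:
    # Run retrieval per sub-query and merge, dropping duplicate chunks
    # while preserving the order in which they were first seen.
    merged, seen = [], set()
    for sub_query in sub_queries:
        for chunk in retrieve(sub_query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

index = {  # toy keyword index, for illustration only
    "pricing": ["Plan A costs $10/month."],
    "sla": ["Uptime SLA is 99.9%.", "Plan A costs $10/month."],
}

def toy_retrieve(query: str) -> list[str]:
    return [c for key, chunks in index.items() if key in query.lower() for c in chunks]

subs = decompose("What is the pricing and what is the SLA?")
context = retrieve_for_all(subs, toy_retrieve)
```

Deduplication matters more than it looks: sub-queries on related topics tend to pull overlapping chunks, and repeats waste context budget.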
06
GraphRAG (Microsoft Research, 2024)
Build an entity-relationship knowledge graph from the corpus at indexing time. Use graph + community summaries for retrieval, not just chunk-level vectors. Strong on cross-document synthesis questions ('what does the corpus say about X?').
When to use: Corpora where entities + relationships matter (legal, medical, scientific literature).
Where it breaks: Expensive to build the graph. Complex pipeline. Microsoft's open-source implementation is a useful reference but heavyweight.
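A heavily simplified sketch of the indexing idea, not Microsoft's pipeline. It assumes entities have already been extracted per document (an LLM step in practice), links entities that co-occur in a document, and uses connected components as a stand-in for the hierarchical community detection the real system performs. All names and data are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(doc_entities: dict[str, list[str]]) -> dict[str, set[str]]:
    # Link every pair of entities that co-occur in the same document.
    graph: dict[str, set[str]] = defaultdict(set)
    for entities in doc_entities.values():
        for entity in entities:
            graph[entity]  # touch the key so isolated entities become nodes
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def communities(graph: dict[str, set[str]]) -> list[set[str]]:
    # Connected components via depth-first search (a swapped-in simplification).
    seen: set[str] = set()
    result = []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        result.append(component)
    return result

doc_entities = {  # hypothetical extraction output
    "contract_1": ["Acme Corp", "Jane Doe"],
    "contract_2": ["Jane Doe", "Globex"],
    "memo_7": ["Initech"],
}
groups = communities(build_graph(doc_entities))
```

Each community would then get an LLM-written summary, and those summaries are what answer corpus-wide questions that no single chunk covers.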
07
Agentic RAG
Instead of one retrieval + one generation, the LLM iteratively decides what to retrieve next based on what it has already seen. It is a multi-hop tool-use loop where 'search the knowledge base' is one of the agent's tools. Significantly better for hard questions; significantly more expensive per query.
When to use: Research-grade queries, complex troubleshooting, deep technical analysis.
Where it breaks: Cost: 5-30× more LLM calls per question. Latency: 10-60+ seconds. Failure modes of agent systems apply (infinite loops, off-topic drift).
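The loop, with the step budget that guards against the infinite-loop failure mode. A sketch with assumptions throughout: `decide` stands in for the LLM choosing its next action, `search` for the knowledge-base tool, and the scripted `toy_decide` and `kb` exist only to make the example runnable.

```python
from typing import Callable

Decision = tuple[str, str]  # ("search", query) or ("answer", final_text)

def agentic_answer(question: str,
                   decide: Callable[[str, list[str]], Decision],
                   search: Callable[[str], list[str]],
                   max_steps: int = 5) -> tuple[str, int]:
    # Returns (answer, number_of_searches_performed).
    evidence: list[str] = []
    for searches_done in range(max_steps):
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, searches_done
        evidence.extend(search(payload))
    # Hard stop: without a budget, a confused agent can search forever.
    return "Step budget exhausted; answer from partial evidence.", max_steps

kb = {  # toy knowledge base, for illustration only
    "owner of service X": ["Service X is owned by team Orion."],
    "on-call for team Orion": ["Orion on-call is Priya."],
}

def toy_decide(question: str, evidence: list[str]) -> Decision:
    # Scripted two-hop plan standing in for the model's reasoning.
    if not evidence:
        return ("search", "owner of service X")
    if len(evidence) == 1:
        return ("search", "on-call for team Orion")
    return ("answer", evidence[-1])

answer, searches = agentic_answer(
    "Who is on call for service X?", toy_decide, lambda q: kb.get(q, []))
```

The second search query depends on what the first one returned, which is exactly what single-shot retrieval cannot do.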
08
Long-context prompting (no retrieval)
With 1M-2M token context windows (Gemini 1.5/2.5, Claude Sonnet 4 beta), some teams skip retrieval entirely and just dump the relevant document set into the prompt. Works for small-to-medium corpora; fails on cost + latency for anything larger than ~50-100 documents.
When to use: When the entire relevant corpus fits comfortably in context.
Where it breaks: Inference cost scales linearly with input tokens. Lost-in-the-middle effects remain (Liu et al. 2023). Information retrieval at scale is still cheaper than 'just put it all in the context.'
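The linear-cost point is easy to check with back-of-the-envelope arithmetic. Every number below is an assumption chosen for illustration (document size, chunk size, a price of 3 USD per million input tokens, an 8,000-token output reserve), not a quote from any provider:

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    # Input cost grows linearly with the number of prompt tokens.
    return input_tokens / 1_000_000 * usd_per_million_tokens

def fits_in_context(input_tokens: int, context_window: int,
                    reserved_for_output: int = 8_000) -> bool:
    # Leave headroom for the model's answer.
    return input_tokens + reserved_for_output <= context_window

PRICE = 3.0  # assumed USD per million input tokens

long_context_tokens = 80 * 10_000  # 80 documents at ~10k tokens each (assumed)
rag_tokens = 10 * 500              # 10 retrieved chunks at ~500 tokens each (assumed)

long_cost = input_cost_usd(long_context_tokens, PRICE)
rag_cost = input_cost_usd(rag_tokens, PRICE)
ratio = long_context_tokens / rag_tokens  # how many times more input per query
```

Under these assumptions the long-context prompt carries 160 times the input tokens of the retrieval prompt on every query, and a corpus of 150 such documents would no longer fit in a 1M-token window at all.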