What is retrieval-augmented generation (RAG)?

Last reviewed June 2026 · ÆoNs Research Laboratory

The short answer

Retrieval-augmented generation (RAG) is a technique that grounds a large language model's output in documents fetched at query time from an external knowledge store, instead of relying only on the model's frozen training weights. The pattern was formalized by Lewis et al. at Facebook AI Research in the 2020 NeurIPS paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401), and it is now a standard architecture for AI systems that need fresh, citable, domain-specific answers.

The longer answer

RAG was introduced in May 2020 by Patrick Lewis and eleven co-authors from Facebook AI Research, University College London, and New York University, in a paper presented at NeurIPS 2020 (arXiv:2005.11401). The architecture pairs a parametric memory (a pre-trained sequence-to-sequence transformer, BART in the paper) with a non-parametric memory (a dense vector index of Wikipedia searched via Maximum Inner Product Search, MIPS). At query time, a Dense Passage Retrieval (DPR) encoder (arXiv:2004.04906, Karpukhin et al., 2020) projects the user's question into a vector, the retriever fetches the top-k passages by inner product, and the generator conditions its output on both the question and the retrieved passages.
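The retrieve-then-condition step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the three-dimensional vectors stand in for DPR embeddings, an exhaustive scan stands in for an approximate MIPS index, and the names inner_product, top_k_mips, and build_prompt are hypothetical.

```python
import heapq

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_mips(query_vec, passage_vecs, k):
    """Exact maximum inner product search over a dict of id -> vector.

    Returns the ids of the k highest-scoring passages, best first.
    A real index would use an approximate structure instead of a full scan.
    """
    scored = ((inner_product(query_vec, vec), pid) for pid, vec in passage_vecs.items())
    return [pid for _score, pid in heapq.nlargest(k, scored)]

def build_prompt(question, passage_texts):
    """Concatenate retrieved passages and the question into one generator input."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(passage_texts, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy corpus: texts and made-up embeddings (assumptions for illustration only).
PASSAGES = {
    "p1": "RAG pairs a generator with a retriever over an external index.",
    "p2": "BM25 is a sparse lexical scoring function.",
    "p3": "The retriever fetches the top-k passages for each question.",
}
PASSAGE_VECS = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.7, 0.3, 0.0],
}

query_vec = [1.0, 0.0, 0.0]  # pretend output of the question encoder
top_ids = top_k_mips(query_vec, PASSAGE_VECS, k=2)
prompt = build_prompt("How does RAG work?", [PASSAGES[pid] for pid in top_ids])
```

The generator would then receive prompt as its input. In the original paper the conditioning happens inside the seq2seq model rather than through a text template, so the template here is only a stand-in for how prompt-based systems usually pass retrieved context.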

RAG targets two failure modes of purely parametric models. First, knowledge cutoff: a model trained in January cannot know what happened in February. Second, hallucination: when a model lacks grounding for a claim, it confabulates plausible-sounding falsehoods. The 2020 RAG paper set the state of the art on three open-domain QA benchmarks (Natural Questions, TriviaQA, and WebQuestions), outperforming both closed-book parametric models and task-specific retrieve-and-extract architectures.

Modern production RAG diverges from the original paper in three ways. First, the generator is typically a frontier instruction-tuned LLM (GPT-4, Claude, Gemini, Llama 3) rather than BART. Second, the retriever is often a hybrid of dense vectors (BGE, E5, OpenAI text-embedding-3) and sparse lexical matching (BM25, as described in Robertson & Zaragoza 2009). Third, the index is rarely Wikipedia; it is the customer's own corpus, indexed in a vector database such as Pinecone, Weaviate, Qdrant, Chroma, or pgvector (a PostgreSQL extension first released in 2021).
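A hybrid retriever needs a sparse score and a way to merge the sparse and dense rankings. The sketch below is one possible arrangement, not a description of any particular product. It implements BM25 with one common IDF variant and typical default parameters (k1=1.5, b=0.75, both assumptions), and it merges rankings with reciprocal rank fusion, a fusion method chosen here for illustration that the text above does not prescribe. The dense ranking is hard-coded as a stand-in for the output of an embedding model.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document (id -> list of tokens) with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

def rank(scores):
    """Ids sorted by descending score; ties broken by id so output is deterministic."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings: each id earns 1 / (k + position) per ranking."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return rank(fused)

# Toy corpus, tokenized by whitespace (an assumption; real systems use an analyzer).
DOCS = {
    "a": "postgres vector extension".split(),
    "b": "sparse lexical matching with bm25".split(),
    "c": "dense vector embeddings".split(),
}

sparse_scores = bm25_scores(["bm25", "matching"], DOCS)
sparse_ranking = rank(sparse_scores)
dense_ranking = ["c", "b", "a"]  # hypothetical order from a dense retriever
hybrid_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Rank-based fusion avoids having to put BM25 scores and vector similarities on a common scale, which is the main reason it is used in this sketch. Weighted score combination is another option when both scores are normalized.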

The canonical pipeline has six stages: chunking (splitting documents into 200–1000 token passages), embedding (encoding each chunk as a vector), indexing (storing vectors in an ANN structure such as HNSW, Malkov & Yashunin 2018, arXiv:1603.09320), retrieval (top-k nearest neighbors at query time), reranking (often using a cross-encoder like Cohere Rerank or BGE-Reranker for higher precision on the top results), and generation (the LLM conditions on the question plus retrieved context).
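The first stage, chunking, is simple enough to show directly. The sketch below makes two assumptions that are not in the text above: tokens are approximated by whitespace-separated words (a production pipeline would count tokens with the embedding model's tokenizer), and consecutive chunks share a small overlap so that a sentence cut at a boundary still appears whole in one chunk. The function name chunk_text and its defaults are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens. Tokens are approximated by
    whitespace splitting, which is an assumption made for illustration.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a chunk reaches the end, so no trailing chunk is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

With ten tokens, a chunk size of 4, and an overlap of 1, the result is three chunks, each starting on the last token of the previous one. Each chunk would then be passed to the embedding and indexing stages.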

RAG has known failure modes documented in the literature. Stanford's "Lost in the Middle" paper (Liu et al., 2023, arXiv:2307.03172) showed that LLMs use information at the start and end of the prompt more reliably than information in the middle, producing a U-shaped accuracy curve. The RAGAS framework (Es et al., 2023, arXiv:2309.15217) proposes reference-free metrics for evaluating RAG pipelines. Anthropic's "Contextual Retrieval" research (September 2024) proposes prepending a short, chunk-specific context to each chunk before indexing, which reduced failed retrievals by 49% on Anthropic's published benchmarks, and by 67% when combined with reranking.
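Two of these ideas translate into small prompt-assembly helpers. Both functions below are hypothetical sketches rather than methods taken from the cited papers. The first applies a reordering heuristic suggested by the U-shaped curve: given passages ranked best first, it places the strongest ones at the edges of the context and the weakest in the middle. The second shows the shape of a contextualized chunk; in Anthropic's write-up the context is generated by an LLM, whereas here it is simply passed in as a string.

```python
def reorder_for_edges(ranked_passages):
    """Arrange passages (best first) so the best sit at the start and end.

    Even-ranked items fill the front in order; odd-ranked items fill the
    back in reverse, leaving the least relevant passages in the middle.
    """
    front, back = [], []
    for position, passage in enumerate(ranked_passages):
        if position % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]

def with_context_header(chunk, context):
    """Prepend a short situating context to a chunk before it is embedded or indexed."""
    return f"{context}\n\n{chunk}"

ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
reordered = reorder_for_edges(ranked)

contextual_chunk = with_context_header(
    "Revenue grew 3% over the previous quarter.",
    "This chunk is from a hypothetical company's Q2 report.",
)
```

For the five ranked passages above, the reordered list is rank1, rank3, rank5, rank4, rank2, so the two most relevant passages end up first and last. Whether this helps depends on the model and the context length, so it is best treated as something to measure rather than assume.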

Frameworks have proliferated. LangChain (first commit October 2022) and LlamaIndex (initially "GPT Index," November 2022) dominate the application layer. NVIDIA released the NeMo Retriever blueprint in 2024. Microsoft's GraphRAG (arXiv:2404.16130, April 2024) extends RAG with knowledge-graph construction over the corpus, and Anthropic's Model Context Protocol (MCP), open-sourced November 2024, standardizes how LLM clients connect to retrieval servers.

RAG is not a panacea. Long-context models (Gemini 1.5 Pro at 2M tokens, Claude 3.5 Sonnet at 200K) reduce some use cases for retrieval, and fine-tuning remains better when you need the model to internalize a style or a behavior rather than recall a fact. The practical rule is: use RAG for facts that change, and fine-tune for behavior that doesn't.

Key facts

RAG was introduced by Lewis et al. in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," posted to arXiv on 22 May 2020 and published at NeurIPS 2020 (arXiv:2005.11401).
The companion paper for the retriever, Dense Passage Retrieval (DPR), was published the same year by Karpukhin et al. at Facebook AI (arXiv:2004.04906).
HNSW (Hierarchical Navigable Small World), the dominant approximate-nearest-neighbor algorithm used in vector databases, was published by Malkov and Yashunin in 2018 (arXiv:1603.09320, IEEE TPAMI).
pgvector, the PostgreSQL extension for vector similarity search, was first released by Andrew Kane in 2021 (github.com/pgvector/pgvector).
BM25, the sparse retrieval scoring function still used in hybrid RAG pipelines, was formalized in Robertson and Zaragoza's "The Probabilistic Relevance Framework: BM25 and Beyond" (2009, Foundations and Trends in Information Retrieval).
The "Lost in the Middle" effect — degraded LLM accuracy when relevant context sits in the middle of a long prompt — was empirically documented by Liu et al. at Stanford (arXiv:2307.03172, July 2023).
Anthropic's Contextual Retrieval technique, published 19 September 2024, reduces the retrieval failure rate by 49%, and by 67% when combined with reranking (anthropic.com/news/contextual-retrieval).
Microsoft's GraphRAG paper extends RAG with LLM-constructed knowledge graphs over the corpus (arXiv:2404.16130, April 2024).
Anthropic's Model Context Protocol (MCP), an open standard for connecting LLM clients to retrieval and tool servers, was announced 25 November 2024 (modelcontextprotocol.io).
OWASP lists "LLM01: Prompt Injection" and "LLM06: Sensitive Information Disclosure" among the risks in the OWASP Top 10 for LLM Applications v1.1 (October 2023); both apply directly to RAG-equipped applications, where retrieved documents can carry injected instructions or sensitive data.

Sources

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020: arxiv.org/abs/2005.11401
Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering," EMNLP 2020: arxiv.org/abs/2004.04906
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," 2023: arxiv.org/abs/2307.03172
Anthropic, "Introducing Contextual Retrieval," 19 September 2024: anthropic.com/news/contextual-retrieval
Microsoft Research, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization," 2024: arxiv.org/abs/2404.16130
Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs," IEEE TPAMI 2018: arxiv.org/abs/1603.09320
pgvector official repository: github.com/pgvector/pgvector
OWASP Top 10 for LLM Applications v1.1: owasp.org/www-project-top-10-for-large-language-model-applications
