::synthesis · Tim-Ferriss method
RAG vs long-context · when to use which
::minimum effective dose
RAG (Retrieval-Augmented Generation) and long-context are two solutions to the same problem: getting external knowledge into a model's working memory. They have different tradeoffs, and the operator decision is mostly empirical, not philosophical. Use LONG-CONTEXT when: (1) the corpus is small enough to fit (under ~200K tokens for safety), (2) you need the model to reason across the entire document, not just retrieve facts, (3) the corpus is stable per session, so you'll re-use it many times and prompt caching makes it cheap. Use RAG when: (1) the corpus is large (10K+ documents), (2) you need fresh data (yesterday's docs, last hour's tickets), (3) you need source attribution per claim, (4) cost matters and you can't afford 200K input tokens per call. The 2024-2026 consensus that almost nobody admits: for corpus sizes between 50K and 500K tokens (assuming the model's window actually holds the corpus), long-context is usually better quality AND cheaper than naive RAG if you have prompt caching enabled. Cached input tokens are typically billed at a fraction of the normal input price (roughly a tenth on some providers), and the long-context route skips the engineering cost of building, maintaining, and tuning a retrieval pipeline. RAG only decisively wins above the long-context ceiling, or when you need fresh data, or when you need citations. The 'RAG is dead' takes are wrong, but so are the 'RAG is the answer for everything' takes. The right answer is: measure on your corpus, your queries, and your latency budget.
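The criteria above can be written down as a first-guess rule. This is a minimal sketch, not a verdict: the `Workload` fields and the `first_guess` function are hypothetical names introduced here, and the 200K threshold is simply the "for safety" figure from the text. The output is a starting hypothesis to test, since the real answer comes from measurement.

```python
from dataclasses import dataclass

# Assumption: the "under ~200K tokens for safety" figure from the text.
LONG_CONTEXT_SAFE_TOKENS = 200_000


@dataclass
class Workload:
    corpus_tokens: int
    needs_fresh_data: bool            # e.g. yesterday's docs, last hour's tickets
    needs_per_claim_citations: bool
    stable_per_session: bool          # corpus re-used across many calls


def first_guess(w: Workload) -> str:
    """Return a starting hypothesis to measure against, not a final decision."""
    if w.corpus_tokens > LONG_CONTEXT_SAFE_TOKENS:
        return "rag"
    if w.needs_fresh_data or w.needs_per_claim_citations:
        return "rag"
    if w.stable_per_session:
        return "long-context + caching"
    return "measure both"


if __name__ == "__main__":
    print(first_guess(Workload(80_000, False, False, True)))
```

Whatever the rule returns, run both candidates on the same queries before committing.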
::DiSSS · deconstruction questions
- 01 What's the total token size of my corpus, and is it stable or constantly changing?
- 02 What's my actual query pattern: fact lookup, synthesis across the corpus, or reasoning over a region?
- 03 Do I need per-claim citations for legal, factual, or trust reasons?
- 04 What's my latency budget? RAG adds retrieval overhead but cuts input tokens; long-context makes the opposite tradeoff.
- 05 Have I measured both on the same queries, or am I deciding by reputation?
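The first question is cheap to answer roughly. The sketch below estimates corpus size with a characters-per-token heuristic; the figure of 4 characters per token is an assumption (a common rule of thumb for English text), and `estimate_tokens` and `corpus_tokens` are hypothetical helper names. For a number you intend to act on, count with your model's actual tokenizer instead.

```python
from pathlib import Path

# Assumption: ~4 characters per token, a rough rule of thumb for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def corpus_tokens(root: str, pattern: str = "*.txt") -> int:
    """Sum the rough token estimate over every matching file under root."""
    return sum(
        estimate_tokens(p.read_text(encoding="utf-8", errors="ignore"))
        for p in Path(root).rglob(pattern)
        if p.is_file()
    )


if __name__ == "__main__":
    print(corpus_tokens("."))
```

If the estimate lands near the 200K line, treat it as undecided and measure with the real tokenizer.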
::fear-setting
Cost of not learning this: you'll build a RAG pipeline (vector DB, embedding pipeline, chunking strategy, re-ranking, hybrid search, query rewriting) for a problem that long-context with prompt caching would have solved in a tenth of the engineering time. Every week of RAG engineering for a small-corpus problem is a week you didn't ship the actual product. Cost of getting it wrong: silent retrieval failures. A bad RAG system doesn't crash — it just returns the wrong chunk and the model confidently synthesizes a wrong answer with citations to the wrong source. These failures look like model hallucinations but they're retrieval errors, and they're undetectable from the output alone. Operators ship RAG systems with no eval and then defend wrong answers for months.
::80 / 20 cut
SKIP: the latest RAG paper, exotic chunking strategies, the newest vector DB benchmark. OBSESS OVER: (1) measuring on YOUR corpus and queries — the right architecture depends on data shape, (2) starting with long-context + caching if your corpus fits — it's the simpler baseline, (3) building retrieval eval (precision@k, recall@k) on a held-out query set BEFORE building the pipeline. Most RAG failures are skipped-eval failures, not algorithm failures.
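For the retrieval eval in point (3), a minimal sketch of precision@k and recall@k follows. It assumes each held-out query has a hand-labeled set of relevant document IDs and that the retriever returns a ranked list of IDs. Precision is divided by k here, which is one common convention; some implementations divide by the number of results actually returned.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]      # hypothetical retriever output
    gold = {"a", "c", "e"}             # hypothetical hand labels
    print(precision_at_k(ranked, gold, 2), recall_at_k(ranked, gold, 2))
```

Averaging both numbers over the held-out query set gives a baseline to compare every later pipeline change against.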
::tribe of mentors · paraphrased stances
Jerry Liu
Co-founder of LlamaIndex, has shipped more production RAG than almost anyone
Jerry's stance: RAG is not one thing; it's a 30-knob system. The default settings work for ~50% of use cases; the rest need real evaluation and tuning. Operators who copy a tutorial and ship without measuring are building on sand.
Greg Kamradt
Ran the original needle-in-a-haystack tests across frontier models at long context
Greg's stance: long-context windows are now genuinely usable for many corpus sizes that would have required RAG in 2023. The 'when to use which' decision has shifted toward long-context for smaller corpora and RAG for the truly large or freshness-critical.
Lance Martin
LangChain engineer, publishes the most operator-friendly RAG architecture pattern catalog
Lance's stance: start with the dumbest possible RAG (chunk, embed, retrieve top-K, stuff in prompt). Measure. Only add complexity (re-ranking, hybrid search, agentic retrieval) when you can prove the simpler version is the bottleneck.
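As an illustration of that baseline (chunk, embed, retrieve top-K, stuff in prompt), here is a standard-library-only sketch. The `embed` function is a stand-in: it uses lowercase word counts rather than a real embedding model, so it only shows the shape of the pipeline. All function names and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into fixed-size word windows with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Stuff the top-k chunks into a prompt ahead of the question."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

Swapping `embed` for a real embedding model is the first upgrade; re-ranking and hybrid search come later, and only if measurement shows this version is the bottleneck.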
Anthropic prompt caching team
Built the prompt caching feature that changed the long-context economic equation
Anthropic's stance, made explicit in its cookbook: for stable corpora, long-context with caching is often the simpler, cheaper, higher-quality solution. RAG remains essential for large or dynamic corpora, but the crossover point has shifted.
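To see where the crossover sits for a given workload, the input-token arithmetic can be sketched as below. Every price and multiplier is a parameter, not a quoted rate: the example values (cache writes at 1.25x, cache reads at 0.1x of the base input price) are assumptions to replace with your provider's published pricing. The sketch also assumes the cache stays warm across all calls, and it ignores output tokens and retrieval infrastructure.

```python
def long_context_input_cost(
    corpus_tokens: int,
    query_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_write_multiplier: float,
    cache_read_multiplier: float,
) -> float:
    """Input cost when the corpus is cached: one cache write, then cache reads."""
    if calls <= 0:
        return 0.0
    write = corpus_tokens * price_per_mtok * cache_write_multiplier
    reads = corpus_tokens * price_per_mtok * cache_read_multiplier * (calls - 1)
    queries = query_tokens * price_per_mtok * calls
    return (write + reads + queries) / 1_000_000


def rag_input_cost(
    retrieved_tokens: int, query_tokens: int, calls: int, price_per_mtok: float
) -> float:
    """Input cost when each call sends only the retrieved chunks plus the query."""
    return (retrieved_tokens + query_tokens) * calls * price_per_mtok / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload and prices, for illustration only.
    print(long_context_input_cost(100_000, 200, 50, 3.0, 1.25, 0.1))
    print(rag_input_cost(4_000, 200, 50, 3.0))
```

On token cost alone, RAG with a small top-K can still come out cheaper per call; the comparison only becomes complete once quality and engineering time are included.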
::real-world test · this week
This week: take a corpus you're considering for RAG (or that you already RAG'd). If it's under 200K tokens, just paste it into Claude or Gemini's long-context window with prompt caching enabled. Run your 10 hardest queries against both your RAG system and the long-context version. Score the outputs blind. Measure cost per query and latency per query. In a surprising number of cases, you'll find long-context wins on quality and ties or wins on cost, and you'll save the engineering effort. If RAG wins, you have the data to justify the engineering.
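One way to keep the scoring blind and the bookkeeping honest is sketched below. It assumes you have already collected each system's answer, cost, and latency per query; the `Run` record, `blind_pairs`, and `summarize` are hypothetical names. The grader should be shown only the shuffled answers, never the system labels.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    system: str        # e.g. "rag" or "long_context"
    query: str
    answer: str
    cost_usd: float    # measured from your provider's usage data
    latency_s: float


def blind_pairs(runs: list[Run], seed: int = 0) -> list[tuple[str, list[tuple[str, str]]]]:
    """Group answers by query and shuffle them so order reveals nothing.

    Returns (query, [(system, answer), ...]); hide the system label when grading.
    """
    rng = random.Random(seed)
    by_query: dict[str, list[tuple[str, str]]] = {}
    for r in runs:
        by_query.setdefault(r.query, []).append((r.system, r.answer))
    pairs = []
    for query, answers in by_query.items():
        rng.shuffle(answers)
        pairs.append((query, answers))
    return pairs


def summarize(runs: list[Run], scores: dict[tuple[str, str], float]) -> dict[str, dict[str, float]]:
    """scores maps (system, query) to the blind grade; returns per-system means."""
    out: dict[str, dict[str, float]] = {}
    for system in sorted({r.system for r in runs}):
        mine = [r for r in runs if r.system == system]
        out[system] = {
            "mean_score": mean(scores[(r.system, r.query)] for r in mine),
            "mean_cost_usd": mean(r.cost_usd for r in mine),
            "mean_latency_s": mean(r.latency_s for r in mine),
        }
    return out
```

With only 10 queries the means are noisy, so read large gaps as signal and small gaps as a tie.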
::action items · ranked
- 01 Measure your corpus size in tokens; many operators assume they're in RAG territory when they're in long-context territory
- 02 Build a 10-query held-out eval set BEFORE building or tuning any retrieval system
- 03 Test long-context + caching as the baseline; only build RAG when long-context provably loses
- 04 If RAG, instrument retrieval quality (precision@k, recall@k) separately from generation quality
- 05 Document the decision (which architecture, which queries, which corpus size) so you can revisit when models improve
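For the last item, a decision record can be as small as a dataclass serialized to JSON and checked into the repo. This is a sketch under assumptions: the field names are hypothetical, chosen to cover what the item lists (architecture, queries, corpus size) plus the model and date needed to revisit the decision later.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArchitectureDecision:
    architecture: str          # e.g. "long-context + caching" or "rag"
    corpus_tokens: int
    model: str                 # the model the measurement was run against
    eval_queries: list[str]
    decided_on: str            # ISO date, e.g. "2025-01-01"
    notes: str = ""


def to_json(decision: ArchitectureDecision) -> str:
    """Serialize the record so it can live next to the code it describes."""
    return json.dumps(asdict(decision), indent=2)


if __name__ == "__main__":
    record = ArchitectureDecision(
        architecture="long-context + caching",
        corpus_tokens=120_000,
        model="example-model",          # placeholder, not a real model name
        eval_queries=["q1", "q2"],
        decided_on="2025-01-01",
    )
    print(to_json(record))
```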