
The long-context atlas
Why attention is quadratic, how the field got past it, and when long context actually beats retrieval
The quadratic that started everything
Architectural moves: how attention got cheaper
Four broad families of attention-shape changes dominate the long-context literature. They are not mutually exclusive — production models often combine them.
Sliding-window attention
linear scaling · loses long-range deps
Each token attends only to a fixed-size window of neighbors. Longformer (Beltagy, Peters, Cohan 2020, arXiv 2004.05150) introduced the combination of sliding-window local attention plus a few task-specific global tokens, dropping the complexity from O(n squared) to O(n times window). Mistral 7B uses a 4,096-token sliding window across its layers as a deliberate trade. Cheap, simple, lossy: the model loses true long-range dependency capacity in exchange for linear scaling.
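To make the O(n times window) claim concrete, here is a minimal sketch of a sliding-window mask in plain Python. The causal (look-back only) variant and the helper names `sliding_window_mask` and `attended_pairs` are assumptions for illustration; Longformer itself uses a two-sided window plus global tokens, which this sketch leaves out.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumption: causal local attention, so each token sees itself and
    the (window - 1) tokens before it.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the query-key pairs that are actually scored."""
    return sum(sum(row) for row in mask)


n, window = 16, 4
mask = sliding_window_mask(n, window)
full_causal = n * (n + 1) // 2  # pairs scored by ordinary causal attention

print(attended_pairs(mask), "of", full_causal, "causal pairs")  # 58 of 136
```

The scored pair count is bounded by n times window, so doubling the sequence length doubles the work instead of quadrupling it. The cost is visible in the same mask: token 10 can never read token 6 directly.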
Sparse attention
structured patterns · provable approximation
Instead of attending everywhere, attend on a structured sparse pattern — strided, dilated, or learned. OpenAI's Sparse Transformer (Child et al, 2019, arXiv 1904.10509) is the canonical reference; BigBird (Zaheer et al, 2020, arXiv 2007.14062) gave a clean theoretical argument that the right sparse pattern preserves the universal-approximation property of full attention. Used heavily in early long-context research, partially superseded by hardware-friendly alternatives.
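As an illustration of a structured pattern, the sketch below builds a strided mask in the spirit of the Sparse Transformer: a local band plus every stride-th earlier position. Merging both sub-patterns into one mask is a simplification made here for readability, and the name `strided_sparse_mask` is hypothetical.

```python
def strided_sparse_mask(n: int, stride: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Two causal sub-patterns are merged (a simplification):
      local   -> the previous `stride` positions, including i itself
      strided -> every position whose distance from i is a multiple of `stride`
    """
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            local = i - j < stride
            strided = (i - j) % stride == 0
            row.append(causal and (local or strided))
        mask.append(row)
    return mask


n, stride = 16, 4  # stride near sqrt(n) keeps each row at roughly 2 * sqrt(n) entries
mask = strided_sparse_mask(n, stride)
print(sum(mask[15]), "keys visible to the last token, out of", n)  # 7 out of 16
```

With the stride near the square root of n, any position can reach any earlier position in two hops (one strided, one local), which is the intuition behind keeping long-range paths while scoring far fewer pairs.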
Ring attention
exact full attention · scales with hardware
Liu, Zaharia, and Abbeel's Ring Attention (2023, arXiv 2310.01889) does not change the attention pattern at all. It shards the sequence across devices and rotates key-value blocks around a ring while each device works on its local query block. The communication overlaps with compute. The math is exact full attention; the scaling becomes a function of device count. This is the trick that makes multi-million-token training and inference financially survivable.
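The claim that the math stays exact can be checked with a small single-process simulation. This is a sketch under stated assumptions, not the paper's implementation: devices are simulated with loops, attention is non-causal, the usual 1/sqrt(d) scaling is omitted from both functions, and the names `full_attention` and `ring_attention` are hypothetical. Each simulated device keeps its query block, sees one key-value block per step, and merges it with a running (online) softmax.

```python
import math
import random


def full_attention(Q, K, V):
    """Reference softmax attention over the whole sequence (non-causal)."""
    dim = len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / total
                    for d in range(dim)])
    return out


def ring_attention(Q, K, V, devices):
    """Simulate a ring: one KV block visits each device per step."""
    n, dim = len(Q), len(V[0])
    block = n // devices  # assumes n is divisible by the device count
    blocks = [range(d * block, (d + 1) * block) for d in range(devices)]
    # Online-softmax state per query: running max, normalizer, weighted sum.
    run_max = [-math.inf] * n
    run_sum = [0.0] * n
    run_out = [[0.0] * dim for _ in range(n)]
    for step in range(devices):
        for dev in range(devices):
            kv_block = blocks[(dev + step) % devices]  # block rotated to this device
            for i in blocks[dev]:
                for j in kv_block:
                    s = sum(a * b for a, b in zip(Q[i], K[j]))
                    new_max = max(run_max[i], s)
                    scale = math.exp(run_max[i] - new_max)  # rescale old state
                    w = math.exp(s - new_max)
                    run_sum[i] = run_sum[i] * scale + w
                    run_out[i] = [o * scale + w * v
                                  for o, v in zip(run_out[i], V[j])]
                    run_max[i] = new_max
    return [[o / run_sum[i] for o in run_out[i]] for i in range(n)]


random.seed(0)
n, dim = 8, 4
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

ref = full_attention(Q, K, V)
ring = ring_attention(Q, K, V, devices=4)
max_err = max(abs(a - b) for r1, r2 in zip(ref, ring) for a, b in zip(r1, r2))
print("max abs difference vs full attention:", max_err)
```

No device ever holds more than one key-value block at a time, yet the result matches full attention up to floating-point rounding. In a real deployment the block transfer for the next step runs while the current block is being processed, which is where the compute-communication overlap comes from.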
Linear and state-space alternatives
O(n) but lossy on exact recall
A more radical line replaces softmax attention entirely. Linear attention rewrites the score function so it factorizes, giving O(n). State-space models like Mamba (Gu and Dao, 2023, arXiv 2312.00752) skip attention in favor of selective state-space layers. They scale beautifully but, as of mid-2026, hybrid stacks that mix attention with state-space or linear layers consistently outperform pure-SSM models on recall-heavy tasks. The bitter lesson here is still being written.
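The factorization behind linear attention can be sketched in a few lines. The assumptions are labeled: the feature map `phi(x) = elu(x) + 1` is one choice used in the linear-attention literature, attention is non-causal, and the function names are hypothetical. The point is only that replacing softmax with a product of feature maps lets the key-value summary be computed once and reused for every query.

```python
import math
import random


def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def kernel_attention_quadratic(Q, K, V):
    """Score every query against every key: O(n^2) pairs."""
    dv = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z for d in range(dv)])
    return out


def kernel_attention_linear(Q, K, V):
    """Same result, but keys and values are summarized once: O(n) in length."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * dk                       # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(dk))
        out.append([sum(fq[a] * S[a][d] for a in range(dk)) / denom
                    for d in range(dv)])
    return out


random.seed(1)
n, dim = 12, 3
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

slow = kernel_attention_quadratic(Q, K, V)
fast = kernel_attention_linear(Q, K, V)
gap = max(abs(a - b) for r1, r2 in zip(slow, fast) for a, b in zip(r1, r2))
print("max abs difference between the two orderings:", gap)
```

The fixed-size summary `S` is also where the lossiness comes from: every past token is folded into a dk-by-dv matrix, so exact recall of one specific token gets harder as the sequence grows.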
Position encodings: how the model knows where it is
The major position-encoding methods
Real context windows in production (best-effort, June 2026)
Lost in the middle — the failure mode that didn't go away
Liu et al, 'Lost in the Middle: How Language Models Use Long Contexts' (2023, arXiv 2307.03172), is the paper to read before designing any system that relies on long-context behavior. The finding: when a model must answer from documents placed in its context, accuracy is highest when the relevant information sits at the beginning or end of the context window and meaningfully lower when it sits in the middle. The performance curve is U-shaped, and it does not go away just because the model's advertised window got bigger.
The practical consequence is uncomfortable. A model with a 1M-token window does not attend to the middle of that window with the same fidelity as it attends to the edges. If your application puts the critical document at token position 412,000 of 1,000,000, you are paying for context the model is least likely to use. The mitigation is structural: re-rank the most important documents to the front and back of the prompt, use retrieval to keep the context smaller, or chunk and summarize before the final pass. The vendor cannot fix this for you by raising the cap further.
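The re-ranking mitigation can be sketched as a small reordering step. It assumes the documents arrive already sorted from most to least relevant by some upstream ranker; the function name `order_for_edges` is hypothetical.

```python
def order_for_edges(ranked_docs: list[str]) -> list[str]:
    """Place the strongest documents at the two ends of the prompt.

    Input is assumed to be sorted from most to least relevant.
    Even ranks fill the front, odd ranks fill the back, so the
    weakest documents end up in the middle of the context.
    """
    front, back = [], []
    for rank, doc in enumerate(ranked_docs):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(order_for_edges(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two most relevant documents land at the first and last positions, and the least relevant one sits in the middle, where the U-shaped curve says it costs the least.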
Needle-in-a-haystack: useful, but easy to game
Long context versus retrieval: the actual decision
Long context and retrieval-augmented generation (RAG) are presented in marketing as competitors. In production they are complements with different operating regimes. The decision rule that has held up for us:
- Use long context when the relevant material is small enough to fit, you cannot predict in advance what part of it the model will need, and the question requires reasoning across the whole corpus. A 60-page legal contract being reviewed for cross-references is the canonical example.
- Use retrieval when the corpus is much larger than any model's window, the relevant material per query is small, and you can afford to invest in a good index. The model never sees the parts it does not need.
- Use hybrid when the corpus is moderate, retrieval gives you the right neighborhood, and a long context lets the model widen the neighborhood at read time. This is what most production systems converge to.
- Cost is real. At flat per-token pricing, a 1M-token Claude call costs roughly 5,000x the input of a 200-token call. If you pay for a 1M window per query and the relevant content is 2K tokens, about 99.8% of the input spend buys tokens the answer never needed; you are subsidizing a failure to retrieve.
- Latency is real. Time-to-first-token at 1M-token inputs is typically tens of seconds even with cached prefixes. If your user expects a chat-speed response, long context is not transparent.
- Lost-in-the-middle is real. If your retrieval is good enough to surface the right 8K tokens, you will usually beat a 1M-token cold call on accuracy as well as cost.
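The cost bullet is plain arithmetic, sketched here so the numbers can be re-run for your own workload. It assumes flat per-token input pricing with no caching discount or long-context surcharge; the price constant is a hypothetical placeholder, not any vendor's rate.

```python
# Hypothetical flat price, for illustration only. Check your vendor's rate card.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int) -> float:
    """Input cost of one call under flat per-token pricing."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


def cost_ratio(long_tokens: int, short_tokens: int) -> float:
    """How many times more input the long call pays for."""
    return long_tokens / short_tokens


def unused_fraction(window_tokens: int, relevant_tokens: int) -> float:
    """Share of the paid input that the answer did not need."""
    return 1 - relevant_tokens / window_tokens


print(cost_ratio(1_000_000, 200))                   # 5000.0
print(f"{unused_fraction(1_000_000, 2_000):.1%}")   # 99.8%
print(f"${input_cost(1_000_000):.2f} per 1M-token call at the placeholder price")
```

The ratio does not depend on the price, only on the token counts, which is why the 5,000x figure holds for any flat rate.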
The chronology
2017
Attention is all you need
Vaswani et al introduce the transformer with full O(n squared) self-attention. Context windows of this era are short; BERT, a year later, caps its input at 512 tokens. The quadratic is acknowledged in the paper but not yet treated as a research problem.
2019
Sparse Transformer
Child, Gray, Radford, and Sutskever (arXiv 1904.10509) introduce structured sparse attention patterns. First serious move past the quadratic for autoregressive models.
2020
Longformer and BigBird
Beltagy et al (arXiv 2004.05150) and Zaheer et al (arXiv 2007.14062) formalize sliding-window-plus-global-tokens as a linear-scaling architecture with theoretical justification.
2021
RoPE and ALiBi
Su et al (arXiv 2104.09864) publish RoPE and Press et al (arXiv 2108.12409) publish ALiBi. RoPE in particular will quietly become the default position encoding for most of the long-context models that follow.
2022
FlashAttention
Dao et al (arXiv 2205.14135) publish FlashAttention, an IO-aware exact attention algorithm. It leaves the quadratic compute untouched but never materializes the full n-by-n score matrix, so attention memory grows linearly with sequence length instead of quadratically. Practical context lengths jump.
2023 (mid)
Lost in the middle
Liu, Lin, Hewitt et al (arXiv 2307.03172) publish the U-shape result. The community starts to take long-context utilization, not just long-context capacity, seriously.
2023 (late)
YaRN and Ring Attention
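Peng et al (arXiv 2309.00071) publish YaRN, which stretches a RoPE model's window with a short fine-tune; Liu, Zaharia, and Abbeel (arXiv 2310.01889) publish Ring Attention, which spreads exact attention across devices. Together they make multi-hundred-thousand-token windows tractable to train and serve.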
2024
1M-token windows in production
Gemini 1.5 Pro (Reid et al, arXiv 2403.05530) launches with a 1M-token window and demonstrates 10M experimentally. Anthropic, Meta, and OpenAI ship 100K+ windows. The leaderboard era of context length begins.
2024-2026
From capacity to utilization
Benchmarks like RULER (Hsieh et al, arXiv 2404.06654) and Michelangelo (Vodrahalli et al, 2024, arXiv 2409.12640) push beyond needle-in-haystack into multi-hop and aggregation. Real-world use settles on hybrid retrieval-plus-long-context as the default architecture.
What we recommend
If you are about to ship something that depends on long context, our minimum-effective-dose checklist:
- Measure utilization, not just capacity. Run a multi-hop needle-in-a-haystack (NIAH) test or borrow a RULER subset against your actual workload. A green single-needle heatmap is not enough.
- Order your prompt. Put the most important material near the beginning and the question near the end. Treat the middle as the cheapest seat in the house and price accordingly.
- Cache the prefix. Anthropic, OpenAI, and Google all offer prompt-caching tiers; if your long context is shared across many queries the economics change by an order of magnitude.
- Hybrid by default. Retrieve to a few thousand tokens of high-signal context, then let the model use the rest of its window for working memory and chain-of-thought.
- Read the position encoding section of the model card. Whether the vendor used PI, NTK-Aware, or YaRN tells you something real about how the model will degrade at the edge of its advertised window.
- Do not believe a single benchmark number. The honest summary of long-context capability in mid-2026 is that the field is good at retrieval, decent at single-document reasoning, and still publishing papers about multi-document aggregation.
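For the first checklist item, a multi-needle test is easy to assemble by hand. The sketch below is a minimal harness under stated assumptions: the filler text, needle facts, and the names `build_haystack` and `recall_score` are all hypothetical, and calling the model is left out because it depends on your vendor's SDK.

```python
def build_haystack(filler: list[str], needles: list[str],
                   depths: list[float]) -> list[str]:
    """Insert each needle at a fractional depth (0.0 = start, 1.0 = end)."""
    if len(needles) != len(depths):
        raise ValueError("one depth per needle is required")
    lines = list(filler)
    # Insert the deepest needle first so shallower insert positions stay valid.
    for depth, needle in sorted(zip(depths, needles), reverse=True):
        lines.insert(round(depth * len(filler)), needle)
    return lines


def recall_score(answer: str, expected_values: list[str]) -> float:
    """Fraction of needle values that appear in the model's answer."""
    hits = sum(value.lower() in answer.lower() for value in expected_values)
    return hits / len(expected_values)


filler = [f"Filler sentence number {i}." for i in range(10)]
needles = [
    "The vault code is 4417.",
    "The vault is in Lisbon.",
    "The vault opens on Tuesdays.",
]
prompt_lines = build_haystack(filler, needles, depths=[0.1, 0.5, 0.9])
question = "Where is the vault, what is its code, and when does it open?"
prompt = "\n".join(prompt_lines + [question])

# A model reply would go here; this string only demonstrates the scoring.
example_answer = "The vault is in Lisbon and its code is 4417."
print(recall_score(example_answer, ["4417", "Lisbon", "Tuesdays"]))
```

Sweeping the depths and the haystack length, and scoring partial recall rather than pass or fail, is what separates a utilization measurement from a single green cell. Replace the filler with text from your real workload before trusting the numbers.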
Sources
- [01]
Vaswani et al, 'Attention Is All You Need' — introduces the transformer and full softmax self-attention with O(n squared) cost.
arxiv.org/abs/1706.03762
- [02]
Beltagy, Peters, Cohan, 'Longformer: The Long-Document Transformer' — sliding-window plus global attention, linear scaling.
arxiv.org/abs/2004.05150
- [03]
Child, Gray, Radford, Sutskever, 'Generating Long Sequences with Sparse Transformers' — structured sparse attention patterns.
arxiv.org/abs/1904.10509
- [04]
Zaheer et al, 'Big Bird: Transformers for Longer Sequences' — theoretical justification for sparse attention preserving universal approximation.
arxiv.org/abs/2007.14062
- [05]
Liu, Zaharia, Abbeel, 'Ring Attention with Blockwise Transformers for Near-Infinite Context' — exact attention via device-ring KV rotation.
arxiv.org/abs/2310.01889
- [06]
Gu and Dao, 'Mamba: Linear-Time Sequence Modeling with Selective State Spaces' — state-space alternative to attention.
arxiv.org/abs/2312.00752
- [07]
Su et al, 'RoFormer: Enhanced Transformer with Rotary Position Embedding' — introduces RoPE, the dominant modern position encoding.
arxiv.org/abs/2104.09864
- [08]
Press, Smith, Lewis, 'Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation' — introduces ALiBi, a linear-bias position encoding that extrapolates beyond training length.
arxiv.org/abs/2108.12409
- [09]
Chen et al, 'Extending Context Window of Large Language Models via Positional Interpolation' — PI method for stretching RoPE.
arxiv.org/abs/2306.15595
- [10]
Peng, Quesnelle, Fan, Shippole, 'YaRN: Efficient Context Window Extension of Large Language Models' — piecewise frequency scaling plus attention temperature for RoPE extension.
arxiv.org/abs/2309.00071
- [11]
Liu, Lin, Hewitt et al, 'Lost in the Middle: How Language Models Use Long Contexts' — U-shaped utilization curve in long-context retrieval.
arxiv.org/abs/2307.03172
- [12]
Reid et al, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' — 1M-token production window and 10M experimental recall.
arxiv.org/abs/2403.05530
- [13]
Hsieh et al, 'RULER: What's the Real Context Size of Your Long-Context Language Models?' — benchmark suite that exposes single-needle gaming.
arxiv.org/abs/2404.06654
- [14]
Dao, Fu, Ermon, Rudra, Ré, 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness' — IO-aware exact attention, large practical context gains.
arxiv.org/abs/2205.14135
- [15]
Llama 3.1 8B/70B/405B ship with 128K-token native context windows.
ai.meta.com/blog/meta-llama-3-1 · Llama 3.1 release post (July 2024)
- [16]
Claude 3 family supports 200K-token context in the public API; check current docs for 1M-tier availability.
Anthropic model documentation · docs.anthropic.com
- [17]
GPT-4o launched with a 128K-token context window; verify current OpenAI docs for newer models.
OpenAI model documentation · platform.openai.com/docs/models
- [18]
Greg Kamradt's needle-in-a-haystack harness — the de facto NIAH benchmarking format used in 2024-2026 vendor launches.
github.com/gkamradt/LLMTest_NeedleInAHaystack
- [19]
Vodrahalli et al, 'Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries' — benchmark probing structural long-context reasoning.
arxiv.org/abs/2409.12640