The long-context atlas

Why attention is quadratic, how the field got past it, and when long context actually beats retrieval

Context length is the dimension of language models that has moved the fastest and been hyped the hardest. In 2020 a 4,096-token window was a flagship feature. By mid-2026 it is normal to see frontier models advertise one to two million tokens, with research demos pushing higher. The marketing reads like a leaderboard. The engineering reads very differently. This page is the honest atlas.

The math is the same math it was in 2017: every token attends to every other token, which makes attention scale as O(n squared) in compute and memory. Everything that has happened since is a fight against that quadratic: sliding-window approximations, sparse patterns, distributed ring attention, smarter position encodings, and post-hoc extension tricks. Each one trades something away. None of them makes the quadratic disappear; they either hide it behind better engineering or accept a lossier representation in exchange for length.

We also have a second, quieter problem. Even when a model can ingest a million tokens, it does not use them evenly. Liu et al's "Lost in the Middle" study, published mid-2023, showed a U-shaped utilization curve that has held up well: models recall best near the start and end of their context, and significantly worse in the middle. Needle-in-haystack benchmarks, the most-cited proof that "long context works", are easy to game with surface-level retrieval and tell you little about whether a model can actually reason over the middle of its own window.

We wrote this atlas because we got tired of the gap between vendor blog posts and what an engineer actually needs to decide whether to use a 1M-token window or a retrieval pipeline. Below: the math, the architectural moves, the position-encoding family, the real numbers as of June 2026 (we point to provider docs for current figures), the failure modes, and the decision rule for long context versus RAG.

The quadratic that started everything

Self-attention as introduced in Vaswani et al's 2017 transformer paper is a beautifully simple operation. Every token produces a query, a key, and a value vector. Every query is compared against every key. The resulting n by n matrix of dot-products is softmaxed and used to mix the value vectors. That is the whole thing.

The consequence is also simple. If you have n tokens, you compute n squared pairwise scores. You also store an n squared attention matrix in memory during training. Doubling your sequence length quadruples your compute and your activation memory. Going from 4K to 1M is not a 250-times increase in cost; it is a 62,500-times increase in the dominant attention term.

Key-value cache changes the shape of the problem at inference time, but not the underlying scaling. During autoregressive decoding you can cache the K and V projections for previously generated tokens, so generating the next token costs O(n) attention work instead of O(n squared). But that cache itself grows linearly in n per layer, and on real hardware the cache can dominate memory long before the compute does. A 1M-token cache for a 70-billion-parameter model at half precision is on the order of hundreds of gigabytes per request even with grouped-query attention (roughly 330 GB for a Llama-style configuration with 80 layers, 8 KV heads, and head dimension 128), and it reaches terabytes if every attention head keeps its own keys and values.

Everything in the rest of this atlas is, in some sense, a way of refusing to pay the full quadratic price.
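The arithmetic in this section is easy to check by hand. The sketch below is only a back-of-the-envelope calculator: the function names are made up for illustration, and the 70B dimensions (80 layers, 128-dimensional heads, 8 KV heads with grouped-query attention versus 64 without) are assumed Llama-style values, not figures for any specific deployment. Cache quantization or compression would change the result.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Number of query-key scores in full self-attention over n tokens."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence.

    The leading 2 counts the two cached tensors (keys and values);
    bytes_per_value=2 corresponds to half precision.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# 4K -> 1M tokens: 250x the length, 250^2 the attention scores.
short_len, long_len = 4_000, 1_000_000
length_ratio = long_len // short_len
score_ratio = attention_pair_count(long_len) // attention_pair_count(short_len)

# Assumed Llama-style 70B dimensions (illustrative only).
cache_gqa = kv_cache_bytes(long_len, n_layers=80, n_kv_heads=8, head_dim=128)
cache_mha = kv_cache_bytes(long_len, n_layers=80, n_kv_heads=64, head_dim=128)

print(length_ratio, score_ratio)         # 250 62500
print(cache_gqa / 1e9, cache_mha / 1e9)  # 327.68 2621.44 (gigabytes)
```

The number of KV heads is the lever that matters most here: sharing keys and values across query heads shrinks the cache by the sharing factor without touching the sequence length.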

Architectural moves: how attention got cheaper

Four broad families of attention-shape changes dominate the long-context literature. They are not mutually exclusive — production models often combine them.

Sliding-window attention

linear scaling · loses long-range deps

Each token attends only to a fixed-size window of neighbors. Longformer (Beltagy, Peters, Cohan 2020, arXiv 2004.05150) introduced the combination of sliding-window local attention plus a few task-specific global tokens, dropping the complexity from O(n squared) to O(n times window). Mistral 7B uses a 4,096-token sliding window across its layers as a deliberate trade. Cheap, simple, lossy: the model loses true long-range dependency capacity in exchange for linear scaling.
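As a rough illustration of the window idea, the sketch below builds a causal sliding-window mask with NumPy and counts how many query-key pairs survive. It assumes a decoder-style (causal) window and leaves out Longformer's global tokens; `sliding_window_mask` is a hypothetical helper, not an API from any of the models named above.

```python
import numpy as np


def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """Boolean causal mask: token i may attend to tokens i-window+1 .. i."""
    i = np.arange(n_tokens)[:, None]   # query positions as a column
    j = np.arange(n_tokens)[None, :]   # key positions as a row
    return (j <= i) & (i - j < window)


mask = sliding_window_mask(n_tokens=8, window=3)
windowed_pairs = int(mask.sum())       # grows like n * window
full_causal_pairs = 8 * 9 // 2         # grows like n^2 / 2

print(windowed_pairs, full_causal_pairs)  # 21 36
```

At toy sizes the saving looks small, but the windowed count grows linearly with sequence length while the full causal count grows quadratically.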

Sparse attention

structured patterns · provable approximation

Instead of attending everywhere, attend on a structured sparse pattern — strided, dilated, or learned. OpenAI's Sparse Transformer (Child et al, 2019, arXiv 1904.10509) is the canonical reference; BigBird (Zaheer et al, 2020, arXiv 2007.14062) gave a clean theoretical argument that the right sparse pattern preserves the universal-approximation property of full attention. Used heavily in early long-context research, partially superseded by hardware-friendly alternatives.
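One way to picture a structured sparse pattern is a local band combined with a strided set of earlier positions, in the spirit of the strided pattern described for the Sparse Transformer. The sketch below is an assumption-laden simplification: it merges both patterns into one mask rather than splitting them across heads, and `strided_sparse_mask` is a hypothetical name.

```python
import numpy as np


def strided_sparse_mask(n_tokens: int, stride: int) -> np.ndarray:
    """Causal mask combining a local band with a strided pattern."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    causal = j <= i
    local = (i - j) < stride              # the most recent `stride` positions
    strided = ((i - j) % stride) == 0     # every stride-th earlier position
    return causal & (local | strided)


mask = strided_sparse_mask(n_tokens=16, stride=4)
kept_pairs = int(mask.sum())
dense_pairs = 16 * 17 // 2

print(kept_pairs, dense_pairs)  # 82 136
```

With the stride chosen near the square root of the sequence length, each row keeps on the order of sqrt(n) entries, which is where the sub-quadratic cost comes from.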

Ring attention

exact full attention · scales with hardware

Liu, Zaharia, and Abbeel's Ring Attention (2023, arXiv 2310.01889) does not change the attention pattern at all. It shards the sequence across devices and rotates key-value blocks around a ring while each device works on its local query block. The communication overlaps with compute. The math is exact full attention; the scaling becomes a function of device count. This is the trick that makes multi-million-token training and inference financially survivable.
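The claim that rotating key-value blocks still yields exact attention can be checked in a single process. The sketch below simulates the ring schedule with NumPy: each simulated device keeps a running softmax (row maximum, denominator, weighted sum) and folds in whichever KV block it currently holds. It is non-causal, ignores communication and overlap entirely, and all function names are hypothetical; it illustrates the math, not the paper's implementation.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference softmax attention (non-causal), materializing all scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def ring_attention_sim(q, k, v, n_devices):
    """Simulate the ring schedule; sequence length must divide evenly."""
    d = q.shape[-1]
    q_blocks = np.split(q, n_devices)
    kv_blocks = list(zip(np.split(k, n_devices), np.split(v, n_devices)))

    # Per-device running state: row max, softmax denominator, weighted sum.
    run_max = [np.full((qb.shape[0], 1), -np.inf) for qb in q_blocks]
    denom = [np.zeros((qb.shape[0], 1)) for qb in q_blocks]
    acc = [np.zeros((qb.shape[0], v.shape[-1])) for qb in q_blocks]

    for _ in range(n_devices):
        for dev, qb in enumerate(q_blocks):
            kb, vb = kv_blocks[dev]
            scores = qb @ kb.T / np.sqrt(d)
            new_max = np.maximum(run_max[dev],
                                 scores.max(axis=-1, keepdims=True))
            rescale = np.exp(run_max[dev] - new_max)  # shrink old partial sums
            p = np.exp(scores - new_max)
            denom[dev] = denom[dev] * rescale + p.sum(axis=-1, keepdims=True)
            acc[dev] = acc[dev] * rescale + p @ vb
            run_max[dev] = new_max
        # Rotate: every device hands its KV block to the next one in the ring.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    return np.concatenate([a / s for a, s in zip(acc, denom)])


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
matches = np.allclose(ring_attention_sim(q, k, v, n_devices=4),
                      full_attention(q, k, v))
print(matches)  # True
```

No device ever holds more than one KV block at a time, which is why per-device memory depends on block size rather than on the full sequence length.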

Linear and state-space alternatives

O(n) but lossy on exact recall

A more radical line replaces softmax attention entirely. Linear attention rewrites the score function so it factorizes, giving O(n). State-space models like Mamba (Gu and Dao, 2023, arXiv 2312.00752) skip attention in favor of selective state-space layers. They scale beautifully but, as of mid-2026, hybrid stacks that mix attention with state-space or linear layers consistently outperform pure-SSM models on recall-heavy tasks. The bitter lesson here is still being written.
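To see what "factorizes" means in practice, the sketch below computes one kernelized attention two ways: by building the n by n similarity matrix, and by reassociating the matrix product so that matrix never exists. The feature map elu(x) + 1 is an assumed choice (one common option for linear attention), the example is non-causal, and the function names are hypothetical.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1, which keeps every feature strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_form(q, k, v):
    """Builds the n x n similarity matrix explicitly: O(n^2) in length."""
    sim = feature_map(q) @ feature_map(k).T
    return (sim / sim.sum(axis=-1, keepdims=True)) @ v


def linear_form(q, k, v):
    """Same result, reassociated so no n x n matrix is formed: O(n)."""
    fq, fk = feature_map(q), feature_map(k)
    kv_summary = fk.T @ v        # (d, d_v): fixed-size summary of keys/values
    k_sum = fk.sum(axis=0)       # (d,): fixed-size normalizer summary
    return (fq @ kv_summary) / (fq @ k_sum)[:, None]


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(32, 8)) for _ in range(3))
same = np.allclose(quadratic_form(q, k, v), linear_form(q, k, v))
print(same)  # True
```

The fixed-size summary is also the source of the recall weakness: every key and value is compressed into a d by d_v state, so exact lookup of one token among many is not guaranteed.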

Position encodings: how the model knows where it is

Attention as defined has no notion of order: permute the input tokens and the outputs are permuted in exactly the same way (strictly speaking, it is permutation-equivariant). Without a position signal a transformer cannot tell 'dog bit man' from 'man bit dog.' How you inject that signal turns out to matter enormously for context length.

The original 2017 transformer used additive sinusoidal embeddings. Modern long-context models almost universally use Rotary Position Embedding (RoPE), introduced by Su et al in RoFormer (2021, arXiv 2104.09864). RoPE rotates the query and key vectors by an angle that depends on absolute position; the dot-product of two rotated vectors then depends only on their relative position. This gives RoPE two clean properties: it is relative without needing a separate relative-position table, and it composes nicely with the rest of the attention math.

ALiBi (Press, Smith, Lewis, 2021, arXiv 2108.12409, ICLR 2022) takes a different route: no embedding at all, just a fixed linear penalty on attention scores that grows with the distance between query and key. The paper's title, 'Train Short, Test Long', is the headline result: ALiBi extrapolates to longer sequences than it was trained on, much more gracefully than sinusoidal embeddings.

The practical story of 2023 onward is RoPE extension. The base RoPE only works well up to the lengths it was trained on. Position Interpolation (PI, kaiokendev / Chen et al, 2023) scales position indices down so the model thinks it's still operating in its trained range. NTK-Aware scaling (bloc97, community, 2023) instead rescales the RoPE base, so the compression falls mostly on the lower-frequency dimensions and high-frequency detail is left nearly intact. YaRN (Peng et al, 2023, arXiv 2309.00071) combines a piecewise frequency scaling with an attention temperature adjustment and reports 10x fewer tokens and 2.5x fewer training steps than prior extension methods. As of June 2026 most production long-context models use some descendant of these techniques to stretch a model trained at one length to operate at several times that length.
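The relative-position property of RoPE and the index compression used by Position Interpolation can both be seen in a few lines. The sketch below is a minimal single-vector version: it assumes the interleaved pair layout and the base of 10000, and `rope_rotate` with its `scale` parameter is a hypothetical helper rather than any library's API. Real implementations differ in layout and batching.

```python
import numpy as np


def rope_rotate(x, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of x by angles proportional to position.

    scale > 1 applies Position Interpolation: positions are divided by
    `scale`, so an extended window maps back into the trained range.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = (position / scale) * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (7), very different absolute positions.
score_near = rope_rotate(q, 10) @ rope_rotate(k, 3)
score_far = rope_rotate(q, 1010) @ rope_rotate(k, 1003)
relative_only = np.isclose(score_near, score_far)

# PI with scale 4: position 8192 is rotated exactly like position 2048.
pi_match = np.allclose(rope_rotate(q, 8192, scale=4.0), rope_rotate(q, 2048))

print(relative_only, pi_match)  # True True
```

The PI check also shows the cost named in the table below: after compression, neighbouring positions are only a quarter of a step apart, so the fastest-rotating pairs have less room to tell them apart.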

The major position-encoding methods

Method	Year / arXiv	Mechanism	Strength	Weakness
Sinusoidal	2017 (Vaswani et al)	Additive fixed sinusoidal embeddings	Simple, universal	Poor length extrapolation
Learned absolute	2018+	Trainable per-position embedding	Easy to add	Cannot extrapolate past trained length at all
RoPE	2021 / 2104.09864	Rotate Q and K by position-dependent angle	Relative behavior, plays well with attention math	Native window limited to training length
ALiBi	2021 / 2108.12409	Linear distance bias on attention scores	Strong length extrapolation, no params	Less expressive than RoPE in some benches
PI (Position Interpolation)	2023 / 2306.15595	Linearly compress RoPE indices into training range	Cheap extension with small fine-tune	Loses high-frequency positional detail
NTK-Aware scaling	2023 (community)	Rescale the RoPE base so mostly low-frequency dims are compressed	Often works zero-shot	Sensitive to base length
YaRN	2023 / 2309.00071	Piecewise frequency scaling + attention temperature	Best efficiency in its class at release	Still requires some fine-tuning for best results
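The ALiBi row can be made concrete with a short sketch of the bias tensor. It assumes a causal model and a head count that is a power of two, for which the slopes form the geometric sequence 2^(-8/n), 2^(-16/n), and so on; `alibi_slopes` and `alibi_bias` are hypothetical helper names.

```python
import numpy as np


def alibi_slopes(n_heads: int) -> np.ndarray:
    """Per-head slopes for a power-of-two head count."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)


def alibi_bias(n_tokens: int, n_heads: int) -> np.ndarray:
    """Bias of shape (heads, queries, keys) added to pre-softmax scores."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    distance = np.maximum(i - j, 0)   # causal: only past keys are penalized
    return -alibi_slopes(n_heads)[:, None, None] * distance


bias = alibi_bias(n_tokens=5, n_heads=8)
print(alibi_slopes(8)[:3])   # [0.5   0.25  0.125]
print(bias[0, 4, 0])         # -2.0 (slope 0.5 times distance 4)
```

Because the bias depends only on distance and has no learned parameters, nothing in it is tied to the training length, which is the intuition behind the extrapolation result.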

Real context windows in production (best-effort, June 2026)

The numbers below are best-effort as of June 2026. Providers update these aggressively and the difference between 'supported in API,' 'available on a private preview,' and 'demonstrated in a research paper' matters. Check the provider's official docs before quoting them in a contract.

Claude (Anthropic): the public API has supported 200,000-token context across the Claude 3 family since early 2024. Anthropic has published research and partner work using a 1M-token window, and as of mid-2026 a 1M-token tier is available on some Claude lines for enterprise customers. Treat 200K as the broadly-deployed number and 1M as the research / select-tier number until you check the current model card.

Gemini (Google DeepMind): Gemini 1.5 Pro shipped with a 1M-token context in 2024 and the underlying research paper (Reid et al, 2024, arXiv 2403.05530) demonstrated near-perfect needle recall up to 10M tokens experimentally. Newer Gemini generations have continued to support multi-million-token windows in production.

GPT-4o (OpenAI): GPT-4o launched with a 128K-token context window. Some later OpenAI models have advertised larger windows; verify with the current OpenAI model documentation before quoting a number.

Llama 3.1 (Meta): the 8B, 70B, and 405B releases (Meta, July 2024) all shipped with a 128K-token native context. The Llama 3.1 paper documents the RoPE-based length-extension recipe used to reach that window during continued pretraining.

The pattern across all four vendors is the same: 100K-class windows are now table stakes, 1M-class windows are available somewhere on the price sheet, and the headline 'experimental' numbers are research results, not API SLAs.

Lost in the middle — the failure mode that didn't go away

Liu et al, 'Lost in the Middle: How Language Models Use Long Contexts' (2023, arXiv 2307.03172), is the paper that you should read before designing any system that relies on long-context behavior. The finding: when models are given documents to retrieve over, accuracy is highest when the relevant information is at the beginning or end of the context window, and meaningfully lower when it is in the middle. The performance curve is U-shaped, and it does not go away just because the model's advertised window got bigger.

The practical consequence is uncomfortable. A model with a 1M-token window does not actually attend to the middle of that window with the same fidelity it attends to the edges. If your application puts the critical document in token position 412,000 of 1,000,000, you are paying for context the model is least likely to use. The mitigation is structural: re-rank the most important documents to the front and back of the prompt, use retrieval to keep context smaller, or chunk-and-summarize before the final pass. The vendor cannot fix this for you by raising the cap further.
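The re-ranking mitigation can be sketched as a small reordering step. The function below assumes you already have documents sorted by relevance (from a retriever or re-ranker of your choice) and deals them alternately to the front and back of the prompt, so the weakest candidates land in the middle. `order_for_edges` is a hypothetical helper, and this is one simple placement policy among several possible ones.

```python
def order_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the prompt.

    Input is sorted most-relevant first. Documents are dealt alternately to
    the front and the back, so the least relevant end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, doc in enumerate(docs_by_relevance):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]   # A is most relevant, E least
print(order_for_edges(ranked))        # ['A', 'C', 'E', 'D', 'B']
```

The two strongest documents (A and B) end up at the two edges, and E, the weakest, sits at the center where recall is lowest.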

Needle-in-a-haystack: useful, but easy to game

The most common 'proof' that a long-context model works is a needle-in-a-haystack (NIAH) test. The setup: insert a single distinctive fact ('the secret word is purple') at a controlled position in a long irrelevant document, then ask the model to retrieve it. Plot recall as a function of insertion depth and context length. Greg Kamradt's NIAH harness, released in late 2023, popularized this format and most vendor launches now include a NIAH heatmap.

NIAH is informative, but it is also the easiest possible long-context task. The needle is lexically distinctive and the question is direct retrieval. A model can do well on NIAH using essentially fuzzy keyword matching, with no real reasoning over the surrounding context. Several published critiques, along with the RULER benchmark from NVIDIA (Hsieh et al, 2024, arXiv 2404.06654), have shown that models which look perfect on basic NIAH degrade sharply on multi-hop NIAH, variable-tracking, and aggregation tasks at the same context length.

When you see a vendor publishing a green NIAH heatmap at 1M tokens, what you have learned is that the model can do single-fact retrieval at 1M tokens. You have learned very little about whether it can summarize, compare, or reason over a 1M-token document. Treat NIAH as a smoke test, not a capability claim.
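A single-needle test is simple enough to sketch in full, which is part of the point. The harness below is a minimal illustration, not Kamradt's code: `build_haystack`, `run_niah`, and the `ask_model` callable are hypothetical names, and the stand-in "model" is plain substring matching, included to show that a perfect score is reachable without any reasoning.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def run_niah(ask_model: Callable[[str, str], str], filler: list[str],
             needle: str, question: str, expected: str,
             depths: list[float]) -> dict[float, bool]:
    """Return, per insertion depth, whether the answer contains `expected`."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask_model(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results


def keyword_model(context: str, question: str) -> str:
    """Stand-in for a model call: pure keyword matching, no reasoning."""
    marker = "secret word is "
    start = context.find(marker)
    return context[start + len(marker):].split(".")[0] if start >= 0 else ""


filler = [f"Sentence {i} is about nothing in particular." for i in range(200)]
results = run_niah(keyword_model, filler, "The secret word is purple.",
                   "What is the secret word?", "purple", [0.0, 0.5, 1.0])
print(results)  # {0.0: True, 0.5: True, 1.0: True}
```

To turn this into a harder probe, the needle would need to be split across several facts that must be combined, which is the direction multi-hop and aggregation benchmarks take.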

Long context versus retrieval: the actual decision

Long context and retrieval-augmented generation (RAG) are presented in marketing as competitors. In production they are complements with different operating regimes. The decision rule that has held up for us:

Use long context when the relevant material is small enough to fit, you cannot predict in advance what part of it the model will need, and the question requires reasoning across the whole corpus. A 60-page legal contract being reviewed for cross-references is the canonical example.
Use retrieval when the corpus is much larger than any model's window, the relevant material per query is small, and you can afford to invest in a good index. The model never sees the parts it does not need.
Use hybrid when the corpus is moderate, retrieval gives you the right neighborhood, and a long context lets the model widen the neighborhood at read time. This is what most production systems converge to.
Cost is real. At flat per-token pricing, a 1M-token Claude call is roughly 5,000x the input cost of a 200-token call. If you are paying for a 1M window per query and the relevant content is 2K tokens, you are subsidizing the model's failure to retrieve.
Latency is real. Time-to-first-token at 1M-token inputs is typically tens of seconds even with cached prefixes. If your user expects a chat-speed response, long context is not transparent.
Lost-in-the-middle is real. If your retrieval is good enough to surface the right 8K tokens, you will usually beat a 1M-token cold call on accuracy as well as cost.
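The list above can be compressed into a few lines of code, mostly as a way to make the inputs to the decision explicit. This is a sketch under stated assumptions: `choose_strategy` and `input_cost_ratio` are hypothetical helpers, the 10x cutoff for "much larger than the window" is a placeholder to tune for your own workload, and the cost ratio assumes a flat price per input token with no caching discount.

```python
def input_cost_ratio(long_tokens: int, short_tokens: int) -> float:
    """Relative input cost, assuming a flat price per input token."""
    return long_tokens / short_tokens


def choose_strategy(corpus_tokens: int, window_tokens: int,
                    needs_whole_corpus: bool) -> str:
    """Encode the three-way rule: long context, retrieval, or hybrid.

    The 10x cutoff for "much larger than the window" is a placeholder.
    """
    if corpus_tokens <= window_tokens and needs_whole_corpus:
        return "long-context"
    if corpus_tokens > 10 * window_tokens and not needs_whole_corpus:
        return "retrieval"
    return "hybrid"


print(input_cost_ratio(1_000_000, 200))               # 5000.0
print(choose_strategy(40_000, 200_000, True))         # long-context
print(choose_strategy(50_000_000, 1_000_000, False))  # retrieval
print(choose_strategy(3_000_000, 1_000_000, False))   # hybrid
```

Latency and mid-context recall are deliberately left out of the function; in practice they act as tie-breakers that push borderline cases toward retrieval or hybrid.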

The chronology

2017
Attention is all you need
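Vaswani et al introduce the transformer with full O(n squared) self-attention. The paper's translation models work on sentence-length inputs; the 512-token limit often associated with this era comes from later models such as BERT. The quadratic is acknowledged in the paper but not yet treated as a research problem.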
2019
Sparse Transformer
Child, Gray, Radford, and Sutskever (arXiv 1904.10509) introduce structured sparse attention patterns. First serious move past the quadratic for autoregressive models.
2020
Longformer and BigBird
Beltagy et al (arXiv 2004.05150) and Zaheer et al (arXiv 2007.14062) formalize sliding-window-plus-global-tokens as a linear-scaling architecture with theoretical justification.
2021
RoPE and ALiBi
Su et al (arXiv 2104.09864) publish RoPE and Press et al (arXiv 2108.12409) publish ALiBi. These two position-encoding schemes will quietly become the dominant choices for every long-context model that follows.
2022
FlashAttention
Dao et al (arXiv 2205.14135) publish FlashAttention, an IO-aware exact attention algorithm. It leaves the quadratic compute in place but cuts attention memory from quadratic to linear in sequence length by never materializing the full score matrix in GPU high-bandwidth memory. Practical context lengths jump.
2023 (mid)
Lost in the middle
Liu, Lin, Hewitt et al (arXiv 2307.03172) publish the U-shape result. The community starts to take long-context utilization, not just long-context capacity, seriously.
2023 (late)
YaRN and Ring Attention
Peng et al (arXiv 2309.00071) publish YaRN; Liu, Zaharia, and Abbeel (arXiv 2310.01889) publish Ring Attention. Together they make multi-hundred-thousand-token windows tractable to train and fine-tune.
2024
1M-token windows in production
Gemini 1.5 Pro (Reid et al, arXiv 2403.05530) launches with a 1M-token window and demonstrates 10M experimentally. Anthropic, Meta, and OpenAI ship 100K+ windows. The leaderboard era of context length begins.
2024-2026
From capacity to utilization
Benchmarks like RULER (Hsieh et al, arXiv 2404.06654) and Michelangelo (Vodrahalli et al, 2024, arXiv 2409.12640) push beyond needle-in-haystack into multi-hop and aggregation. Real-world use settles on hybrid retrieval-plus-long-context as the default architecture.

What we recommend

If you are about to ship something that depends on long context, our minimum-effective-dose checklist:

Measure utilization, not just capacity. Run a multi-hop NIAH or borrow a RULER subset against your actual workload. A green single-needle heatmap is not enough.
Order your prompt. Put the most important material near the beginning and the question near the end. Treat the middle as the cheapest seat in the house and price accordingly.
Cache the prefix. Anthropic, OpenAI, and Google all offer prompt-caching tiers; if your long context is shared across many queries the economics change by an order of magnitude.
Hybrid by default. Retrieve to a few thousand tokens of high-signal context, then let the model use the rest of its window for working memory and chain-of-thought.
Read the position encoding section of the model card. Whether the vendor used PI, NTK-Aware, or YaRN tells you something real about how the model will degrade at the edge of its advertised window.
Do not believe a single benchmark number. The honest summary of long-context capability in mid-2026 is that the field is good at retrieval, decent at single-document reasoning, and still publishing papers about multi-document aggregation.

Sources

[01]
Vaswani et al, 'Attention Is All You Need' — introduces the transformer and full softmax self-attention with O(n squared) cost.
arxiv.org/abs/1706.03762
[02]
Beltagy, Peters, Cohan, 'Longformer: The Long-Document Transformer' — sliding-window plus global attention, linear scaling.
arxiv.org/abs/2004.05150
[03]
Child, Gray, Radford, Sutskever, 'Generating Long Sequences with Sparse Transformers' — structured sparse attention patterns.
arxiv.org/abs/1904.10509
[04]
Zaheer et al, 'Big Bird: Transformers for Longer Sequences' — theoretical justification for sparse attention preserving universal approximation.
arxiv.org/abs/2007.14062
[05]
Liu, Zaharia, Abbeel, 'Ring Attention with Blockwise Transformers for Near-Infinite Context' — exact attention via device-ring KV rotation.
arxiv.org/abs/2310.01889
[06]
Gu and Dao, 'Mamba: Linear-Time Sequence Modeling with Selective State Spaces' — state-space alternative to attention.
arxiv.org/abs/2312.00752
[07]
Su et al, 'RoFormer: Enhanced Transformer with Rotary Position Embedding' — introduces RoPE, the dominant modern position encoding.
arxiv.org/abs/2104.09864
[08]
Press, Smith, Lewis, 'Train Short, Test Long: Attention with Linear Biases (ALiBi)' — linear-bias position encoding that extrapolates beyond training length.
arxiv.org/abs/2108.12409
[09]
Chen et al, 'Extending Context Window of Large Language Models via Positional Interpolation' — PI method for stretching RoPE.
arxiv.org/abs/2306.15595
[10]
Peng, Quesnelle, Fan, Shippole, 'YaRN: Efficient Context Window Extension of Large Language Models' — piecewise frequency scaling plus attention temperature for RoPE extension.
arxiv.org/abs/2309.00071
[11]
Liu, Lin, Hewitt et al, 'Lost in the Middle: How Language Models Use Long Contexts' — U-shaped utilization curve in long-context retrieval.
arxiv.org/abs/2307.03172
[12]
Reid et al, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' — 1M-token production window and 10M experimental recall.
arxiv.org/abs/2403.05530
[13]
Hsieh et al, 'RULER: What's the Real Context Size of Your Long-Context Language Models?' — benchmark suite that exposes single-needle gaming.
arxiv.org/abs/2404.06654
[14]
Dao, Fu, Ermon, Rudra, Re, 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness' — IO-aware exact attention, large practical context gains.
arxiv.org/abs/2205.14135
[15]
Llama 3.1 8B/70B/405B ship with 128K-token native context windows.
ai.meta.com/blog/meta-llama-3-1 · Llama 3.1 release post (July 2024)
[16]
Claude 3 family supports 200K-token context in the public API; check current docs for 1M-tier availability.
Anthropic model documentation · docs.anthropic.com
[17]
GPT-4o launched with a 128K-token context window; verify current OpenAI docs for newer models.
OpenAI model documentation · platform.openai.com/docs/models
[18]
Greg Kamradt's needle-in-a-haystack harness — the de facto NIAH benchmarking format used in 2024-2026 vendor launches.
github.com/gkamradt/LLMTest_NeedleInAHaystack
[19]
Vodrahalli et al, 'Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries' — benchmark probing structural long-context reasoning.
arxiv.org/abs/2409.12640
