A matte-black aluminum heatsink with a single bio-cyan LED — where inference actually runs.

Inference · the atlas

What actually happens when you call a model.

Inference is the slow-and-expensive half of AI in 2026. Training is one-time; inference is forever. This page is the bottom-up walk: tokenization, prefill, decode, KV cache, then the optimizations that make modern serving possible — FlashAttention, paged attention, continuous batching, speculative decoding, prompt caching.

The pipeline

Four stages, one model call.

Tokenization

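Input text is split into tokens (sub-word pieces). Most current models use a subword vocabulary learned with BPE (Byte-Pair Encoding: GPT, Llama, Mistral) or with the unigram algorithm (T5). SentencePiece is a library that implements both and is used by T5, Gemini, and the earlier Llama and Mistral releases. A single token is ~3-4 characters of English on average. The tokenizer is part of the model contract: using the wrong tokenizer corrupts everything downstream.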

Why it matters: Tokenization is the input-cost denominator. 1M tokens of input means whatever the model's tokenizer thinks 1M tokens means — and different models tokenize the same text differently.
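To make the "different tokenizers, different token counts" point concrete, here is a toy BPE sketch. It is a teaching illustration, not any production tokenizer: the function names (learn_bpe_merges, apply_merge, tokenize) and the tiny corpus are hypothetical, and real tokenizers work on bytes with much larger merge tables.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Deterministic tie-break so the example is reproducible.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(tokenize("lowly", merges))
```

The same word tokenized with a different merge table yields a different number of tokens, which is why a token count only means something relative to one specific tokenizer.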

Prefill

First inference pass: the input tokens are processed in parallel through the model. At every layer the model computes attention across the input tokens, writes each token's keys and values into the KV cache, and produces the hidden states that feed the next layer. Prefill is compute-bound (the GPU does parallel matrix math on N tokens at once). On a frontier-tier model, prefill takes roughly 10-100 ms per 1k input tokens, depending on model size and hardware.

Why it matters: Prefill cost scales with input length. Long-context prompts cost more in prefill — and prefill cost is paid before the first token of output is generated.
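A back-of-envelope sketch of why prefill scales with input length. It assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass and ignores the attention term that grows quadratically with prompt length. The function name, the 70e9 parameter count, the 4e15 FLOP/s aggregate peak, and the 50% utilization are all hypothetical illustration values.

```python
def prefill_seconds(n_params, n_prompt_tokens, peak_flops, utilization=0.5):
    """Rough compute-bound estimate of prefill time.

    Assumes ~2 FLOPs per parameter per token and ignores the
    attention cost that grows with the square of the prompt length.
    """
    flops = 2.0 * n_params * n_prompt_tokens
    return flops / (peak_flops * utilization)

# Hypothetical deployment: 70e9 parameters, 4e15 FLOP/s aggregate peak.
short_prompt = prefill_seconds(70e9, 1_000, 4e15)
long_prompt = prefill_seconds(70e9, 32_000, 4e15)
print(f"1k-token prompt:  {short_prompt * 1e3:.0f} ms before the first output token")
print(f"32k-token prompt: {long_prompt * 1e3:.0f} ms before the first output token")
```

Under these assumptions the wait before the first output token grows in direct proportion to prompt length; real systems deviate once the quadratic attention term and scheduling overheads become significant.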

Decode

Each subsequent token is generated one at a time. The model reads the KV cache, computes attention against it, samples the next token, and appends that token's keys and values to the cache. Decode is memory-bandwidth-bound (the GPU has to stream the model weights and the KV cache from HBM on every step). Per-step decode latency is what caps tokens per second for a single request.

Why it matters: Decode is what users feel as 'how fast is the model.' A 100-token completion takes 100 decode steps. Per-step latency × 100, plus the prefill time, is the wait you see.
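The arithmetic above can be sketched as a bandwidth-bound estimate. This is a simplification for batch size 1 that treats the KV cache as fixed in size during generation; the function names and all hardware figures (140 GB of weights, 10 GB of KV cache, 24 TB/s aggregate HBM bandwidth, 70 ms of prefill) are hypothetical.

```python
def decode_step_seconds(weight_bytes, kv_cache_bytes, hbm_bytes_per_s):
    """Bandwidth-bound estimate: each step streams weights and KV cache once."""
    return (weight_bytes + kv_cache_bytes) / hbm_bytes_per_s

def completion_seconds(time_to_first_token, n_output_tokens, step_seconds):
    """Total wait = prefill latency + one decode step per generated token."""
    return time_to_first_token + n_output_tokens * step_seconds

# Hypothetical numbers: 140 GB of weights, 10 GB of KV cache,
# 24 TB/s aggregate HBM bandwidth, 70 ms of prefill.
step = decode_step_seconds(140e9, 10e9, 24e12)
total = completion_seconds(0.07, 100, step)
print(f"per-step latency: {step * 1e3:.2f} ms, 100-token completion: {total:.3f} s")
```

Batching changes the picture: the weights are streamed once per step for the whole batch, so per-request cost drops even though per-step latency does not.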

KV cache

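Stored attention keys + values for every previous token in the current sequence. Memory cost grows linearly with sequence length, batch size, layer count, and the number of KV heads. A 70B-class model with full multi-head attention (80 layers, 8192 hidden dimension, fp16) holds ~80 GB+ of KV cache for a single 32k-token sequence; grouped-query attention shrinks that by the ratio of query heads to KV heads. KV cache management is the dominant memory pressure at serving time.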

Why it matters: KV cache is why GPU memory pressure scales with context length. Long-context inference is more expensive than parameter-count alone implies.
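A minimal sketch of the sizing arithmetic. The layer count, head counts, and head dimension below are an assumed 70B-class shape chosen for illustration, not the published configuration of any specific model; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes of KV cache: one key and one value vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3

# Assumed 70B-class shape: 80 layers, head_dim 128, fp16 (2 bytes per element).
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=32_768)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"64 KV heads (full multi-head): {mha / GIB:.0f} GiB per 32k sequence")
print(f" 8 KV heads (grouped-query):   {gqa / GIB:.0f} GiB per 32k sequence")
```

Because every factor multiplies, doubling the context or the batch doubles the cache, and this memory sits on top of the model weights.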

The optimizations

Eight techniques modern inference depends on.

FlashAttention (Dao et al. 2022 + FA2 2023 + FA3 2024)

Reformulates the attention computation to avoid materializing the full attention matrix in slow GPU memory. Tile the computation, keep intermediates in fast SRAM. Massive speedup at long context. FA3 is the H100-optimized version. Effectively all 2024+ inference engines use FlashAttention.
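The core trick is an online softmax: keys and values are visited one tile at a time while a running maximum and running sum keep the result exact. The sketch below shows that idea for a single query in plain Python. It is not the FlashAttention kernel: there is no GPU or SRAM management, the 1/sqrt(d) scaling is omitted, and the function names are hypothetical.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Online-softmax attention for one query, visiting K/V one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized output accumulator
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [2.0, 0.0], [-0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

Both functions produce the same output up to floating-point rounding, but the tiled version never holds more than one tile of scores at a time, which is what lets a real kernel keep its working set in fast on-chip memory.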

Paged Attention (vLLM, 2023)

Treats the KV cache like virtual memory in an OS: the cache is split into fixed-size blocks (pages) that are allocated on demand and mapped to sequences through a block table, instead of reserving worst-case contiguous space per request. This cuts fragmentation and allows much higher batch utilization. Introduced with vLLM and since adopted by other inference engines (e.g., TGI, TensorRT-LLM).
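A minimal sketch of the bookkeeping behind paging, assuming a fixed pool of equally sized blocks and a per-sequence block table. The class and method names are hypothetical and this is not vLLM's actual implementation; it tracks block ownership only and stores no tensors.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences grow one block at a time, on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a scheduler would preempt here")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")   # 3 tokens -> 2 blocks
alloc.append_token("b")       # 1 token  -> 1 block
print(alloc.block_tables, "free:", alloc.free_blocks)
alloc.free("a")
print("after freeing a:", alloc.free_blocks)
```

A sequence wastes at most one partially filled block, so memory that a worst-case reservation would hold idle stays available for other requests.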

Continuous batching

Process multiple requests in the same forward pass, with new requests joining the batch as old ones finish (instead of waiting for a uniform batch). Massively improves GPU utilization on multi-tenant inference servers. The base inference-serving pattern in 2024-2026.
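A small simulation of the scheduling difference, counting forward passes only. It assumes every request is already queued, each active request emits one token per step, and prefill cost for joining requests is ignored; the function names and request lengths are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass yields one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [10, 2, 2, 2, 2, 2]  # one long request, five short ones
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With static batching the short requests wait on the long one in their batch; with continuous batching the second slot keeps turning over short requests while the long one runs.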

Speculative decoding

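Use a small draft model to generate several candidate tokens, then verify them in a single parallel forward pass of the large target model. Every candidate the target accepts is a token produced without its own target decode step; at the first rejection the target supplies its own token, so the output distribution matches the target model's. Standard practice on frontier inference systems. Tokens-per-second gains of 2-3× are typical.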
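A sketch of the greedy variant of draft-and-verify, using toy next-token functions in place of real models. The function names and the toy "models" are hypothetical. The sampling variant used in practice replaces the exact-match check with a rejection-sampling rule, which is not shown here.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks all k+1 positions. A real engine does this in
        #    one batched forward pass; here it is a loop for clarity.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + draft[:i])
            if i < k and draft[i] == expected:
                accepted.append(draft[i])
            else:
                # Correction at the first mismatch, or a bonus token if all k matched.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

# Toy models over integer tokens: the draft disagrees whenever the last token is 5.
def target_next(tokens):
    return (tokens[-1] * 3 + 1) % 7

def draft_next(tokens):
    return 0 if tokens[-1] == 5 else target_next(tokens)

out, passes = speculative_decode(target_next, draft_next, [1], n_new=12)
print(out, "target passes:", passes)
```

The output is identical to decoding with the target alone, and the number of target passes is smaller than the number of generated tokens whenever the draft agrees often enough.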

Medusa + EAGLE (multi-head speculation)

Variants of speculative decoding where the speculation heads are attached to and trained on top of the target model (no separate draft model required). EAGLE-2 (2024) and EAGLE-3 (2025) report ~3-5× speedup on some workloads. Active research area; production support is uneven.

Prefix caching

If many requests share a common prompt prefix (system prompt, conversation history, RAG corpus), cache the prefill work for that prefix once and reuse it across requests. OpenAI's prompt caching and Anthropic's prompt-caching feature both expose this. Massive cost reduction on multi-tenant + agentic workloads.
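A minimal sketch of how a server might detect a reusable prefix: hash the prompt block by block, with each key covering everything up to that block, and reuse the longest run of known keys. The class name, block size, and placeholder cache values are hypothetical; real engines store KV tensors and handle eviction.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache keyed by a rolling hash over fixed-size token blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.store = {}  # prefix hash -> placeholder for that block's cached KV

    def _block_keys(self, tokens):
        """Yield (end_index, key) per full block; each key covers the whole prefix."""
        h = hashlib.sha256()
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            h.update(repr(tokens[end - self.block_size:end]).encode())
            yield end, h.hexdigest()

    def match(self, tokens):
        """Number of leading tokens whose prefill work is already cached."""
        hit = 0
        for end, key in self._block_keys(tokens):
            if key not in self.store:
                break
            hit = end
        return hit

    def insert(self, tokens):
        """Record every full block of this prompt as cached."""
        for end, key in self._block_keys(tokens):
            self.store.setdefault(key, f"kv[:{end}]")  # stand-in for real KV tensors

cache = PrefixCache(block_size=4)
system_prompt = list(range(8))                   # shared 8-token prefix
cache.insert(system_prompt + [100, 101, 102, 103])
reused = cache.match(system_prompt + [200, 201, 202, 203, 204])
print("tokens served from cache:", reused)
```

Only exact prefix matches count: a change to any token in the prefix changes the hash of every later block, which is why stable content belongs at the start of the prompt.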

Grouped-Query Attention (GQA)

Shares each K and V projection across a group of query heads, so the model has fewer KV heads than query heads. Llama 2 70B used GQA with 8 KV heads instead of MHA. Cuts KV cache size by 4-8× with minimal quality loss. Standard in 2023+ frontier models.
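The grouping can be sketched as an index mapping from query heads to shared KV heads. The function names are hypothetical, and the 64-query-head count is an assumed configuration for illustration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head to the KV head shared by its group."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_reduction(n_q_heads, n_kv_heads):
    """Factor by which the KV cache shrinks relative to one KV head per query head."""
    return n_q_heads / n_kv_heads

# Assumed shape: 64 query heads sharing 8 KV heads.
mapping = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
print(mapping[:10], "...", mapping[-1])
print("KV cache reduction:", kv_cache_reduction(64, 8))
```

Setting n_kv_heads equal to n_q_heads recovers standard multi-head attention, and setting it to 1 gives multi-query attention, so the same mapping covers both ends of the range.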

MQA → GQA → MLA progression

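Multi-Query Attention (one KV head shared by all query heads) was the aggressive original. Grouped-Query Attention is the compromise: a few KV heads, recovering most of the quality MQA gave up. Multi-head Latent Attention (MLA, DeepSeek-V2, 2024) caches a compressed low-rank latent instead of full keys and values for further savings. Each design sits at a different point on the trade-off between quality and KV-cache size.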

Inference economics

Six facts about cost.

01
Input tokens cost much less than output tokens. Output is typically priced at 4-10× input (e.g., Claude 3.5 Sonnet is $3/M input vs $15/M output, a 5× ratio). Reason: prefill is parallel (fast); decode is sequential (slow + memory-bound).
02
Long input is cheap relative to short input plus long output. RAG over a 32k-token corpus is much cheaper than asking the model to generate 32k tokens of analysis.
03
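Prompt caching cuts the cost of repeated input by up to ~10× when used. Anthropic's prompt-caching feature bills cache reads at 10% of the base input price (cache writes carry a 25% surcharge). OpenAI's prompt caching applies a similar automatic discount to cached input tokens (50% at launch).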
04
Speculative decoding doesn't change the price you pay (you pay for the same model) but does cut your latency. Different from cost-optimization.
05
Batched inference is cheaper per request when you can wait. The OpenAI Batch API and Anthropic's Message Batches API offer 50% price discounts for asynchronous workloads that can wait up to 24 hours.
06
Inference compute can dominate training compute. A model trained once for $100M can cost $10M+ per year to serve at scale, and that bill recurs for as long as the model is deployed. This is why labs increasingly target inference-economically-optimal architectures (smaller models trained on more tokens, GQA, MQA, MLA).
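Facts 01, 02, and 05 can be checked with a few lines of arithmetic. The sketch below uses the $3/M input and $15/M output prices quoted above and a 50% batch discount; the function name and the 500-token figures are hypothetical, and real bills include details (caching, minimums, tiering) that are not modeled.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_per_m=3.0, output_per_m=15.0, batch_discount=0.0):
    """Cost of one request at per-million-token prices, with an optional batch discount."""
    base = input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m
    return base * (1.0 - batch_discount)

rag = request_cost_usd(32_000, 500)                          # long input, short answer
essay = request_cost_usd(500, 32_000)                        # short input, long generation
batched = request_cost_usd(32_000, 500, batch_discount=0.5)  # same RAG call via a batch API

print(f"32k in / 500 out:  ${rag:.4f}")
print(f"500 in / 32k out:  ${essay:.4f}")
print(f"32k in / 500 out, batched: ${batched:.4f}")
```

At these prices, reading 32k tokens costs a fraction of generating 32k tokens, and moving the same call to a batch endpoint halves it again.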
