FlashAttention (Dao et al. 2022 + FA2 2023 + FA3 2024)
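Reformulates the attention computation to avoid materializing the full attention matrix in slow GPU memory (HBM). The computation is tiled: blocks of Q, K, and V are loaded into fast on-chip SRAM, and an online softmax keeps running statistics so each tile's contribution can be rescaled and accumulated without ever storing the N×N score matrix. The result is exact attention, not an approximation, with memory that grows linearly rather than quadratically in sequence length and large speedups at long context. FA3 is the version optimized for Hopper GPUs (H100). Nearly all inference engines from 2024 onward use FlashAttention or a kernel derived from it.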
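The tiling idea can be illustrated without a GPU. The sketch below is not the FlashAttention kernel: it is a pure-Python, single-query illustration of the online-softmax accumulation that makes tiling possible, and it omits the usual 1/sqrt(d) scaling. The function names and the `tile` parameter are made up for this example.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Visit keys/values one tile at a time, keeping only a running max,
    a running softmax denominator, and a running output accumulator."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point error; the tiled version never holds more than `tile` scores at a time, which is the property the real kernel exploits to stay inside SRAM.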
Paged Attention (vLLM, 2023)
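Treats the KV cache like virtual memory in an OS: the cache is split into fixed-size blocks (pages) that are allocated on demand as each sequence grows, instead of reserving a contiguous worst-case region per request. A per-sequence block table maps logical token positions to physical blocks. This cuts fragmentation and allows much higher batch utilization. Introduced in vLLM and since adopted, in some form, by most modern inference engines (e.g. TGI, TensorRT-LLM).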
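A minimal sketch of the bookkeeping, assuming a fixed pool of equally sized blocks. `PagedKVCache` and its methods are hypothetical names for illustration, not vLLM's API, and the sketch tracks only block ownership, not the tensors themselves.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:  # first token, or the sequence's last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV blocks; a real engine would preempt or queue")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return self.block_tables[seq_id][-1], offset

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

A sequence holds only as many blocks as its current length requires, and its blocks need not be contiguous, so memory freed by one request is immediately usable by any other.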
Continuous batching
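Schedules at the granularity of a single decoding iteration: multiple requests share each forward pass, finished requests leave the batch immediately, and waiting requests join at the next step (instead of the whole batch waiting for its longest member). Greatly improves GPU utilization on multi-tenant inference servers, especially when output lengths vary. The baseline inference-serving pattern in 2024-2026.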
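A toy simulation of the scheduling difference, assuming each forward pass produces exactly one token per running request and ignoring prefill cost. Both function names are invented for this sketch.

```python
from collections import deque

def continuous_batching_steps(output_lengths, max_batch):
    """Count forward passes when free slots are refilled every iteration."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One forward pass: every active request emits one token.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps

def static_batching_steps(output_lengths, max_batch):
    """Count forward passes when each batch runs until its longest member ends."""
    steps = 0
    for i in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[i:i + max_batch])
    return steps
```

With one long request among many short ones, the static scheduler leaves slots idle while the long request finishes; the continuous scheduler refills them, so it needs fewer forward passes for the same work.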
Speculative decoding
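Use a small draft model to generate several candidate tokens, then verify them in parallel with a single forward pass of the large target model. Every candidate the target accepts is a token obtained without its own sequential target-model step; at the first rejection the target's own token is used instead, so the output still follows the target model's distribution. Standard practice on frontier inference systems. Tokens-per-second gains of 2-3× are typical, depending on how often the draft's guesses are accepted.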
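A sketch of one draft-then-verify round for the greedy case only, assuming both models are plain functions from a token list to the next token. Sampling-based variants use a probabilistic accept/reject rule that is not shown here. The toy `target_next` and `draft_next` models are invented for illustration.

```python
def speculative_step(target_next, draft_next, context, k):
    """Return the tokens produced by one round of greedy speculative decoding."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # 2. The target checks each position. A real engine gets all of these
    #    predictions from ONE forward pass over context + proposal; the loop
    #    here only keeps the sketch simple.
    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)
    # 3. Everything matched: the same target pass also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted

# Toy models: the target counts upward mod 10; the draft agrees except after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

Each round yields between 1 and k + 1 tokens, and in this greedy setting the output is identical to what the target model would have produced on its own.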
Medusa + EAGLE (multi-head speculation)
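Variants of speculative decoding where lightweight speculation heads are trained on top of the target model itself (no separate draft model required). Medusa adds extra decoding heads that predict several future tokens at once; EAGLE drafts autoregressively from the target model's hidden features. EAGLE-2 (2024) and EAGLE-3 (2025) report ~3-5× speedups on some workloads. Active research area; production support is uneven.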
Prefix caching
If many requests share a common prompt prefix (system prompt, conversation history, shared retrieved documents), compute the prefill KV cache for that prefix once and reuse it across requests. OpenAI's and Anthropic's prompt-caching features both expose this. Large cost and latency reduction on multi-tenant and agentic workloads.
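A minimal sketch of block-level prefix reuse, assuming prompts are lists of token ids. Each block is keyed by the entire prefix up to its end, because a block's keys and values depend on every token before it. `PrefixCache` and the `compute_kv` callback are hypothetical stand-ins, not any provider's or engine's API.

```python
class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # full token prefix (as a tuple) -> cached KV for its last block

    def prefill(self, tokens, compute_kv):
        """Return (kv_blocks, reused_tokens) for the full blocks of `tokens`.

        `compute_kv` is a stand-in for the real prefill computation; a trailing
        partial block is left out of the cache in this sketch.
        """
        kv_blocks, reused = [], 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.blocks:
                reused += self.block_size  # prefill work skipped for this block
            else:
                self.blocks[key] = compute_kv(tokens[end - self.block_size:end])
            kv_blocks.append(self.blocks[key])
        return kv_blocks, reused
```

Two requests that start with the same system prompt share every full block of that prompt; only the blocks after the point where they diverge are computed again.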
Grouped-Query Attention (GQA)
Lets groups of query heads share a single K and V head, reducing the number of heads that need separate K and V projections. Llama 2 70B used GQA with 8 KV heads (for 64 query heads) instead of MHA. Cuts KV cache size by 4-8× with minimal quality loss. Standard in frontier models since 2023.
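The savings follow directly from the cache-size arithmetic. The sketch below assumes the commonly reported Llama 2 70B shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and 2-byte fp16 cache entries; the helper name is made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence; the leading 2 counts keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed model shape (see the paragraph above); fp16 cache, 4096-token context.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 80, 64, 8, 128

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, seq_len=4096)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096)     # 8 shared KV heads

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha // gqa}x")
# prints "MHA: 10.00 GiB, GQA: 1.25 GiB, ratio: 8x"
```

The reduction factor is simply query heads divided by KV heads, so it scales directly into how many concurrent sequences fit in the same GPU memory.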
MQA → GQA → MLA progression
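Multi-Query Attention (one KV head shared by all query heads) was the aggressive original. Grouped-Query Attention is the compromise: a few KV heads, recovering most of the quality MQA gives up. Multi-head Latent Attention (MLA, DeepSeek 2024) caches a low-rank latent compression of K and V instead of the heads themselves, for further savings. Each design picks a different point on the trade-off between quality and KV-cache size.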
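The low-rank idea behind MLA can be sketched as follows, assuming one down-projection to a small latent and two up-projections back to keys and values. This is a simplified illustration with random weights and made-up names; it leaves out details of the published design, such as the separate handling of rotary position embeddings.

```python
import random

def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class LatentKVCache:
    def __init__(self, d_model, d_latent, d_kv, seed=0):
        rng = random.Random(seed)
        self.w_down = random_matrix(d_latent, d_model, rng)  # hidden state -> latent
        self.w_up_k = random_matrix(d_kv, d_latent, rng)     # latent -> all key heads
        self.w_up_v = random_matrix(d_kv, d_latent, rng)     # latent -> all value heads
        self.latents = []  # the only per-token state that is cached

    def append(self, hidden):
        """Cache d_latent numbers for this token instead of 2 * d_kv."""
        self.latents.append(matvec(self.w_down, hidden))

    def keys_values(self):
        """Reconstruct keys and values from the cached latents on demand."""
        keys = [matvec(self.w_up_k, c) for c in self.latents]
        values = [matvec(self.w_up_v, c) for c in self.latents]
        return keys, values
```

With `d_kv` equal to heads × head dimension, the cache holds `d_latent` values per token rather than `2 * d_kv`, so the saving is set by how small the latent can be made before quality suffers.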