
Transformer variants: a field map
Every major architectural lineage since "Attention Is All You Need" — what changed, what it solved, what's still live
The three classical stacks
Efficient attention: the O(n²) problem and the responses to it
State-space and recurrent alternatives
These are not transformers in the strict 2017 sense — they replace self-attention with a different sequence operator. They earn a place on this map because they have begun reaching production for long-context and on-device workloads as of 2025-2026 (best-effort summary; check provider model cards for current production status).
Mamba
Gu & Dao, 2023 — arxiv.org/abs/2312.00752
Selective state-space model (S6). Compute scales linearly with sequence length, and autoregressive inference carries a fixed-size recurrent state rather than a growing KV cache, with perplexity competitive with same-size transformers on language modeling. The selectivity (input-dependent state-space parameters) is the key change versus earlier S4/S5 models. Strongest commercial relevance to date has been in long-context, on-device, and DNA/genomics workloads.
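A minimal sketch of what "selective" means in practice, assuming a single channel, a diagonal state matrix, and scalar inputs. The weight names (w_B, w_C, w_dt, b_dt) are hypothetical stand-ins for learned projections; in the real model these projections read the full token embedding and the scan runs as a fused hardware-aware kernel. Treat this as an illustration of the recurrence, not the reference implementation.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def selective_scan(xs, A, w_B, w_C, w_dt, b_dt):
    """Toy single-channel selective SSM recurrence.

    xs         : list of scalar inputs, one per time step
    A          : list of N negative reals (diagonal state matrix)
    w_B, w_C   : hypothetical weights; B_t = w_B * x_t, C_t = w_C * x_t
    w_dt, b_dt : hypothetical scalars; dt_t = softplus(w_dt * x_t + b_dt)
    """
    n = len(A)
    h = [0.0] * n  # fixed-size state: does not grow with len(xs)
    ys = []
    for x in xs:
        dt = softplus(w_dt * x + b_dt)   # input-dependent step size
        B = [w * x for w in w_B]         # input-dependent input map
        C = [w * x for w in w_C]         # input-dependent output map
        for i in range(n):
            a_bar = math.exp(dt * A[i])  # decay in (0, 1) because A[i] < 0
            b_bar = dt * B[i]            # simplified Euler-style step (sketch assumption)
            h[i] = a_bar * h[i] + b_bar * x
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys

ys = selective_scan([1.0, 0.0, 2.0], A=[-1.0, -0.5],
                    w_B=[1.0, 0.3], w_C=[0.7, 1.0], w_dt=0.5, b_dt=0.0)
```

Because B, C, and the step size all depend on the current input, the model can decide per token how much to write into the state and how quickly to forget it. The state `h` is the only thing carried between steps, which is where the constant-memory inference comes from.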
Mamba-2
Dao & Gu, 2024 — arxiv.org/abs/2405.21060
Structured state-space duality (SSD). Establishes a formal equivalence between a class of state-space models and masked attention with a 1-semiseparable mask, and uses it to derive a core layer that runs 2-8x faster than Mamba's selective scan on modern accelerators. The paper reports that SSD overtakes FlashAttention-2 at around 2K tokens and is about 6x faster at 16K. Influential as a unification result, not just an architecture.
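The duality can be checked numerically on a toy case. The sketch below assumes a scalar per-step decay a_t and a single input channel; the function names are illustrative and not taken from the paper's code. It computes the same output two ways: as a linear-time recurrence, and as masked attention with Q = C, K = B, V = x and a mask built from cumulative products of the decays.

```python
def ssm_recurrent(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t in range(len(x)):
        h = [a[t] * h[i] + B[t][i] * x[t] for i in range(n)]
        ys.append(sum(C[t][i] * h[i] for i in range(n)))
    return ys

def semiseparable_mask(a):
    """L[t][s] = a[s+1] * ... * a[t] for s <= t, and 0 above the diagonal."""
    T = len(a)
    L = [[0.0] * T for _ in range(T)]
    for t in range(T):
        L[t][t] = 1.0
        for s in range(t - 1, -1, -1):
            L[t][s] = L[t][s + 1] * a[s + 1]
    return L

def ssm_attention_form(a, B, C, x):
    """Quadratic form: y_t = sum_{s<=t} (C_t . B_s) * L[t][s] * x_s."""
    L = semiseparable_mask(a)
    ys = []
    for t in range(len(x)):
        acc = 0.0
        for s in range(t + 1):
            score = sum(c * b for c, b in zip(C[t], B[s]))
            acc += score * L[t][s] * x[s]
        ys.append(acc)
    return ys

# Arbitrary toy values (not from the paper).
a = [0.9, 0.5, 0.8, 0.7]
B = [[1.0, 0.0], [0.5, -1.0], [2.0, 0.3], [-0.4, 1.1]]
C = [[0.2, 1.0], [1.5, 0.1], [-0.7, 0.6], [0.9, -0.3]]
x = [1.0, -2.0, 0.5, 3.0]
y_rec = ssm_recurrent(a, B, C, x)
y_att = ssm_attention_form(a, B, C, x)
```

The two outputs agree up to floating-point rounding. The recurrent form costs O(T) and suits inference; the attention-like form is a batch of matrix multiplications and suits training hardware. The speedups the paper reports come from a blocked algorithm that mixes both forms, which this sketch does not attempt.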
Hyena
Poli et al, 2023 — arxiv.org/abs/2302.10866
Implicitly-parametrized long convolutions plus data-controlled gating, as a subquadratic drop-in for attention. A precursor to Mamba in spirit (attention-free, subquadratic, data-dependent) rather than a direct architectural ancestor; Mamba itself descends from the S4 state-space line. Notable commercial deployment is in protein and DNA language modeling, where the long-context advantage compounds.
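A rough sketch of the operator's shape, under loud assumptions: `implicit_filter` is a hypothetical closed-form stand-in (a decaying cosine) for the small learned network that generates Hyena's filters, and the gate sequences are passed in directly, whereas in the real model they are projections of the input. The convolution is written as a direct loop for readability; the subquadratic cost in practice comes from evaluating it with FFTs.

```python
import math

def implicit_filter(length, freq, decay):
    # Stand-in for a learned filter generator: the filter is a function of
    # position t, so its length is decoupled from its parameter count.
    return [math.exp(-decay * t) * math.cos(freq * t) for t in range(length)]

def causal_conv(h, z):
    # y[t] = sum_{s<=t} h[t-s] * z[s]. Direct O(L^2) loop for clarity only;
    # practical implementations use FFTs for O(L log L).
    return [sum(h[t - s] * z[s] for s in range(t + 1)) for t in range(len(z))]

def hyena_operator(v, gates, filters):
    # Alternate a long causal convolution with elementwise, data-controlled gating.
    z = v
    for gate, h in zip(gates, filters):
        conv = causal_conv(h, z)
        z = [g * c for g, c in zip(gate, conv)]
    return z

# Toy order-2 example with made-up values.
v = [1.0, 2.0, 3.0, 4.0]
gates = [[0.5, 1.0, -1.0, 2.0], [1.0, 0.0, 1.0, 0.5]]
filters = [implicit_filter(4, 0.3, 0.1), implicit_filter(4, 1.2, 0.5)]
y = hyena_operator(v, gates, filters)
```

The gating is what makes the operator data-controlled: a plain convolution applies the same filter regardless of content, while the elementwise gates let the input modulate what passes through at each position.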
RWKV
Peng et al, 2023 — arxiv.org/abs/2305.13048
Linear-attention RNN-transformer hybrid that runs as a parallel transformer at training time and as an RNN at inference time. Now at the RWKV-7 generation (as of 2025, best-effort). Strong community and on-device focus; commercial enterprise deployment is more limited than Mamba's.
RetNet
Sun et al, 2023 — arxiv.org/abs/2307.08621
Retention mechanism that aims for training parallelism, low-cost inference, and good language-modeling performance simultaneously. From Microsoft Research. Active research interest, modest commercial deployment as of mid-2026 (best-effort).
Mixture-of-experts: scale by sparsity instead of by density
Multimodal fusion: four ways to bolt vision onto a language model
The arc of the field, condensed
2017
Attention Is All You Need
Vaswani et al introduce the encoder-decoder transformer for machine translation. The architecture is published as a Google Research paper at NeurIPS 2017.
2018
BERT
Devlin et al strip out the decoder and pretrain a bidirectional encoder with masked-language-modeling. NLU benchmarks fall over.
2019
T5 and GPT-2
Google reframes all NLP as text-to-text with the T5 encoder-decoder. OpenAI scales the decoder-only GPT stack. The decoder-only path begins to pull ahead for generation.
2020
GPT-3 and the efficient-attention wave
GPT-3 establishes that scale plus in-context learning eats most fine-tuning. Longformer, Reformer, Linformer try to break the O(n²) wall.
2021
Switch Transformer makes MoE practical
Fedus, Zoph, Shazeer simplify routing to top-1 and stabilize training in bfloat16. Sparse models become a credible engineering target.
2022
FlashAttention and Flamingo
FlashAttention makes exact softmax attention IO-efficient, neutralizing most efficient-attention approximations below ~16K tokens. Flamingo establishes the cross-attention-into-frozen-LLM pattern for vision-language.
2023
Llama, Mistral, and Mamba
Llama and Mistral give the open community a clean decoder-only template: RoPE, SwiGLU, and RMSNorm from Llama, with grouped-query and sliding-window attention added in Mistral 7B. Mamba arrives as the first genuinely competitive state-space alternative to attention.
2024
Mixtral, Mamba-2, DeepSeek-V3
Mixtral brings sparse MoE to open weights. Mamba-2 unifies state-space and attention through structured state-space duality. DeepSeek-V3 ships a 671B / 37B-active open-weights MoE that resets cost expectations at the frontier.
2025-2026
Convergence onto MoE plus long context
Frontier commercial models settle into the pattern of large MoE decoders with hardware-friendly attention and increasingly hybrid (attention + state-space) blocks for very long context. Architecture stops being the headline variable; data, post-training, and inference stack take over.
What this map does not say
The architecture is rarely the bottleneck. Most measurable quality differences between frontier models in 2026 come from training data composition, post-training (RLHF, DPO, GRPO, constitutional methods), inference-time compute, and tool-use scaffolding, not from the transformer variant under the hood. Two models with identical architecture can differ by 20+ percentage points on benchmarks depending on what they were trained on and how they were tuned. Conversely, models with very different architectures (a dense Llama-3-style decoder, a Mixtral-style MoE, and a hybrid Mamba-attention stack at the same active-parameter budget) can be remarkably close in user-visible quality if their data and post-training pipelines are comparable.
If you are picking a model to ship a product on, the architecture is approximately the least important variable. Pick on benchmarks for your task, latency at your batch size, cost at your token volume, and the provider's track record on stability and policy. Pick the architecture later or, more honestly, never; let the provider pick it for you.
How to use this page
If you are orienting yourself in the transformer literature, here is the minimum-effective-dose reading order. Skip anything you already know. Stop the moment you have enough.
- Read the 2017 paper once. Vaswani et al, arxiv.org/abs/1706.03762. Everything else is a delta against this.
- Read BERT (arxiv.org/abs/1810.04805) and GPT-3 (arxiv.org/abs/2005.14165) for the encoder-only and decoder-only forks.
- Read FlashAttention (arxiv.org/abs/2205.14135) — not an architecture, but the reason most efficient-attention papers stopped mattering in practice.
- Read Llama (arxiv.org/abs/2302.13971) for the de-facto modern decoder template.
- Read Mixtral (arxiv.org/abs/2401.04088) for the canonical small-open MoE.
- Read Mamba and Mamba-2 (arxiv.org/abs/2312.00752, arxiv.org/abs/2405.21060) for the strongest non-transformer sequence model line as of 2026.
- Read DeepSeek-V3 (arxiv.org/abs/2412.19437) for what a current frontier-scale MoE actually looks like in detail.
- Stop. The rest is gradient. Spend the time saved on data and evals instead.
Sources
- [01]
Vaswani et al, Attention Is All You Need, 2017 — the original encoder-decoder transformer for machine translation.
arxiv.org/abs/1706.03762
- [02]
Devlin et al, BERT, 2018 — bidirectional encoder pretraining via masked language modeling and next-sentence prediction.
arxiv.org/abs/1810.04805
- [03]
Liu et al, RoBERTa, 2019 — BERT trained longer with more data and without next-sentence prediction substantially improves benchmarks.
arxiv.org/abs/1907.11692
- [04]
He et al, DeBERTa, 2020 — disentangled attention with separate content and position matrices, plus enhanced mask decoder.
arxiv.org/abs/2006.03654
- [05]
Brown et al, GPT-3, 2020 — scale and in-context few-shot learning replace most task-specific fine-tuning.
arxiv.org/abs/2005.14165
- [06]
Touvron et al, Llama, 2023 — open-weights decoder-only transformer with RoPE, SwiGLU, and RMSNorm that became the field template.
arxiv.org/abs/2302.13971
- [07]
Jiang et al, Mistral 7B, 2023 — sliding-window attention and grouped-query attention on a Llama-derived decoder.
arxiv.org/abs/2310.06825
- [08]
Raffel et al, T5, 2019 — text-to-text reframing of all NLP tasks on a single encoder-decoder model.
arxiv.org/abs/1910.10683
- [09]
Lewis et al, BART, 2019 — denoising autoencoder encoder-decoder for generation and summarization.
arxiv.org/abs/1910.13461
- [10]
Beltagy, Peters, Cohan, Longformer, 2020 — sliding-window plus global attention for long documents at linear cost.
arxiv.org/abs/2004.05150
- [11]
Kitaev, Kaiser, Levskaya, Reformer, 2020 — LSH-based attention plus reversible residual layers for memory efficiency.
arxiv.org/abs/2001.04451
- [12]
Wang et al, Linformer, 2020 — low-rank projection of keys and values for linear-complexity self-attention.
arxiv.org/abs/2006.04768
- [13]
Poli et al, Hyena Hierarchy, 2023 — long convolutions plus data-controlled gating as a subquadratic drop-in for attention.
arxiv.org/abs/2302.10866
- [14]
Dao et al, FlashAttention, 2022 — exact IO-efficient attention that neutralized most efficient-attention approximations in production.
arxiv.org/abs/2205.14135
- [15]
Gu and Dao, Mamba, 2023 — selective state-space model (S6) with input-dependent parameters competitive with transformers on language modeling.
arxiv.org/abs/2312.00752
- [16]
Dao and Gu, Mamba-2, 2024 — structured state-space duality with 2-8x speedup over Mamba; crossover with FlashAttention-2 around 2K tokens.
arxiv.org/abs/2405.21060
- [17]
Peng et al, RWKV, 2023 — linear-attention RNN-transformer hybrid that trains in parallel and infers as an RNN.
arxiv.org/abs/2305.13048
- [18]
Sun et al, RetNet (Retentive Network), 2023 — retention mechanism designed for parallel training and recurrent inference.
arxiv.org/abs/2307.08621
- [19]
Fedus, Zoph, Shazeer, Switch Transformer, 2021 — top-1 expert routing with stable bfloat16 training and up to 7x pre-training speedup.
arxiv.org/abs/2101.03961
- [20]
Du et al, GLaM, 2021 — 1.2T-parameter MoE matching GPT-3 quality at substantially lower training energy.
arxiv.org/abs/2112.06905
- [21]
Jiang et al, Mixtral of Experts, 2024 — 47B-total / 13B-active Apache-2.0 MoE with top-2 routing across 8 experts per layer, 32K context.
arxiv.org/abs/2401.04088
- [22]
DeepSeek-AI, DeepSeek-V3 Technical Report, December 2024 — 671B total / 37B active MoE with multi-head latent attention (MLA), trained on 14.8T tokens.
arxiv.org/abs/2412.19437
- [23]
Alayrac et al, Flamingo, 2022 — gated cross-attention layers inserted between frozen LLM layers plus a Perceiver resampler for vision-language fusion.
arxiv.org/abs/2204.14198
- [24]
Li et al, BLIP-2, 2023 — Q-Former with learnable query tokens bridges a frozen vision encoder to a frozen LLM.
arxiv.org/abs/2301.12597
- [25]
Liu et al, LLaVA, 2023: a single linear projection of CLIP features into the LLM embedding space (later versions use a small MLP); trained on GPT-4-synthesized instruction data.
arxiv.org/abs/2304.08485