Transformer variants: a field map

Every major architectural lineage since "Attention Is All You Need" — what changed, what it solved, what's still live

The transformer is not one thing. It is a family. The 2017 Vaswani et al paper "Attention Is All You Need" (arxiv.org/abs/1706.03762) introduced a specific encoder-decoder stack with scaled dot-product self-attention and fixed sinusoidal positional encodings, built for machine translation. Almost nothing shipped at scale today is exactly that architecture. The lineage forked early, forked often, and the variants now diverge enough that grouping them as "transformers" hides more than it reveals.

This page is a field map. For each major branch we name the architectural change, link the primary paper, say what it actually solved, name who shipped it commercially, and note whether it is still live in production as of June 2026 (best-effort, not exhaustive; provider docs are the source of truth for current model lineups).

We avoid the framing that any one branch "won." Encoder-only models still dominate retrieval. Decoder-only models dominate chat. Encoder-decoder is still the default for summarization and seq2seq. State-space models are reaching production for very long context. Mixture-of-experts is now standard at frontier scale. Multimodal fusion is its own design space.

Treat this as orientation, not endorsement. Architectures matter less than data, training compute, post-training, and inference stack, a fact the field has been slow to absorb. We mark where the architectural claim is genuine ("this changed what was possible") versus where it is mostly branding ("this is a slight reshuffle of an existing pattern"). When we are not certain of something, we say so.

The three classical stacks

The original 2017 paper described an encoder-decoder transformer for translation. Within two years the field had separated this into three distinct stacks that are still the textbook taxonomy.

Encoder-only models read a full sequence bidirectionally and emit per-token or pooled representations. They are what you use when the output is a class, a span, an embedding, or a score, not free-form text. BERT (Devlin et al, 2018, arxiv.org/abs/1810.04805) launched this branch by training on masked-token prediction plus next-sentence prediction, and is still the genealogical root of most embedding and reranker models in production today. RoBERTa (Liu et al, 2019, arxiv.org/abs/1907.11692) showed that BERT was significantly undertrained: same architecture, more data and longer training, much better numbers. DeBERTa (He et al, 2020, arxiv.org/abs/2006.03654) added disentangled attention (separate matrices for content and position) and an enhanced mask decoder; DeBERTa-V3 with replaced-token-detection pretraining is still a default strong baseline for classification and NLU benchmarks.

Decoder-only models read left-to-right with causal masking and predict the next token. They are what you use when the output is text. The GPT lineage (Radford et al, 2018 onward; GPT-1 was the original decoder-only pretraining demonstration) became the dominant commercial stack after GPT-3 (Brown et al, 2020, arxiv.org/abs/2005.14165) showed that scale plus in-context learning ate most task-specific finetuning. Llama (Touvron et al, 2023, arxiv.org/abs/2302.13971) made a clean open-weights decoder-only stack with rotary position embeddings, SwiGLU activations, and RMSNorm available to the research community, and the Llama recipe became the de-facto template for almost every open model that followed, including Mistral 7B (Jiang et al, 2023, arxiv.org/abs/2310.06825), which added sliding-window attention and grouped-query attention.

Encoder-decoder models keep both stacks: a bidirectional encoder that reads the input and a causal decoder that writes the output. T5 (Raffel et al, 2019, arxiv.org/abs/1910.10683) reframed every NLP task as text-to-text and trained a single encoder-decoder model across all of them, which is why T5 and its descendants are still the default for translation, summarization, and structured seq2seq. BART (Lewis et al, 2019, arxiv.org/abs/1910.13461) used a denoising autoencoder objective and remains a common starting point for summarization fine-tunes.

The practical rule of thumb: if the output is a vector or label, reach for encoder-only. If the output is free text, reach for decoder-only. If the output is conditioned on a long structured input and you want a clean encoder representation, encoder-decoder is still a defensible choice.
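The difference between the bidirectional and causal stacks comes down to which positions each token is allowed to attend to. The sketch below is a toy illustration in plain Python, not code from any of the cited papers; the helper names `attention_mask` and `show` are made up for this example.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; True means query i may attend to key j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def show(mask):
    """Render the mask as text: 'x' = allowed, '.' = masked out."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


encoder_mask = attention_mask(4, causal=False)  # bidirectional, BERT-style
decoder_mask = attention_mask(4, causal=True)   # causal, GPT-style

print(show(encoder_mask))
print()
print(show(decoder_mask))
```

An encoder-decoder model uses both: the first grid inside the encoder, the second inside the decoder, plus cross-attention from decoder positions to every encoder position.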

Classical stacks — index

| Model | Stack | Year | Key change | Primary paper |
|---|---|---|---|---|
| BERT | Encoder-only | 2018 | Masked-LM + next-sentence prediction | arxiv.org/abs/1810.04805 |
| RoBERTa | Encoder-only | 2019 | BERT trained longer, no NSP, more data | arxiv.org/abs/1907.11692 |
| DeBERTa | Encoder-only | 2020 | Disentangled content/position attention | arxiv.org/abs/2006.03654 |
| GPT-3 | Decoder-only | 2020 | Scale + few-shot in-context learning | arxiv.org/abs/2005.14165 |
| Llama | Decoder-only | 2023 | Open-weights RoPE + SwiGLU + RMSNorm baseline | arxiv.org/abs/2302.13971 |
| Mistral 7B | Decoder-only | 2023 | Sliding-window + grouped-query attention | arxiv.org/abs/2310.06825 |
| T5 | Encoder-decoder | 2019 | All NLP cast as text-to-text on one model | arxiv.org/abs/1910.10683 |
| BART | Encoder-decoder | 2019 | Denoising autoencoder pretraining | arxiv.org/abs/1910.13461 |

Efficient attention: the O(n²) problem and the responses to it

Self-attention as defined in the 2017 paper has quadratic time and memory in sequence length. For a 512-token sequence this is cheap. For a 100,000-token document it is not. The years 2019 to 2022 produced a wave of architectural variants whose goal was to break or reduce that quadratic.

Longformer (Beltagy, Peters, Cohan, 2020, arxiv.org/abs/2004.05150) replaced full attention with sliding-window local attention plus a small number of designated global-attention tokens. Linear in sequence length, no other tricks. Shipped in the open-source Hugging Face ecosystem and used as the encoder backbone for retrieval and long-document classification through about 2023.

Reformer (Kitaev, Kaiser, Levskaya, 2020, arxiv.org/abs/2001.04451) used locality-sensitive hashing (LSH) attention plus reversible residual layers to push memory down dramatically. Elegant but operationally tricky: the LSH bucketing introduces variance, and the reversible-layers idea never became standard practice in production stacks.

Linformer (Wang et al, 2020, arxiv.org/abs/2006.04768) projected the keys and values to a lower-rank approximation, making attention linear in sequence length under the assumption that the attention matrix is low-rank in practice. Influential as an idea, but the low-rank assumption breaks for tasks that need fine-grained retrieval across the full sequence.

Hyena (Poli et al, 2023, arxiv.org/abs/2302.10866) is the bridge architecture between this generation and the state-space generation that followed. It replaced attention with implicitly-parametrized long convolutions and data-controlled gating. Subquadratic, competitive with attention at small-to-medium scale, and part of the same line of subquadratic sequence operators that led to Mamba.

The meta-lesson from this generation is that most efficient-attention papers showed scaling-curve advantages on synthetic long-context benchmarks but did not displace standard attention in production. The reason is mundane: FlashAttention (Dao et al, 2022, arxiv.org/abs/2205.14135) made exact softmax attention IO-efficient enough on modern GPUs that the constant-factor advantage of approximate methods evaporated below ~16K tokens. Above ~16K tokens, most new architectural work moved past approximate-attention transformers and onto state-space models and attention/state-space hybrids.
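To make the quadratic concrete, the following back-of-the-envelope sketch counts query-key pairs for full attention versus a sliding window, and the size of a single materialized score matrix. It is plain arithmetic under stated assumptions: 2 bytes per score (fp16), one head, one layer, and a hypothetical window radius of 256. The function names are illustrative and do not come from any library.

```python
def full_pairs(n):
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def local_pairs(n, radius):
    """Pairs scored when each token only sees neighbours within `radius` positions."""
    return sum(min(n - 1, i + radius) - max(0, i - radius) + 1 for i in range(n))


def score_matrix_bytes(n, bytes_per_entry=2):
    """Memory for one n x n score matrix if it were fully materialized."""
    return n * n * bytes_per_entry


for n in (512, 16_000, 100_000):
    gigabytes = score_matrix_bytes(n) / 1e9
    print(n, full_pairs(n), local_pairs(n, radius=256), f"{gigabytes:.2f} GB")
```

Under these assumptions a 100,000-token sequence needs 20 GB for one score matrix. FlashAttention computes the same exact result without ever storing that matrix, which is why it removed much of the practical pressure to approximate.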

State-space and recurrent alternatives

These are not transformers in the strict 2017 sense — they replace self-attention with a different sequence operator. They earn a place on this map because they have begun reaching production for long-context and on-device workloads as of 2025-2026 (best-effort summary; check provider model cards for current production status).

Mamba

Gu & Dao, 2023 — arxiv.org/abs/2312.00752

Selective state-space model (S6). Linear-time inference, constant memory in sequence length, competitive perplexity with same-size transformers on language. The selectivity (input-dependent state-space parameters) is the key change versus earlier S4/S5 models. Strongest commercial relevance to date has been in long-context, on-device, and DNA/genomics workloads.
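A toy version of the selectivity idea can be sketched as a scalar recurrence whose step size, input gate, and output gate all depend on the current input. This is a deliberately simplified illustration, not the parametrization from the Mamba paper: the state is a single number, the weights `w_delta`, `w_b`, and `w_c` are hypothetical scalars, and the discretization is a rough stand-in.

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Run a scalar, input-dependent linear recurrence over xs (toy sketch)."""
    h = 0.0                      # the entire memory: one number, whatever len(xs) is
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)   # input-dependent step size
        a = math.exp(-delta)            # decay in (0, 1); larger delta forgets more
        b = w_b * x                     # input-dependent write gate
        c = w_c * x                     # input-dependent read gate
        h = a * h + delta * b * x       # update the state in O(1) time and memory
        ys.append(c * h)
    return ys


print(selective_scan([1.0, -2.0, 0.5]))
```

The point of the sketch is the shape of the computation: one pass over the sequence, a fixed-size state, and parameters recomputed from each input rather than held constant as in earlier S4-style models.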

Mamba-2

Dao & Gu, 2024 — arxiv.org/abs/2405.21060

Structured state-space duality (SSD). Establishes a formal equivalence between a class of state-space models and masked attention with a 1-semiseparable mask, and uses that to make Mamba 2-8x faster on hardware. Reported crossover with FlashAttention-2 at ~2K tokens and ~6x faster at 16K. Influential as a unification result, not just an architecture.
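The duality can be checked numerically in the simplest possible case, a scalar state. The sketch below is an illustration under that assumption, not the SSD algorithm itself: the recurrent form and the lower-triangular matrix form give the same outputs, with `c` playing the role of queries, `b` of keys, `x` of values, and the running products of `a` forming the 1-semiseparable mask.

```python
def recurrent_form(a, b, c, x):
    """Linear-time form: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def matrix_form(a, b, c, x):
    """Quadratic form: y_i = sum over j <= i of c_i * (a_{j+1} * ... * a_i) * b_j * x_j."""
    ys = []
    for i in range(len(x)):
        total = 0.0
        for j in range(i + 1):
            decay = 1.0
            for k in range(j + 1, i + 1):
                decay *= a[k]           # mask entry L[i][j], a product of decays
            total += c[i] * decay * b[j] * x[j]
        ys.append(total)
    return ys


a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, -0.5, 2.0, 0.3]
c = [0.2, 1.5, -1.0, 0.4]
x = [1.0, 2.0, -1.0, 3.0]

left = recurrent_form(a, b, c, x)
right = matrix_form(a, b, c, x)
print(all(abs(u - v) < 1e-12 for u, v in zip(left, right)))
```

The matrix form is what lets the computation be expressed as hardware-friendly matrix multiplications in blocks, while the recurrent form is what keeps inference linear in sequence length.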

Hyena

Poli et al, 2023 — arxiv.org/abs/2302.10866

Implicitly-parametrized long convolutions plus data-controlled gating, as a subquadratic drop-in for attention. Direct architectural ancestor of Mamba in spirit. Notable commercial deployment is in protein and DNA language modeling, where the long-context advantage compounds.

RWKV

Peng et al, 2023 — arxiv.org/abs/2305.13048

Linear-attention RNN-transformer hybrid that runs as a parallel transformer at training time and as an RNN at inference time. Now at the RWKV-7 generation (as of 2025, best-effort). Strong community and on-device focus; commercial enterprise deployment is more limited than Mamba.

RetNet

Sun et al, 2023 — arxiv.org/abs/2307.08621

Retention mechanism that aims for training parallelism, low-cost inference, and good language-modeling performance simultaneously. From Microsoft Research. Active research interest, modest commercial deployment as of mid-2026 (best-effort).

Mixture-of-experts: scale by sparsity instead of by density

The mixture-of-experts pattern is older than transformers: Shazeer et al published the sparsely-gated MoE layer in 2017 (arxiv.org/abs/1701.06538). What changed in the 2020s is that MoE moved from a research curiosity to the dominant pattern at frontier scale.

Switch Transformer (Fedus, Zoph, Shazeer, 2021, arxiv.org/abs/2101.03961) simplified MoE routing to send each token to exactly one expert (top-1 routing), and showed that the resulting sparse models could be trained stably in bfloat16 with up to 7x pre-training speedup over a dense T5 baseline at matched compute. This made MoE practically usable.

GLaM (Du et al, 2021, arxiv.org/abs/2112.06905) from Google scaled the pattern to a 1.2T-parameter MoE that activated about 8% of parameters per token, matching GPT-3 quality at roughly one-third the training energy. This established the central MoE economic argument: total parameters control capacity, active parameters control inference cost, and the gap between them is the lever.

Mixtral 8x7B (Jiang et al, 2024, arxiv.org/abs/2401.04088) made the pattern open. Eight expert MLPs per layer with top-2 routing: 47B total parameters, ~13B active per token, 32K context. Shipped with open weights under Apache 2.0 by Mistral. Mixtral became the canonical reference MoE for the open-source community.

DeepSeek-V3 (DeepSeek-AI, 2024, arxiv.org/abs/2412.19437) set the open-weights frontier of the pattern when it was released in late 2024. 671B total parameters, 37B active per token, trained on 14.8T tokens, with multi-head latent attention (MLA) and DeepSeekMoE routing as the two notable architectural refinements. Its release reset the conversation about how cheaply a frontier-quality model can be trained.

The practical takeaway: at frontier scale, dense models are now the exception rather than the default. The cost economics of MoE are decisive once you can amortize the routing complexity, and the field's tooling (vLLM, TensorRT-LLM, SGLang) caught up to MoE serving in 2024.
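The routing step and the total-versus-active arithmetic can be sketched in a few lines. This is a minimal single-token illustration, not any production router: the experts are stand-in scalar functions, the router logits are made-up numbers, and the gate applies a softmax over the selected logits only, which is how the Mixtral paper describes its gate (Switch Transformer computes its gate value differently).

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - peak) for e in top}
    norm = sum(exps.values())
    return {e: exps[e] / norm for e in top}


def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; the rest are never run."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[e](x) for e, w in weights.items())


def active_fraction(active_params, total_params):
    """Share of parameters touched per token."""
    return active_params / total_params


# Hypothetical toy experts: expert s just multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.3, 0.0, 1.0, -0.5]

print(top_k_route(logits, k=2))
print(moe_layer(1.0, experts, logits, k=2))
print(round(active_fraction(13, 47), 3))    # Mixtral 8x7B, figures in billions
print(round(active_fraction(37, 671), 3))   # DeepSeek-V3, figures in billions
```

With the figures quoted above, Mixtral runs roughly 28% of its parameters per token and DeepSeek-V3 roughly 5.5%, which is the gap between capacity and inference cost that the MoE argument rests on.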

MoE — index

| Model | Year | Total params | Active params | Routing | Paper |
|---|---|---|---|---|---|
| Switch Transformer | 2021 | Up to 1.6T (in paper) | One expert per token | Top-1 | arxiv.org/abs/2101.03961 |
| GLaM | 2021 | 1.2T | ~97B (~8%) | Top-2 | arxiv.org/abs/2112.06905 |
| Mixtral 8x7B | 2024 | 47B | 13B | Top-2 of 8 experts | arxiv.org/abs/2401.04088 |
| DeepSeek-V3 | 2024 | 671B | 37B | DeepSeekMoE (paired with MLA attention) | arxiv.org/abs/2412.19437 |

Multimodal fusion: four ways to bolt vision onto a language model

By 2026 most production LLMs accept images, and many accept audio and video. The architectural design space for how the vision tokens get into the language model is narrower than most marketing copy implies. There are roughly four patterns in active use.

Flamingo (Alayrac et al, DeepMind, 2022, arxiv.org/abs/2204.14198) inserts new gated cross-attention layers between frozen layers of a pretrained LLM. The vision encoder is also frozen. Only the new cross-attention layers and a Perceiver resampler are trained. This pattern is parameter-efficient and keeps the language quality of the base model intact. It is still the cleanest design for adding vision to a strong language backbone you do not want to disturb.

BLIP-2 (Li et al, Salesforce, 2023, arxiv.org/abs/2301.12597) uses a Q-Former, a small transformer with learnable query tokens, to compress vision features into a fixed number of soft tokens that get prepended to the language model's input. Cheap to train, modular, and the Q-Former design influenced many subsequent VLMs.

LLaVA (Liu et al, 2023, arxiv.org/abs/2304.08485) is the simplest pattern of all: a lightweight projection (a single linear layer in the original paper, a small MLP in LLaVA-1.5) that maps CLIP visual features into the language model's embedding space, with visual tokens concatenated to text tokens as a flat input sequence. Trained on synthetic GPT-4-generated instruction data. The simplicity is the point; LLaVA-style projection became the dominant pattern for open-source VLMs.

CogVLM (Wang et al, 2023, arxiv.org/abs/2311.03079) introduced visual expert modules: separate QKV matrices and FFN inside each transformer block, activated only when processing image tokens. The base language model is preserved, and visual processing happens through dedicated parallel weights. More expensive than LLaVA-style projection but stronger on visual grounding benchmarks.

As of June 2026 (best-effort), commercial frontier multimodal models (the Claude, GPT, and Gemini families) do not publish their fusion architecture in full. Public behavior is consistent with deep-fusion approaches closer to CogVLM / Flamingo than to LLaVA-style projection, but treat this as inference, not fact. Check provider docs for any architectural details a vendor confirms.
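The projection-and-concatenate pattern is simple enough to sketch with plain lists. The example below is illustrative only: it uses one linear layer, tiny made-up dimensions (3-dimensional vision features, 2-dimensional model embeddings), and hypothetical weights, where a real system would use the vision encoder's patch features and trained parameters.

```python
def project(features, weight, bias):
    """Map each vision feature vector into the LLM embedding space: f @ W + b."""
    return [[sum(f[i] * weight[i][j] for i in range(len(f))) + bias[j]
             for j in range(len(bias))]
            for f in features]


def build_input(vision_features, text_embeddings, weight, bias):
    """Projected visual tokens first, then text tokens, as one flat sequence."""
    return project(vision_features, weight, bias) + text_embeddings


# Hypothetical toy values: 2 image patches, 1 text token.
vision_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # shape (d_vision=3, d_model=2)
bias = [0.5, 0.5]
text_embeddings = [[9.0, 9.0]]

sequence = build_input(vision_features, text_embeddings, weight, bias)
print(len(sequence), sequence)
```

From the language model's point of view the result is just a longer sequence of embeddings. That is the appeal of this pattern and also its cost, since every image patch consumes context length.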

Multimodal — index

| Model | Year | Fusion pattern | Ships from | Primary paper |
|---|---|---|---|---|
| Flamingo | 2022 | Gated cross-attention between frozen LLM layers + Perceiver resampler | DeepMind (research) | arxiv.org/abs/2204.14198 |
| BLIP-2 | 2023 | Q-Former bridges frozen vision encoder and frozen LLM | Salesforce | arxiv.org/abs/2301.12597 |
| LLaVA | 2023 | Single MLP projection of CLIP features, concatenated to text | Open-source academic | arxiv.org/abs/2304.08485 |
| CogVLM | 2023 | Visual expert QKV + FFN inside each transformer block | Tsinghua / Zhipu AI | arxiv.org/abs/2311.03079 |

The arc of the field, condensed

  1. 2017

    Attention Is All You Need

    Vaswani et al introduce the encoder-decoder transformer for machine translation. The architecture is published as a Google Research paper at NeurIPS 2017.

  2. 2018

    BERT

    Devlin et al strip out the decoder and pretrain a bidirectional encoder with masked-language-modeling. NLU benchmarks fall over.

  3. 2019

    T5 and GPT-2

    Google reframes all NLP as text-to-text with the T5 encoder-decoder. OpenAI scales the decoder-only GPT stack. The decoder-only path begins to pull ahead for generation.

  4. 2020

    GPT-3 and the efficient-attention wave

    GPT-3 establishes that scale plus in-context learning eats most fine-tuning. Longformer, Reformer, Linformer try to break the O(n²) wall.

  5. 2021

    Switch Transformer makes MoE practical

    Fedus, Zoph, Shazeer simplify routing to top-1 and stabilize training in bfloat16. Sparse models become a credible engineering target.

  6. 2022

    FlashAttention and Flamingo

    FlashAttention makes exact softmax attention IO-efficient, neutralizing most efficient-attention approximations below ~16K tokens. Flamingo establishes the cross-attention-into-frozen-LLM pattern for vision-language.

  7. 2023

    Llama, Mistral, and Mamba

    Llama and Mistral give the open community a clean decoder-only template (RoPE, SwiGLU, RMSNorm, GQA, sliding-window). Mamba arrives as the first genuinely competitive state-space alternative to attention.

  8. 2024

    Mixtral, Mamba-2, DeepSeek-V3

    Mixtral opens up the MoE pattern. Mamba-2 unifies state-space and attention through structured state-space duality. DeepSeek-V3 ships a 671B / 37B-active open-weights MoE that resets cost expectations at the frontier.

  9. 2025-2026

    Convergence onto MoE plus long context

    Frontier commercial models settle into the pattern of large MoE decoders with hardware-friendly attention and increasingly hybrid (attention + state-space) blocks for very long context. Architecture stops being the headline variable; data, post-training, and inference stack take over.

What this map does not say

The architecture is rarely the bottleneck. Most measurable quality differences between frontier models in 2026 come from training data composition, post-training (RLHF, DPO, GRPO, constitutional methods), inference-time compute, and tool-use scaffolding, not from the transformer variant under the hood. Two models with identical architecture can differ by 20+ percentage points on benchmarks depending on what they were trained on and how they were tuned. Conversely, models with very different architectures (a dense Llama-3-style decoder, a Mixtral-style MoE, and a hybrid Mamba-attention stack at the same active-parameter budget) can be remarkably close in user-visible quality if their data and post-training pipelines are comparable.

If you are picking a model to ship a product on, the architecture is approximately the least important variable. Pick on benchmarks for your task, latency at your batch size, cost at your token volume, and the provider's track record on stability and policy. Pick the architecture later or, more honestly, never; let the provider pick it for you.

How to use this page

If you are orienting yourself in the transformer literature, here is the minimum-effective-dose reading order. Skip anything you already know. Stop the moment you have enough.

  • Read the 2017 paper once. Vaswani et al, arxiv.org/abs/1706.03762. Everything else is a delta against this.
  • Read BERT (arxiv.org/abs/1810.04805) and GPT-3 (arxiv.org/abs/2005.14165) for the encoder-only and decoder-only forks.
  • Read FlashAttention (arxiv.org/abs/2205.14135) — not an architecture, but the reason most efficient-attention papers stopped mattering in practice.
  • Read Llama (arxiv.org/abs/2302.13971) for the de-facto modern decoder template.
  • Read Mixtral (arxiv.org/abs/2401.04088) for the canonical small-open MoE.
  • Read Mamba and Mamba-2 (arxiv.org/abs/2312.00752, arxiv.org/abs/2405.21060) for the strongest non-transformer sequence model line as of 2026.
  • Read DeepSeek-V3 (arxiv.org/abs/2412.19437) for what a current frontier-scale MoE actually looks like in detail.
  • Stop. The rest is gradient. Spend the time saved on data and evals instead.

Sources

[01] Vaswani et al, Attention Is All You Need, 2017. The original encoder-decoder transformer for machine translation. arxiv.org/abs/1706.03762

[02] Devlin et al, BERT, 2018. Bidirectional encoder pretraining via masked language modeling and next-sentence prediction. arxiv.org/abs/1810.04805

[03] Liu et al, RoBERTa, 2019. BERT trained longer with more data and without next-sentence prediction substantially improves benchmarks. arxiv.org/abs/1907.11692

[04] He et al, DeBERTa, 2020. Disentangled attention with separate content and position matrices, plus enhanced mask decoder. arxiv.org/abs/2006.03654

[05] Brown et al, GPT-3, 2020. Scale and in-context few-shot learning replace most task-specific fine-tuning. arxiv.org/abs/2005.14165

[06] Touvron et al, Llama, 2023. Open-weights decoder-only transformer with RoPE, SwiGLU, and RMSNorm that became the field template. arxiv.org/abs/2302.13971

[07] Jiang et al, Mistral 7B, 2023. Sliding-window attention and grouped-query attention on a Llama-derived decoder. arxiv.org/abs/2310.06825

[08] Raffel et al, T5, 2019. Text-to-text reframing of all NLP tasks on a single encoder-decoder model. arxiv.org/abs/1910.10683

[09] Lewis et al, BART, 2019. Denoising autoencoder encoder-decoder for generation and summarization. arxiv.org/abs/1910.13461

[10] Beltagy, Peters, Cohan, Longformer, 2020. Sliding-window plus global attention for long documents at linear cost. arxiv.org/abs/2004.05150

[11] Kitaev, Kaiser, Levskaya, Reformer, 2020. LSH-based attention plus reversible residual layers for memory efficiency. arxiv.org/abs/2001.04451

[12] Wang et al, Linformer, 2020. Low-rank projection of keys and values for linear-complexity self-attention. arxiv.org/abs/2006.04768

[13] Poli et al, Hyena Hierarchy, 2023. Long convolutions plus data-controlled gating as a subquadratic drop-in for attention. arxiv.org/abs/2302.10866

[14] Dao et al, FlashAttention, 2022. Exact IO-efficient attention that neutralized most efficient-attention approximations in production. arxiv.org/abs/2205.14135

[15] Gu and Dao, Mamba, 2023. Selective state-space model (S6) with input-dependent parameters competitive with transformers on language modeling. arxiv.org/abs/2312.00752

[16] Dao and Gu, Mamba-2, 2024. Structured state-space duality with 2-8x speedup over Mamba; crossover with FlashAttention-2 around 2K tokens. arxiv.org/abs/2405.21060

[17] Peng et al, RWKV, 2023. Linear-attention RNN-transformer hybrid that trains in parallel and infers as an RNN. arxiv.org/abs/2305.13048

[18] Sun et al, RetNet (Retentive Network), 2023. Retention mechanism designed for parallel training and recurrent inference. arxiv.org/abs/2307.08621

[19] Fedus, Zoph, Shazeer, Switch Transformer, 2021. Top-1 expert routing with stable bfloat16 training and up to 7x pre-training speedup. arxiv.org/abs/2101.03961

[20] Du et al, GLaM, 2021. 1.2T-parameter MoE matching GPT-3 quality at substantially lower training energy. arxiv.org/abs/2112.06905

[21] Jiang et al, Mixtral of Experts, 2024. 47B-total / 13B-active Apache-2.0 MoE with top-2 routing across 8 experts per layer, 32K context. arxiv.org/abs/2401.04088

[22] DeepSeek-AI, DeepSeek-V3 Technical Report, December 2024. 671B total / 37B active MoE with multi-head latent attention (MLA), trained on 14.8T tokens. arxiv.org/abs/2412.19437

[23] Alayrac et al, Flamingo, 2022. Gated cross-attention layers inserted between frozen LLM layers plus a Perceiver resampler for vision-language fusion. arxiv.org/abs/2204.14198

[24] Li et al, BLIP-2, 2023. Q-Former with learnable query tokens bridges a frozen vision encoder to a frozen LLM. arxiv.org/abs/2301.12597

[25] Liu et al, LLaVA, 2023. Single MLP projection of CLIP features into the LLM embedding space; trained on GPT-4-synthesized instruction data. arxiv.org/abs/2304.08485

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0