Transformer variants: a field map

Every major architectural lineage since "Attention Is All You Need" — what changed, what it solved, what's still live

The transformer is not one thing. It is a family. The 2017 Vaswani et al paper "Attention Is All You Need" (arxiv.org/abs/1706.03762) introduced a specific encoder-decoder stack with scaled dot-product self-attention and fixed sinusoidal position encodings, built for machine translation. Almost nothing shipped at scale today is exactly that architecture. The lineage forked early, forked often, and the variants now diverge enough that grouping them as "transformers" hides more than it reveals.

This page is a field map. For each major branch we name the architectural change, link the primary paper, say what it actually solved, name who shipped it commercially, and note whether it is still live in production as of June 2026 (best-effort, not exhaustive; provider docs are the source of truth for current model lineups).

We avoid the framing that any one branch "won." Encoder-only models still dominate retrieval. Decoder-only models dominate chat. Encoder-decoder is still the default for summarization and seq2seq. State-space models are reaching production for very long context. Mixture-of-experts is now standard at frontier scale. Multimodal fusion is its own design space.

Treat this as orientation, not endorsement. Architectures matter less than data, training compute, post-training, and inference stack, a fact the field has been slow to absorb. We mark where the architectural claim is genuine ("this changed what was possible") versus where it is mostly branding ("this is a slight reshuffle of an existing pattern"). When we are not certain of something, we say so.

The three classical stacks

The original 2017 paper described an encoder-decoder transformer for translation. Within two years the field had separated this into three distinct stacks that are still the textbook taxonomy.

Encoder-only models read a full sequence bidirectionally and emit per-token or pooled representations. They are what you use when the output is a class, a span, an embedding, or a score, not free-form text. BERT (Devlin et al, 2018, arxiv.org/abs/1810.04805) launched this branch by training on masked-token prediction plus next-sentence prediction, and is still the genealogical root of most embedding and reranker models in production today. RoBERTa (Liu et al, 2019, arxiv.org/abs/1907.11692) showed that BERT was significantly undertrained: same architecture, more data and longer training, much better numbers. DeBERTa (He et al, 2020, arxiv.org/abs/2006.03654) added disentangled attention (separate matrices for content and position) and an enhanced mask decoder; DeBERTa-V3 with replaced-token-detection pretraining is still a default strong baseline for classification and NLU benchmarks.

Decoder-only models read left-to-right with causal masking and predict the next token. They are what you use when the output is text. The GPT lineage (Radford et al, 2018 onward; GPT-1 was the original decoder-only pretraining demonstration) became the dominant commercial stack after GPT-3 (Brown et al, 2020, arxiv.org/abs/2005.14165) showed that scale plus in-context learning ate most task-specific fine-tuning. Llama (Touvron et al, 2023, arxiv.org/abs/2302.13971) made a clean open-weights decoder-only stack with rotary position embeddings, SwiGLU activations, and RMSNorm available to the research community, and the Llama recipe became the de-facto template for almost every open model that followed, including Mistral 7B (Jiang et al, 2023, arxiv.org/abs/2310.06825), which added sliding-window attention and grouped-query attention.

Encoder-decoder models keep both stacks: a bidirectional encoder that reads the input and a causal decoder that writes the output. T5 (Raffel et al, 2019, arxiv.org/abs/1910.10683) reframed every NLP task as text-to-text and trained a single encoder-decoder model across all of them, which is why T5 and its descendants are still the default for translation, summarization, and structured seq2seq. BART (Lewis et al, 2019, arxiv.org/abs/1910.13461) used a denoising autoencoder objective and remains a common starting point for summarization fine-tunes.

The practical rule of thumb: if the output is a vector or label, reach for encoder-only. If the output is free text, reach for decoder-only. If the output is conditioned on a long structured input and you want a clean encoder representation, encoder-decoder is still a defensible choice.
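
At the level of a single attention head, the difference between the bidirectional and causal stacks is only which positions each token is allowed to see. The following is a minimal sketch in plain Python, assuming a single head with no learned projections; the function names `softmax` and `attention_weights` are illustrative and not taken from any of the papers above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal):
    """Return one head's attention matrix (each row sums to 1).

    causal=False: every position attends to every position (encoder-style).
    causal=True:  position i attends only to positions j <= i (decoder-style).
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # future token: masked out
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))  # scaled dot product
        weights.append(softmax(scores))
    return weights

# Toy 3-token sequence with 2-dimensional vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bidirectional = attention_weights(x, x, causal=False)
causal = attention_weights(x, x, causal=True)
print(causal[0])  # the first token can only attend to itself
```

With the causal flag set, the first row puts all of its weight on position 0, while the bidirectional version spreads weight over all three positions. An encoder-decoder model uses both kinds of mask, one per stack.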

Classical stacks — index

Model	Stack	Year	Key change	Primary paper
BERT	Encoder-only	2018	Masked-LM + next-sentence prediction	arxiv.org/abs/1810.04805
RoBERTa	Encoder-only	2019	BERT trained longer, no NSP, more data	arxiv.org/abs/1907.11692
DeBERTa	Encoder-only	2020	Disentangled content/position attention	arxiv.org/abs/2006.03654
GPT-3	Decoder-only	2020	Scale + few-shot in-context learning	arxiv.org/abs/2005.14165
Llama	Decoder-only	2023	Open-weights RoPE + SwiGLU + RMSNorm baseline	arxiv.org/abs/2302.13971
Mistral 7B	Decoder-only	2023	Sliding-window + grouped-query attention	arxiv.org/abs/2310.06825
T5	Encoder-decoder	2019	All NLP cast as text-to-text on one model	arxiv.org/abs/1910.10683
BART	Encoder-decoder	2019	Denoising autoencoder pretraining	arxiv.org/abs/1910.13461

Efficient attention: the O(n²) problem and the responses to it

Self-attention as defined in the 2017 paper has quadratic time and memory in sequence length. For a 512-token sequence this is cheap. For a 100,000-token document it is not. The years 2019 to 2022 produced a wave of architectural variants whose goal was to break or reduce that quadratic cost.

Longformer (Beltagy, Peters, Cohan, 2020, arxiv.org/abs/2004.05150) replaced full attention with sliding-window local attention plus a small number of designated global-attention tokens. Linear in sequence length, no other tricks. It shipped in the open-source Hugging Face ecosystem and was used as the encoder backbone for retrieval and long-document classification through about 2023.

Reformer (Kitaev, Kaiser, Levskaya, 2020, arxiv.org/abs/2001.04451) used locality-sensitive hashing (LSH) attention plus reversible residual layers to push memory down dramatically. Elegant but operationally tricky: the LSH bucketing introduces variance and the reversible-layers idea never became standard practice in production stacks.

Linformer (Wang et al, 2020, arxiv.org/abs/2006.04768) projected the keys and values to a lower-rank approximation, making attention linear in sequence length under the assumption that the attention matrix is low-rank in practice. Influential as an idea, but the low-rank assumption breaks for tasks that need fine-grained retrieval across the full sequence.

Hyena (Poli et al, 2023, arxiv.org/abs/2302.10866) is the bridge architecture between this generation and the state-space generation that followed: it replaced attention with implicitly-parametrized long convolutions and data-controlled gating. Subquadratic, competitive with attention at small-to-medium scale, and it directly influenced the design of Mamba.

The meta-lesson from this generation is that most efficient-attention papers showed scaling-curve advantages on synthetic long-context benchmarks but did not displace standard attention in production. The reason is mundane: FlashAttention (Dao et al, 2022, arxiv.org/abs/2205.14135) made exact softmax attention IO-efficient enough on modern GPUs that the constant-factor advantage of approximate methods evaporated below ~16K tokens. Above ~16K tokens, the field largely jumped past efficient-attention transformers entirely and onto state-space models.
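
To put rough numbers on the quadratic cost, the sketch below counts attention-score entries for full attention and for a sliding window. It is a back-of-the-envelope illustration: the window width of 512 is an assumed value, and the count ignores global tokens, multiple heads, and multiple layers.

```python
def full_attention_entries(n):
    """Full self-attention scores every (query, key) pair: n * n entries."""
    return n * n

def sliding_window_entries(n, window):
    """Each token scores itself plus up to window // 2 neighbors per side."""
    half = window // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n - 1, i + half)
        total += hi - lo + 1
    return total

short, long_doc = 512, 100_000
print(full_attention_entries(short))               # 262144
print(full_attention_entries(long_doc))            # 10000000000
print(sliding_window_entries(long_doc, 512))       # grows linearly in n
# Bytes needed to materialize one full score matrix at 2 bytes per score.
print(full_attention_entries(long_doc) * 2)        # 20000000000
```

At 100,000 tokens a single materialized score matrix is about 20 GB at 2 bytes per entry, per head and per layer. Windowing attacks this by never computing most entries; FlashAttention keeps every entry but avoids materializing the matrix in GPU main memory.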

State-space and recurrent alternatives

These are not transformers in the strict 2017 sense — they replace self-attention with a different sequence operator. They earn a place on this map because they have begun reaching production for long-context and on-device workloads as of 2025-2026 (best-effort summary; check provider model cards for current production status).

Mamba

Gu & Dao, 2023 — arxiv.org/abs/2312.00752

Selective state-space model (S6). Linear-time inference, constant memory in sequence length, competitive perplexity with same-size transformers on language. The selectivity (input-dependent state-space parameters) is the key change versus earlier S4/S5 models. Strongest commercial relevance to date has been in long-context, on-device, and DNA/genomics workloads.
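
The two properties named above, a single linear-time pass and a fixed-size state, can be seen in a toy scalar recurrence. This is a heavily simplified sketch and not the Mamba implementation: the scalar weights `w_delta`, `w_b`, and `w_c` are hypothetical stand-ins for learned projections, the input term uses a simplified discretization, and the hardware-aware parallel scan is omitted.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size positive."""
    return math.log1p(math.exp(z))

def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy selective state-space recurrence over a sequence of scalars.

    The step size, input gate, and output gate all depend on the current
    input x_t; that input dependence is the "selective" part.
    """
    h = 0.0                                  # the entire state: one number
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)        # input-dependent step size
        decay = math.exp(delta * a)          # in (0, 1) because a < 0
        b = w_b * x                          # input-dependent input gate
        c = w_c * x                          # input-dependent output gate
        h = decay * h + delta * b * x        # state update
        ys.append(c * h)                     # readout
    return ys

print(selective_scan([1.0, 0.5, -0.5, 2.0]))
```

The loop touches each token once, and `h` stays the same size no matter how long `xs` is. An attention layer, by contrast, must keep keys and values for every earlier token.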

Mamba-2

Dao & Gu, 2024 — arxiv.org/abs/2405.21060

Structured state-space duality (SSD). Establishes a formal equivalence between a class of state-space models and masked attention with a 1-semiseparable mask, and uses that to make Mamba 2-8x faster on hardware. Reported crossover with FlashAttention-2 at ~2K tokens and ~6x faster at 16K. Influential as a unification result, not just an architecture.

Hyena

Poli et al, 2023 — arxiv.org/abs/2302.10866

Implicitly-parametrized long convolutions plus data-controlled gating, as a subquadratic drop-in for attention. Direct architectural ancestor of Mamba in spirit. Notable commercial deployment is in protein and DNA language modeling, where the long-context advantage compounds.

RWKV

Peng et al, 2023 — arxiv.org/abs/2305.13048

Linear-attention RNN-transformer hybrid that runs as a parallel transformer at training time and as an RNN at inference time. Now at the RWKV-7 generation (as of 2025, best-effort). Strong community and on-device focus; commercial enterprise deployment is more limited than Mamba.

RetNet

Sun et al, 2023 — arxiv.org/abs/2307.08621

Retention mechanism that aims for training parallelism, low-cost inference, and good language-modeling performance simultaneously. From Microsoft Research. Active research interest, modest commercial deployment as of mid-2026 (best-effort).
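
The "parallel at training time, recurrent at inference time" property shared by this family can be checked on a stripped-down decayed linear attention. The sketch below assumes a single head with one scalar decay `gamma` and leaves out normalization, position rotation, and gating, so it shows the idea, not the RetNet or RWKV formulas as published.

```python
def retention_parallel(q, k, v, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n-m) * (q[n].k[m]) * v[m]."""
    out = []
    for n in range(len(q)):
        acc = [0.0] * len(v[0])
        for m in range(n + 1):
            score = sum(a * b for a, b in zip(q[n], k[m]))
            w = (gamma ** (n - m)) * score
            for d in range(len(acc)):
                acc[d] += w * v[m][d]
        out.append(acc)
    return out

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_(n-1) + outer(k[n], v[n]); out[n] = q[n] @ S_n."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]   # fixed-size state, independent of length
    out = []
    for qn, kn, vn in zip(q, k, v):
        for i in range(dk):
            for j in range(dv):
                state[i][j] = gamma * state[i][j] + kn[i] * vn[j]
        out.append([sum(qn[i] * state[i][j] for i in range(dk)) for j in range(dv)])
    return out

# Made-up toy inputs: 3 tokens, key/query width 2, value width 1.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0], [2.0], [3.0]]
a = retention_parallel(q, k, v, gamma=0.5)
b = retention_recurrent(q, k, v, gamma=0.5)
print(all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Both functions produce the same outputs up to floating-point rounding. The first does work quadratic in sequence length but parallelizes across positions; the second does constant work per token with a fixed-size state.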

Mixture-of-experts: scale by sparsity instead of by density

The mixture-of-experts pattern is older than transformers. The idea dates to the early 1990s, and Shazeer et al published the sparsely-gated MoE layer in January 2017 (arxiv.org/abs/1701.06538), months before the transformer paper. What changed in the 2020s is that MoE moved from a research curiosity to the dominant pattern at frontier scale.

Switch Transformer (Fedus, Zoph, Shazeer, 2021, arxiv.org/abs/2101.03961) simplified MoE routing to send each token to exactly one expert (top-1 routing), and showed that the resulting sparse models could be trained stably in bfloat16 with up to 7x pre-training speedup over a dense T5 baseline at matched compute. This made MoE practically usable.

GLaM (Du et al, 2021, arxiv.org/abs/2112.06905) from Google scaled the pattern to a 1.2T-parameter MoE that activated about 8% of parameters per token, matching GPT-3 quality at roughly one-third the training energy. This established the central MoE economic argument: total parameters control capacity, active parameters control inference cost, and the gap between them is the lever.

Mixtral 8x7B (Jiang et al, 2024, arxiv.org/abs/2401.04088) made the pattern open. Eight expert MLPs per layer with top-2 routing: 47B total parameters, ~13B active per token, 32K context. Shipped with open weights under Apache 2.0 by Mistral. Mixtral became the canonical reference MoE for the open-source community.

DeepSeek-V3 (DeepSeek-AI, 2024, arxiv.org/abs/2412.19437) is the open-weights frontier of the pattern as of late 2024 / early 2025. 671B total parameters, 37B active per token, trained on 14.8T tokens, with multi-head latent attention (MLA) and DeepSeekMoE routing as the two notable architectural refinements. Its release reset the conversation about how cheaply a frontier-quality model can be trained.

The practical takeaway: at frontier scale, dense models are now the exception rather than the default. The cost economics of MoE are decisive once you can amortize the routing complexity, and the field's tooling (vLLM, TensorRT-LLM, SGLang) caught up to MoE serving in 2024.
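
The routing step itself is small. Below is a minimal sketch of top-k routing for one token, assuming the convention of taking a softmax over only the selected experts' logits (the form described in the Mixtral paper); other systems normalize the gates differently, and real implementations batch tokens and add load-balancing terms. The names `top_k_route` and `moe_layer`, and the toy experts, are illustrative.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(x, router_logits, experts, k):
    """Run only the k selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)                      # the other experts never run
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Eight toy "experts": expert s just scales its input by s.
experts = [lambda x, s=s: [s * xi for xi in x] for s in range(8)]
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]    # made-up router scores
print(top_k_route(logits, 2))                    # experts 3 and 5, equal gates
print(moe_layer([1.0, 2.0], logits, experts, 2)) # [4.0, 8.0]
```

Only two of the eight expert functions are called for this token, which is why per-token compute tracks active parameters and not total parameters.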

MoE — index

Model	Year	Total params	Active params	Routing	Paper
Switch Transformer	2021	Up to 1.6T (in paper)	Not listed here (one expert per token)	Top-1	arxiv.org/abs/2101.03961
GLaM	2021	1.2T	~97B (~8%)	Top-2	arxiv.org/abs/2112.06905
Mixtral 8x7B	2024	47B	13B	Top-2 of 8 experts	arxiv.org/abs/2401.04088
DeepSeek-V3	2024	671B	37B	DeepSeekMoE routing (attention: MLA)	arxiv.org/abs/2412.19437
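
The gap between total and active parameters is easiest to compare as a ratio. A short sketch using only the figures in the table above (in billions); Switch Transformer is left out because no active-parameter count is given for it here.

```python
# (total, active) parameter counts in billions, copied from the table above.
MODELS = {
    "GLaM": (1200.0, 97.0),
    "Mixtral 8x7B": (47.0, 13.0),
    "DeepSeek-V3": (671.0, 37.0),
}

def active_fraction(total_b, active_b):
    """Share of the weights that take part in one token's forward pass."""
    return active_b / total_b

for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

This works out to roughly 8.1% for GLaM, 27.7% for Mixtral 8x7B, and 5.5% for DeepSeek-V3.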

Multimodal fusion: four ways to bolt vision onto a language model

By 2026 most production LLMs accept images, and many accept audio and video. The architectural design space for how the vision tokens get into the language model is narrower than most marketing copy implies. There are roughly four patterns in active use.

Flamingo (Alayrac et al, DeepMind, 2022, arxiv.org/abs/2204.14198) inserts new gated cross-attention layers between frozen layers of a pretrained LLM. The vision encoder is also frozen. Only the new cross-attention layers and a Perceiver resampler are trained. This pattern is parameter-efficient and keeps the language quality of the base model intact. It is still the cleanest design for adding vision to a strong language backbone you do not want to disturb.

BLIP-2 (Li et al, Salesforce, 2023, arxiv.org/abs/2301.12597) uses a Q-Former, a small transformer with learnable query tokens, to compress vision features into a fixed number of soft tokens that get prepended to the language model's input. Cheap to train, modular, and the Q-Former design influenced many subsequent VLMs.

LLaVA (Liu et al, 2023, arxiv.org/abs/2304.08485) is the simplest pattern of all: a single linear projection (upgraded to a small MLP in LLaVA-1.5) that maps CLIP visual features into the language model's embedding space, with visual tokens concatenated to text tokens as a flat input sequence. Trained on synthetic GPT-4-generated instruction data. The simplicity is the point; LLaVA-style projection became the dominant pattern for open-source VLMs.

CogVLM (Wang et al, 2023, arxiv.org/abs/2311.03079) introduced visual expert modules: separate QKV matrices and FFN inside each transformer block, activated only when processing image tokens. The base language model is preserved, and visual processing happens through dedicated parallel weights. More expensive than LLaVA-style projection but stronger on visual grounding benchmarks.

As of June 2026 (best-effort), commercial frontier multimodal models (the Claude, GPT, and Gemini families) do not publish their fusion architecture in full. Public behavior is consistent with deep-fusion approaches closer to CogVLM / Flamingo than to LLaVA-style projection, but treat this as inference, not fact; check provider docs for any architectural details a vendor confirms.
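
The reason a Flamingo-style design can leave the base model undisturbed is the gate on each new cross-attention layer. A minimal sketch of that gated residual, assuming a scalar gate parameter `alpha` passed through tanh and initialized to zero; the function name and the toy vectors are illustrative.

```python
import math

def gated_residual(hidden, cross_attn_out, alpha):
    """Add the vision cross-attention output to the frozen LM's hidden state,
    scaled by tanh(alpha). With alpha = 0 the gate is closed."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [1.0, 2.0]          # hidden state from the frozen language model
vision = [5.0, -3.0]         # made-up output of a new cross-attention layer
print(gated_residual(hidden, vision, alpha=0.0))   # [1.0, 2.0]
print(gated_residual(hidden, vision, alpha=1.0))   # vision signal mixed in
```

At initialization the layer is an exact pass-through, so the model starts out behaving like the original language model and lets vision in only as training moves `alpha` away from zero.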

Multimodal — index

Model	Year	Fusion pattern	Ships from	Primary paper
Flamingo	2022	Gated cross-attention between frozen LLM layers + Perceiver resampler	DeepMind (research)	arxiv.org/abs/2204.14198
BLIP-2	2023	Q-Former bridges frozen vision encoder and frozen LLM	Salesforce	arxiv.org/abs/2301.12597
LLaVA	2023	Single linear projection of CLIP features (MLP in LLaVA-1.5), concatenated to text	Open-source academic	arxiv.org/abs/2304.08485
CogVLM	2023	Visual expert QKV + FFN inside each transformer block	Tsinghua / Zhipu AI	arxiv.org/abs/2311.03079
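
The projection-and-concatenate pattern in the LLaVA row is simple enough to sketch directly. The example below assumes a single linear map from a visual feature width of 3 to a model width of 2; the shapes, weights, and function names are made up for illustration, and a real system would use a tensor library and a trained projection.

```python
def project(features, weight, bias):
    """Map each visual feature vector (width d_vis) to one token embedding
    (width d_model) with a single linear layer: W @ f + b."""
    return [
        [sum(w * f for w, f in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def build_input(visual_features, text_embeddings, weight, bias):
    """Projected visual tokens come first, followed by the text tokens,
    forming one flat sequence for the language model."""
    return project(visual_features, weight, bias) + text_embeddings

weight = [[1.0, 0.0, 0.0],       # d_model x d_vis = 2 x 3
          [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]
patches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # two visual patch features
text = [[9.0, 9.0]]                            # one text token embedding
sequence = build_input(patches, text, weight, bias)
print(len(sequence))   # 3 tokens: 2 visual + 1 text
print(sequence[0])     # [1.5, 4.5]
```

From the language model's point of view the visual tokens are just extra positions in its input, which is why this pattern needs no changes inside the transformer blocks.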

The arc of the field, condensed

2017
Attention Is All You Need
Vaswani et al introduce the encoder-decoder transformer for machine translation. The architecture is published as a Google Research paper at NeurIPS 2017.
2018
BERT
Devlin et al strip out the decoder and pretrain a bidirectional encoder with masked-language-modeling. NLU benchmarks fall over.
2019
T5 and GPT-2
Google reframes all NLP as text-to-text with the T5 encoder-decoder. OpenAI scales the decoder-only GPT stack. The decoder-only path begins to pull ahead for generation.
2020
GPT-3 and the efficient-attention wave
GPT-3 establishes that scale plus in-context learning eats most fine-tuning. Longformer, Reformer, Linformer try to break the O(n²) wall.
2021
Switch Transformer makes MoE practical
Fedus, Zoph, Shazeer simplify routing to top-1 and stabilize training in bfloat16. Sparse models become a credible engineering target.
2022
FlashAttention and Flamingo
FlashAttention makes exact softmax attention IO-efficient, neutralizing most efficient-attention approximations below ~16K tokens. Flamingo establishes the cross-attention-into-frozen-LLM pattern for vision-language.
2023
Llama, Mistral, and Mamba
Llama and Mistral give the open community a clean decoder-only template (RoPE, SwiGLU, RMSNorm, GQA, sliding-window). Mamba arrives as the first genuinely competitive state-space alternative to attention.
2024
Mixtral, Mamba-2, DeepSeek-V3
Mixtral opens up the MoE pattern. Mamba-2 unifies state-space and attention through structured state-space duality. DeepSeek-V3 ships a 671B / 37B-active open-weights MoE that resets cost expectations at the frontier.
2025-2026
Convergence onto MoE plus long context
Frontier commercial models settle into the pattern of large MoE decoders with hardware-friendly attention and increasingly hybrid (attention + state-space) blocks for very long context. Architecture stops being the headline variable; data, post-training, and inference stack take over.

What this map does not say

The architecture is rarely the bottleneck. Most measurable quality differences between frontier models in 2026 come from training data composition, post-training (RLHF, DPO, GRPO, constitutional methods), inference-time compute, and tool-use scaffolding, not from the transformer variant under the hood. Two models with identical architecture can differ by 20+ percentage points on benchmarks depending on what they were trained on and how they were tuned. Models with very different architectures (a dense Llama-3-style decoder, a Mixtral-style MoE, and a hybrid Mamba-attention stack at the same active-parameter budget) can be remarkably close in user-visible quality if their data and post-training pipelines are comparable.

If you are picking a model to ship a product on, the architecture is approximately the least important variable. Pick on benchmarks for your task, latency at your batch size, cost at your token volume, and the provider's track record on stability and policy. Pick the architecture later or, more honestly, never; let the provider pick it for you.

How to use this page

If you are orienting yourself in the transformer literature, here is the minimum-effective-dose reading order. Skip anything you already know. Stop the moment you have enough.

Read the 2017 paper once. Vaswani et al, arxiv.org/abs/1706.03762. Everything else is a delta against this.
Read BERT (arxiv.org/abs/1810.04805) and GPT-3 (arxiv.org/abs/2005.14165) for the encoder-only and decoder-only forks.
Read FlashAttention (arxiv.org/abs/2205.14135) — not an architecture, but the reason most efficient-attention papers stopped mattering in practice.
Read Llama (arxiv.org/abs/2302.13971) for the de-facto modern decoder template.
Read Mixtral (arxiv.org/abs/2401.04088) for the canonical small-open MoE.
Read Mamba and Mamba-2 (arxiv.org/abs/2312.00752, arxiv.org/abs/2405.21060) for the strongest non-transformer sequence model line as of 2026.
Read DeepSeek-V3 (arxiv.org/abs/2412.19437) for what a current frontier-scale MoE actually looks like in detail.
Stop. The rest is gradient. Spend the time saved on data and evals instead.

Sources

[01]
Vaswani et al, Attention Is All You Need, 2017 — the original encoder-decoder transformer for machine translation.
arxiv.org/abs/1706.03762
[02]
Devlin et al, BERT, 2018 — bidirectional encoder pretraining via masked language modeling and next-sentence prediction.
arxiv.org/abs/1810.04805
[03]
Liu et al, RoBERTa, 2019 — BERT trained longer with more data and without next-sentence prediction substantially improves benchmarks.
arxiv.org/abs/1907.11692
[04]
He et al, DeBERTa, 2020 — disentangled attention with separate content and position matrices, plus enhanced mask decoder.
arxiv.org/abs/2006.03654
[05]
Brown et al, GPT-3, 2020 — scale and in-context few-shot learning replace most task-specific fine-tuning.
arxiv.org/abs/2005.14165
[06]
Touvron et al, Llama, 2023 — open-weights decoder-only transformer with RoPE, SwiGLU, and RMSNorm that became the field template.
arxiv.org/abs/2302.13971
[07]
Jiang et al, Mistral 7B, 2023 — sliding-window attention and grouped-query attention on a Llama-derived decoder.
arxiv.org/abs/2310.06825
[08]
Raffel et al, T5, 2019 — text-to-text reframing of all NLP tasks on a single encoder-decoder model.
arxiv.org/abs/1910.10683
[09]
Lewis et al, BART, 2019 — denoising autoencoder encoder-decoder for generation and summarization.
arxiv.org/abs/1910.13461
[10]
Beltagy, Peters, Cohan, Longformer, 2020 — sliding-window plus global attention for long documents at linear cost.
arxiv.org/abs/2004.05150
[11]
Kitaev, Kaiser, Levskaya, Reformer, 2020 — LSH-based attention plus reversible residual layers for memory efficiency.
arxiv.org/abs/2001.04451
[12]
Wang et al, Linformer, 2020 — low-rank projection of keys and values for linear-complexity self-attention.
arxiv.org/abs/2006.04768
[13]
Poli et al, Hyena Hierarchy, 2023 — long convolutions plus data-controlled gating as a subquadratic drop-in for attention.
arxiv.org/abs/2302.10866
[14]
Dao et al, FlashAttention, 2022 — exact IO-efficient attention that neutralized most efficient-attention approximations in production.
arxiv.org/abs/2205.14135
[15]
Gu and Dao, Mamba, 2023 — selective state-space model (S6) with input-dependent parameters competitive with transformers on language modeling.
arxiv.org/abs/2312.00752
[16]
Dao and Gu, Mamba-2, 2024 — structured state-space duality with 2-8x speedup over Mamba; crossover with FlashAttention-2 around 2K tokens.
arxiv.org/abs/2405.21060
[17]
Peng et al, RWKV, 2023 — linear-attention RNN-transformer hybrid that trains in parallel and infers as an RNN.
arxiv.org/abs/2305.13048
[18]
Sun et al, RetNet (Retentive Network), 2023 — retention mechanism designed for parallel training and recurrent inference.
arxiv.org/abs/2307.08621
[19]
Fedus, Zoph, Shazeer, Switch Transformer, 2021 — top-1 expert routing with stable bfloat16 training and up to 7x pre-training speedup.
arxiv.org/abs/2101.03961
[20]
Du et al, GLaM, 2021 — 1.2T-parameter MoE matching GPT-3 quality at substantially lower training energy.
arxiv.org/abs/2112.06905
[21]
Jiang et al, Mixtral of Experts, 2024 — 47B-total / 13B-active Apache-2.0 MoE with top-2 routing across 8 experts per layer, 32K context.
arxiv.org/abs/2401.04088
[22]
DeepSeek-AI, DeepSeek-V3 Technical Report, December 2024 — 671B total / 37B active MoE with multi-head latent attention (MLA), trained on 14.8T tokens.
arxiv.org/abs/2412.19437
[23]
Alayrac et al, Flamingo, 2022 — gated cross-attention layers inserted between frozen LLM layers plus a Perceiver resampler for vision-language fusion.
arxiv.org/abs/2204.14198
[24]
Li et al, BLIP-2, 2023 — Q-Former with learnable query tokens bridges a frozen vision encoder to a frozen LLM.
arxiv.org/abs/2301.12597
[25]
Liu et al, LLaVA, 2023 — single linear projection of CLIP features into the LLM embedding space; trained on GPT-4-synthesized instruction data.
arxiv.org/abs/2304.08485
