
Mixture-of-experts, end to end
A working atlas of the sparse models that decoupled training compute from inference cost.
Vocabulary you need before reading any MoE paper
- Expert — one of the parallel feed-forward sub-networks inside an MoE layer. In dense transformers, the FFN is a single block; in MoE, it is replaced by N experts of similar shape.
- Router (or gate) — a small learned network that, given a token's hidden state, outputs a score for each expert. Usually a linear projection followed by softmax.
- Top-k routing — the routing rule. Top-1 sends each token to exactly one expert (Switch Transformer); top-2 sends it to two (GShard, Mixtral); fine-grained designs use a higher k over many smaller experts (Qwen3 routes each token to 8 of 128).
- Total parameters — the size of the saved checkpoint, dominated by expert weights.
- Active parameters per token — what is actually multiplied during one forward pass: shared layers (attention, embedding, norms) plus the k experts the router chose. This sets serving FLOPs and latency.
- Capacity factor — a per-expert token budget. If too many tokens want the same expert, the overflow gets dropped or routed to a fallback. Capacity = (tokens / num_experts) × capacity_factor.
- Load-balancing loss — an auxiliary training loss that pushes the router to spread tokens evenly across experts, so no expert starves and none gets overworked.
- Router z-loss — a separate auxiliary loss, introduced in ST-MoE, that penalises large router logits to keep softmax numerically stable at scale.
- Expert choice — an alternative routing rule (Zhou et al. 2022) where experts pick their top-k tokens instead of tokens picking their top-k experts. Guarantees perfect load balance by construction.
- Shared expert — an expert that runs for every token, used alongside sparsely-routed experts to capture common knowledge (DeepSeekMoE convention).
Timeline
Jan 2017
Sparsely-gated MoE layer (Shazeer et al.)
arxiv 1701.06538. Up to 137B parameters in an LSTM language model, top-k softmax gating with k between 2 and 4, noisy gating for exploration, importance + load auxiliary losses. The first paper to show that sparse activation could deliver greater than 1000x capacity gains with manageable compute overhead on GPU clusters.
Jan 2021
Switch Transformer (Fedus, Zoph, Shazeer)
arxiv 2101.03961. Simplified routing to top-1 (one expert per token), introduced capacity factor and overflow handling, and scaled to 1.6 trillion parameters on the C4 corpus. Reported a 4x pretraining speedup over a tuned T5-XXL dense baseline at matched compute.
Dec 2021
GLaM (Du et al., Google)
arxiv 2112.06905. 1.2T total parameters, 64 experts per MoE layer, top-2 routing, roughly 97B activated per token. Reported using one-third the training energy of GPT-3 and half its inference FLOPs while beating it on average across 29 NLP tasks zero- and one-shot.
Feb 2022
Expert choice routing (Zhou et al.)
arxiv 2202.09368. Flips the routing direction: each expert picks its top-k tokens instead of each token picking its top-k experts. Guarantees uniform expert load without an auxiliary balance loss, at the cost of giving up the contract that every token gets routed.
Feb 2022
ST-MoE and router z-loss (Zoph et al.)
arxiv 2202.08906. Diagnosed the training-instability mode where router logits grow without bound and the softmax accumulates round-off error in low precision. Added an auxiliary penalty on the squared log-sum-exp of the router logits (the z-loss), which has since been widely adopted in large MoE training stacks.
Dec 2023 / Jan 2024
Mixtral 8x7B (Mistral)
arxiv 2401.04088. Eight experts per layer, top-2 routing, 47B total parameters with about 13B active per token. The first widely-deployed open-weights MoE, distributed under Apache 2.0. Made MoE inference a practical concern for the open-source community.
Jan 2024
DeepSeekMoE (Dai et al.)
arxiv 2401.06066. Introduced fine-grained expert segmentation (splitting each expert's FFN into smaller pieces and increasing N) and shared-expert isolation (a few always-on experts for common knowledge). Established the design template that DeepSeek-V2 and V3 would later scale.
Apr 2024
Mixtral 8x22B (Mistral)
Eight experts, top-2 routing, 141B total parameters with about 39B active. Released under Apache 2.0 with a 64k context window. No separate paper was published; the architecture follows the original Mixtral design (arxiv 2401.04088), and the details are in the Mistral release notes.
Oct 2024
IBM Granite 3.0 MoE
Small enterprise-grade MoE: Granite-3.0-1B-A400M and Granite-3.0-3B-A800M, sized for CPU servers and on-device inference. The 'A400M' / 'A800M' suffix is the active parameter count. Trained on 10T tokens. Released under Apache 2.0.
Dec 2024
DeepSeek-V3 (DeepSeek-AI)
arxiv 2412.19437. 671B total parameters, 37B active per token. Pioneered an auxiliary-loss-free load-balancing strategy (bias-adjusted routing) and a multi-token prediction training objective. Reported 2.788M H800 GPU-hours for full training, a figure the report presents as economical for a model of this scale.
May 2025
Qwen3 MoE (Alibaba Qwen Team)
arxiv 2505.09388. Two MoE checkpoints: Qwen3-235B-A22B (235B total, 22B active) and Qwen3-30B-A3B (30B total, 3B active). 128 experts per layer, top-8 routing, no shared expert. Adopted a global-batch load balancing loss for expert specialisation.
The model index
How routing actually works
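A minimal sketch of the two routing directions described in the vocabulary and timeline: token choice (each token keeps its top-k experts) and expert choice (each expert keeps its highest-scoring tokens). It is plain Python for readability, not a framework implementation. The function names, the renormalisation of the kept gate weights, and the `capacity` argument are illustrative assumptions; real training stacks differ on these details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_choice(router_logits, k):
    """Token choice: one token keeps its k highest-probability experts.

    Returns [(expert_index, gate_weight), ...]. Gate weights are renormalised
    over the kept experts (an assumed convention for this sketch).
    """
    probs = softmax(router_logits)
    kept = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in kept)
    return [(e, probs[e] / norm) for e in kept]


def expert_choice(scores, capacity):
    """Expert choice: each expert keeps its `capacity` highest-scoring tokens.

    scores[t][e] is the router score of token t for expert e.
    Returns (assignments, unrouted): assignments[e] lists the token indices
    expert e will process; unrouted lists tokens that no expert picked.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    assignments = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments.append(ranked[:capacity])
    routed = {t for tokens in assignments for t in tokens}
    unrouted = [t for t in range(num_tokens) if t not in routed]
    return assignments, unrouted


# Toy example: one token, 4 experts, top-2 token choice.
gates = token_choice([2.0, 0.5, 1.0, -1.0], k=2)

# Toy example: 4 tokens, 2 experts, each expert takes 2 tokens.
toy_scores = [[0.9, 0.6], [0.8, 0.7], [0.3, 0.2], [0.4, 0.5]]
assignments, unrouted = expert_choice(toy_scores, capacity=2)
```

In the toy run, expert choice gives every expert exactly two tokens, but tokens 2 and 3 are picked by no expert while tokens 0 and 1 are processed twice. That is the trade the expert-choice paper accepts: uniform load in exchange for dropping the guarantee that every token is routed.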
Training tricks that keep MoE stable
Load-balancing auxiliary loss
arxiv 2101.03961
Introduced in Shazeer 2017, standardised by Switch Transformer. An auxiliary term added to the cross-entropy loss that penalises the router for over- or under-using any expert. For each expert, the loss multiplies the fraction of tokens routed to it by the mean routing probability assigned to it; the products are summed over experts and scaled by N, so a perfectly uniform router scores exactly the coefficient. Switch Transformer uses a coefficient of 10^-2.
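The Switch-style term can be written out directly as alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. The sketch below is plain Python on lists; the function name and input layout are assumptions, and a real implementation would compute this per MoE layer on tensors.

```python
def switch_balance_loss(router_probs, alpha=0.01):
    """Switch-style load-balancing loss for one batch and one MoE layer.

    router_probs[t][e] is the softmax router probability of token t for expert e.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[e]: fraction of tokens dispatched to expert e under top-1 routing.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda e: probs[e])] += 1
    f = [c / num_tokens for c in counts]

    # p[e]: mean router probability assigned to expert e.
    p = [sum(probs[e] for probs in router_probs) / num_tokens
         for e in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = switch_balance_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = switch_balance_loss([[0.9, 0.1], [0.8, 0.2]])
```

The balanced toy batch scores the coefficient itself (0.01); the collapsed batch, where both tokens pick expert 0, scores higher, which is the gradient signal that pushes the router back toward even usage.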
Router z-loss
arxiv 2202.08906
Introduced in ST-MoE (Zoph et al. 2022). Penalises the squared log-sum-exp of the router logits to stop them growing unboundedly and causing round-off error in the softmax under bfloat16. Widely adopted in large MoE training recipes since, though not every vendor documents its auxiliary losses; check the technical report for a given model before assuming it is used.
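As a sketch, the z-loss is the mean over tokens of the squared log-sum-exp of each token's router logits, times a small coefficient. The function names and the default coefficient below are illustrative assumptions, not values taken from any particular training recipe.

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) for a list of floats."""
    peak = max(xs)
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def router_z_loss(router_logits, coeff=1e-3):
    """Mean squared log-sum-exp of router logits, scaled by `coeff`.

    router_logits[t] is the list of pre-softmax router logits for token t.
    `coeff` is an illustrative default.
    """
    per_token = [logsumexp(logits) ** 2 for logits in router_logits]
    return coeff * sum(per_token) / len(per_token)


small = router_z_loss([[0.0, 0.0]])
large = router_z_loss([[10.0, 10.0]])
```

Both toy inputs produce the same softmax (uniform over two experts), yet the second is penalised far more. The loss targets the magnitude of the logits, not the routing decision.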
Capacity factor and overflow
Switch Transformer §2.2
Each expert gets a fixed per-batch token budget. If more tokens want it than the budget allows, the excess is dropped (the residual stream is passed through unchanged) or rerouted. Switch Transformer compared capacity factors of 1.0, 1.25, and 2.0 and found the lower settings worked well. Setting it too low drops too many tokens; setting it too high wastes memory and compute on padded expert slots.
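A small sketch of the budget and the drop rule, using the formula capacity = (tokens / num_experts) × capacity_factor. Rounding up and filling slots in token order are assumptions made for this example; implementations vary on both.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Per-expert token budget (rounded up; the rounding rule is an assumption)."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def dispatch_with_capacity(top1_choices, num_experts, capacity_factor):
    """Assign tokens to their chosen expert until that expert's budget is full.

    top1_choices[t] is the expert index chosen for token t.
    Returns (kept, dropped): kept[e] lists tokens expert e processes;
    dropped lists overflow tokens, which would skip the MoE layer.
    """
    capacity = expert_capacity(len(top1_choices), num_experts, capacity_factor)
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(top1_choices):
        if len(kept[expert]) < capacity:
            kept[expert].append(token)
        else:
            dropped.append(token)
    return kept, dropped


# 8 tokens, 4 experts, with expert 0 heavily oversubscribed.
choices = [0, 0, 0, 0, 1, 2, 3, 0]
kept_tight, dropped_tight = dispatch_with_capacity(choices, 4, capacity_factor=1.0)
kept_loose, dropped_loose = dispatch_with_capacity(choices, 4, capacity_factor=1.5)
```

Raising the factor from 1.0 to 1.5 lifts the budget from two to three tokens per expert and cuts the drops from three tokens to two, at the price of reserving a third slot on experts that only ever receive one token.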
Fine-grained experts
arxiv 2401.06066
DeepSeekMoE's contribution. Instead of N large experts, use 4N or 8N small experts (smaller intermediate FFN dim) and route each token to proportionally more of them, so the active-parameter budget stays the same. Empirically improves expert specialisation at that budget. Granularity is a knob, not a free win; see arxiv 2505.06839 for a recent analysis.
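The arithmetic behind the idea fits in a few lines. Splitting every expert into m pieces and routing to m times as many keeps the active FFN width constant while the number of distinct expert combinations a token can receive grows enormously. The configuration values (16 experts, top-2, FFN width 4096) and the helper names are hypothetical, chosen only to make the numbers concrete.

```python
from math import comb


def segment(num_experts, top_k, ffn_dim, m):
    """Split each expert into m smaller experts and route to m times as many."""
    if ffn_dim % m:
        raise ValueError("ffn_dim must be divisible by m")
    return {"num_experts": num_experts * m, "top_k": top_k * m, "ffn_dim": ffn_dim // m}


def active_ffn_width(cfg):
    """Total intermediate width a single token passes through."""
    return cfg["top_k"] * cfg["ffn_dim"]


def routing_combinations(cfg):
    """Number of distinct expert subsets the router can select."""
    return comb(cfg["num_experts"], cfg["top_k"])


coarse = {"num_experts": 16, "top_k": 2, "ffn_dim": 4096}
fine = segment(**coarse, m=4)

print(active_ffn_width(coarse), routing_combinations(coarse))
print(active_ffn_width(fine), routing_combinations(fine))
```

Both configurations activate the same FFN width per token, but the fine-grained one chooses among billions of expert subsets instead of 120. Whether that extra combinatorial freedom turns into better models is the empirical question.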
Shared experts
DeepSeekMoE / Qwen3
One or more experts that run for every token, alongside the routed experts. The idea is that some computation is genuinely common to all tokens (basic syntax, generic semantics) and should not have to be re-learned in every routed expert. Used in DeepSeekMoE and DeepSeek-V2/V3. Qwen3 dropped shared experts in favour of more routed experts plus global-batch balancing.
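A toy forward pass makes the split concrete: shared experts run unconditionally, routed experts run only when the router selected them, and the outputs are summed. Experts here are plain Python callables on lists and the residual connection is left out; the function name and the `gates` layout are assumptions for illustration.

```python
def moe_with_shared(x, shared_experts, routed_experts, gates):
    """Combine always-on shared experts with sparsely routed experts.

    x: input vector as a list of floats.
    shared_experts: callables applied to every token.
    routed_experts: callables, indexed by expert id.
    gates: {expert_id: weight} for the experts the router selected.
    """
    out = [0.0] * len(x)
    for expert in shared_experts:  # runs for every token, no gate
        out = [o + y for o, y in zip(out, expert(x))]
    for expert_id, weight in gates.items():  # only the selected experts run
        out = [o + weight * y for o, y in zip(out, routed_experts[expert_id](x))]
    return out


# Toy experts standing in for FFN blocks.
shared = [lambda v: [2.0 * a for a in v]]
routed = [
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [10.0 * a for a in v],  # not selected below, so never called
]

y = moe_with_shared([1.0, 2.0], shared, routed, gates={0: 0.75, 1: 0.25})
```

The shared expert contributes to the output regardless of what the router does, which is why it can absorb computation that every token needs.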
Auxiliary-loss-free balance
arxiv 2412.19437
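DeepSeek-V3's contribution. Instead of an auxiliary loss term that competes with cross-entropy, maintain a per-expert bias that is added to the routing scores when selecting the top-k experts and is adjusted online after each step: lowered for overloaded experts, raised for underloaded ones. The bias affects only which experts are selected, not the gate weights applied to their outputs. The model still uses a small sequence-wise balance loss as a safety net.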
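A minimal sketch of bias-adjusted routing under stated assumptions: scores are positive affinities (so they can be normalised directly), an expert counts as overloaded when its load exceeds the batch mean, and the bias moves by a fixed step `gamma`. The function names and the step size are illustrative, not taken from the report.

```python
def select_with_bias(scores, bias, k):
    """Pick top-k experts by (score + bias); weight them by the unbiased score."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)
    return [(e, scores[e] / norm) for e in chosen]


def update_bias(bias, load, gamma=0.001):
    """After a step, nudge each bias against the expert's observed load."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > mean_load:
            new_bias.append(b - gamma)  # overloaded: make it less attractive
        elif tokens < mean_load:
            new_bias.append(b + gamma)  # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias


scores = [0.9, 0.8, 0.1, 0.2]
unbiased = select_with_bias(scores, [0.0, 0.0, 0.0, 0.0], k=2)
biased = select_with_bias(scores, [-0.5, 0.0, 0.5, 0.0], k=2)
stepped = update_bias([0.0, 0.0, 0.0, 0.0], load=[10, 2, 2, 2], gamma=0.1)
```

With the bias applied, expert 2 displaces expert 0 in the selection, but its gate weight is still computed from its raw score of 0.1. The bias steers traffic without inflating the contribution of an expert the router considers a weak match.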
Why MoE matters — the inference-cost argument
Active vs total — the FLOPs picture
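A back-of-envelope view using only the total and active parameter counts quoted in the timeline. The FLOPs estimate relies on the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, which ignores attention over the context; treat it as an assumption for rough comparison, not a serving-cost model.

```python
# (total parameters, active parameters) in billions, as quoted in the timeline.
MODELS = {
    "GLaM": (1200, 97),
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DeepSeek-V3": (671, 37),
    "Qwen3-235B-A22B": (235, 22),
    "Qwen3-30B-A3B": (30, 3),
}


def active_fraction(total_b, active_b):
    """Share of the checkpoint that is multiplied for a single token."""
    return active_b / total_b


def forward_gflops_per_token(active_b):
    """Rule of thumb (assumption): ~2 FLOPs per active parameter per token."""
    return 2 * active_b  # billions of parameters -> GFLOPs


for name, (total_b, active_b) in MODELS.items():
    print(f"{name:16s} active/total = {active_fraction(total_b, active_b):6.1%}  "
          f"~{forward_gflops_per_token(active_b)} GFLOPs/token")
```

The ratios vary widely: Mixtral 8x7B activates over a quarter of its weights per token, DeepSeek-V3 under 6 percent. All of the weights still have to sit in memory, which is why the FLOPs saving does not translate one-to-one into serving cost.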
Honest caveats and known failure modes
Three things to remember when reading MoE marketing copy.
First, total-parameter counts are not directly comparable to dense-model parameter counts. A 671B MoE is not '10x GPT-3' in any operational sense. The fair comparison is active parameters versus dense parameters at matched training tokens.
Second, MoE training is more brittle. Router z-loss, fine-grained experts, and auxiliary-loss-free balance are all responses to real training collapses that happened to real teams. The OpenMoE paper (arxiv 2402.01739) found that routing decisions get locked in extremely early and barely change afterwards: context-independent specialisation by token ID is the dominant regime. That is a finding worth absorbing before claiming an MoE has 'learned' something semantically sophisticated about its expert layout.
Third, MoE inference at low batch sizes is bandwidth-bound in unintuitive ways. Each token may need to pull weights for a different subset of experts from HBM; the FLOPs win is real, but the memory-bandwidth win is smaller and sometimes absent. Provider pricing reflects this; check provider docs for current per-token pricing rather than reading off active-parameter counts.
As of June 2026 this is a best-effort summary. New MoEs ship weekly; specific routing-mechanism details for vendor models should be confirmed against the official technical report or model card before relying on them in production.
Open questions, as of mid-2026
- Is fine-grained expert segmentation a strictly better default, or does it just trade load-balance ease for routing-overhead cost? The 2025 analysis in arxiv 2505.06839 finds that granularity boosts expressivity but with diminishing returns; where the practical optimum sits is still open.
- Does the OpenMoE finding (routing locks in early, mostly by token ID) generalise to large auxiliary-loss-free models like DeepSeek-V3? No public replication yet.
- Are shared experts a permanent design feature or a transitional crutch? Qwen3 abandoned them; DeepSeek-V3 kept them. Neither paper presents a clean ablation against a matched baseline.
- What is the right load-balance mechanism — auxiliary loss, expert choice, or bias-adjusted routing? DeepSeek-V3 argues the loss-free approach is cleaner; nobody has run a head-to-head at matched scale.
- When does MoE stop helping? At very small total-parameter counts the routing overhead dominates. At very large counts, communication costs can dominate. The middle band where MoE wins is wider than dense-model advocates claim and narrower than MoE marketing implies.
Sources
- [01]
Shazeer et al. 2017 introduce the sparsely-gated mixture-of-experts layer and demonstrate greater than 1000x capacity gains over dense baselines.
arxiv.org/abs/1701.06538
- [02]
Fedus, Zoph, Shazeer 2021 introduce Switch Transformer with top-1 routing, capacity factor, and a 1.6 trillion-parameter scaling demonstration.
arxiv.org/abs/2101.03961
- [03]
Du et al. 2022 GLaM paper reports a 1.2T-parameter MoE activating roughly 96.6B parameters per token, with 64 experts per MoE layer and top-2 routing.
arxiv.org/abs/2112.06905
- [04]
Zoph et al. 2022 ST-MoE paper introduces router z-loss as the standard mechanism for stabilising MoE training against router-logit blowup.
arxiv.org/abs/2202.08906
- [05]
Zhou et al. 2022 introduce expert-choice routing in which experts pick top-k tokens, guaranteeing perfect load balance and reporting >2x convergence speedup over Switch top-1 and GShard top-2.
arxiv.org/abs/2202.09368
- [06]
Mistral's Mixtral of Experts paper specifies 8 experts per layer, top-2 routing, ~47B total / ~13B active parameters for Mixtral 8x7B.
arxiv.org/abs/2401.04088
- [07]
DeepSeekMoE paper introduces fine-grained expert segmentation and shared-expert isolation as MoE architectural primitives.
arxiv.org/abs/2401.06066
- [08]
OpenMoE paper releases 650M-34B open MoE checkpoints and reports that routing decisions are predominantly context-independent and locked in early in training.
arxiv.org/abs/2402.01739
- [09]
DeepSeek-V3 technical report specifies 671B total parameters, 37B active per token, auxiliary-loss-free load balancing, multi-token prediction, and 2.788M H800 GPU-hours full training.
arxiv.org/abs/2412.19437
- [10]
Qwen3 technical report specifies Qwen3-235B-A22B and Qwen3-30B-A3B MoE checkpoints with 128 experts per layer, top-8 routing, no shared expert, global-batch balance loss.
arxiv.org/abs/2505.09388
- [11]
IBM Granite 3.0 announcement specifies Granite-3.0-3B-A800M and Granite-3.0-1B-A400M as the MoE entries with 800M and 400M active parameters respectively, trained on 10T tokens.
ibm.com/new/announcements/ibm-granite-3-0-open-state-of-the-art-enterprise-models
- [12]
Mixtral 8x22B reference card confirms 141B total parameters with approximately 39B active per token via top-2 routing across 8 experts.
huggingface.co/mistral-community/Mixtral-8x22B-v0.1
- [13]
Google's GLaM announcement reports the model uses one-third of GPT-3's training energy and half its inference FLOPs while exceeding GPT-3 on average across 29 NLP tasks.
research.google/blog/more-efficient-in-context-learning-with-glam/
- [14]
A 2025 analysis paper finds that expert granularity boosts MoE expressivity but with diminishing returns past a certain split ratio.
arxiv.org/abs/2505.06839