Mixture-of-experts, end to end

A working atlas of the sparse models that decoupled training compute from inference cost.

A mixture-of-experts (MoE) language model is a transformer in which most of the feed-forward layers have been replaced by a bank of parallel sub-networks called experts. A small router network, attached to each MoE layer, picks a small subset of those experts to run for each token. The other experts sit on disk or in HBM and do nothing for that token. The model holds a very large number of total parameters and pays for a much smaller number of active parameters per forward pass.

That single architectural choice is what makes a 671-billion-parameter open-weight model practical to serve. DeepSeek-V3, released December 2024, has 671B total parameters but activates only 37B per token (DeepSeek-AI, arxiv 2412.19437). Mixtral 8x22B has 141B total, 39B active (Mistral, April 2024). Qwen3-235B-A22B has 235B total, 22B active (Qwen Team, May 2025). The pattern is consistent: training-time capacity expands roughly linearly with total parameters, while serving cost tracks active parameters plus a routing overhead that, when implemented well, is small.

This page collects the lineage. It walks from Shazeer et al.'s 2017 sparsely-gated MoE layer, through the Switch Transformer that made top-1 routing work at scale, through GLaM, Mixtral, OpenMoE, DeepSeekMoE, Qwen-MoE, and IBM's Granite. For each we note total parameters, active parameters per token, the routing rule, and the training tricks that kept it stable. Numbers are pulled from primary sources (arxiv preprints, official model cards, vendor announcements) and footnoted. Where a number is provider-reported and not independently verified, we say so. Where a claim is best-effort as of June 2026, we say so.

The aim is the working knowledge a practitioner needs to read MoE papers, size hardware, and tell honest scaling from marketing copy. A short glossary comes first, followed by the timeline, the model table, the routing mechanics, the training stabilizers, the inference cost picture, and a sober list of what is still hard.

Vocabulary you need before reading any MoE paper

Expert — one of the parallel feed-forward sub-networks inside an MoE layer. In dense transformers, the FFN is a single block; in MoE, it is replaced by N experts of similar shape.
Router (or gate) — a small learned network that, given a token's hidden state, outputs a score for each expert. Usually a linear projection followed by softmax.
Top-k routing — the routing rule. Top-1 sends each token to exactly one expert (Switch Transformer); top-2 sends it to two (GShard, Mixtral); higher k shows up mainly in fine-grained designs with many small experts (Qwen3 uses top-8 of 128).
Total parameters — the size of the saved checkpoint, dominated by expert weights.
Active parameters per token — what is actually multiplied during one forward pass: shared layers (attention, embedding, norms) plus the k experts the router chose. This sets serving FLOPs and latency.
Capacity factor — a per-expert token budget. If too many tokens want the same expert, the overflow gets dropped or routed to a fallback. Capacity = (tokens / num_experts) × capacity_factor.
Load-balancing loss — an auxiliary training loss that pushes the router to spread tokens evenly across experts, so no expert starves and none gets overworked.
Router z-loss — a separate auxiliary loss, introduced in ST-MoE, that penalises large router logits to keep softmax numerically stable at scale.
Expert choice — an alternative routing rule (Zhou et al. 2022) where experts pick their top-k tokens instead of tokens picking their top-k experts. Guarantees perfect load balance by construction.
Shared expert — an expert that runs for every token, used alongside sparsely-routed experts to capture common knowledge (DeepSeekMoE convention).

Timeline

Jan 2017
Sparsely-gated MoE layer (Shazeer et al.)
arxiv 1701.06538. Up to 137B parameters in an LSTM language model, top-k softmax gating with k between 2 and 4, noisy gating for exploration, importance + load auxiliary losses. The first paper to show that sparse activation could deliver greater than 1000x capacity gains with manageable compute overhead on GPU clusters.
Jan 2021
Switch Transformer (Fedus, Zoph, Shazeer)
arxiv 2101.03961. Simplified routing to top-1 (one expert per token), introduced capacity factor and overflow handling, and scaled to 1.6 trillion parameters on the C4 corpus. Reported a 4x pretraining speedup over a tuned T5-XXL dense baseline at matched compute.
Dec 2021
GLaM (Du et al., Google)
arxiv 2112.06905. 1.2T total parameters, 64 experts per MoE layer, top-2 routing, roughly 97B activated per token. Reported using one-third the training energy of GPT-3 and half its inference FLOPs while beating it on average across 29 NLP tasks zero- and one-shot.
Feb 2022
Expert choice routing (Zhou et al.)
arxiv 2202.09368. Flips the routing direction: each expert picks its top-k tokens instead of each token picking its top-k experts. Guarantees uniform expert load without an auxiliary balance loss, at the cost of giving up the contract that every token gets routed.
Feb 2022
ST-MoE and router z-loss (Zoph et al.)
arxiv 2202.08906. Diagnosed the training-instability mode where router logits grow unboundedly and softmax explodes. Added an auxiliary log-sum-exp penalty on router logits (the z-loss), which has since been widely adopted in large MoE training stacks.
Dec 2023 / Jan 2024
Mixtral 8x7B (Mistral)
arxiv 2401.04088. Eight experts per layer, top-2 routing, 47B total parameters with about 13B active per token. The first widely-deployed open-weights MoE, distributed under Apache 2.0. Made MoE inference a practical concern for the open-source community.
Jan 2024
DeepSeekMoE (Dai et al.)
arxiv 2401.06066. Introduced fine-grained expert segmentation (splitting each expert's FFN into smaller pieces and increasing N) and shared-expert isolation (a few always-on experts for common knowledge). Established the design template that DeepSeek-V2 and V3 would later scale.
Apr 2024
Mixtral 8x22B (Mistral)
Eight experts, top-2 routing, 141B total parameters with about 39B active. Released under Apache 2.0 with a 64k context window. Reference architecture, no separate paper; details are in the Mistral release notes and the original Mixtral arxiv 2401.04088 lineage.
Oct 2024
IBM Granite 3.0 MoE
Small enterprise-grade MoE: Granite-3.0-1B-A400M and Granite-3.0-3B-A800M, sized for CPU servers and on-device inference. The 'A400M' / 'A800M' suffix is the active parameter count. Trained on 10T tokens. Released under Apache 2.0.
Dec 2024
DeepSeek-V3 (DeepSeek-AI)
arxiv 2412.19437. 671B total parameters, 37B active per token. Pioneered an auxiliary-loss-free load-balancing strategy (bias-adjusted routing) and a multi-token prediction training objective. Reported 2.788M H800 GPU-hours for full training — far below comparable closed dense models.
May 2025
Qwen3 MoE (Alibaba Qwen Team)
arxiv 2505.09388. Two MoE checkpoints: Qwen3-235B-A22B (235B total, 22B active) and Qwen3-30B-A3B (30B total, 3B active). 128 experts per layer, top-8 routing, no shared expert. Adopted a global-batch load balancing loss for expert specialisation.

The model index

Model	Released	Total params	Active / token	Routing	Notes
Shazeer et al. MoE LSTM	Jan 2017	up to ~137B	varies (top-k softmax, k=2-4)	Noisy top-k	First production-scale sparse MoE. arxiv 1701.06538.
Switch Transformer	Jan 2021	up to 1.6T	shared + 1 expert	Top-1	Capacity factor, overflow drop. arxiv 2101.03961.
GLaM	Dec 2021	1.2T	~96.6B	Top-2	64 experts/layer, 32 MoE layers. arxiv 2112.06905.
Mixtral 8x7B	Dec 2023	~47B	~13B	Top-2	8 experts/layer. arxiv 2401.04088. Apache 2.0.
DeepSeekMoE 16B	Jan 2024	16B	~2.8B	Top-k, fine-grained + shared	Architecture template for V2/V3. arxiv 2401.06066.
OpenMoE	Feb 2024	650M to 34B	varies by size	Top-k	Fully open recipe + data + checkpoints. arxiv 2402.01739.
Mixtral 8x22B	Apr 2024	141B	~39B	Top-2	64k context. Apache 2.0.
Granite 3.0-3B-A800M	Oct 2024	3B	800M	Top-k	IBM. CPU / edge target. Apache 2.0.
Granite 3.0-1B-A400M	Oct 2024	1B	400M	Top-k	IBM. Same family, smaller.
DeepSeek-V3	Dec 2024	671B	37B	Top-k, aux-loss-free + bias	Multi-token prediction. arxiv 2412.19437.
Qwen3-235B-A22B	May 2025	235B	22B	Top-8 of 128	Global-batch load balance loss. arxiv 2505.09388.
Qwen3-30B-A3B	May 2025	30B	3B	Top-8 of 128	Smaller sibling of the 235B model.

How routing actually works

At each MoE layer, the router is a linear projection from the token's hidden state (of dimension d) to a vector of length N (the number of experts), followed by softmax. Call that score vector s. Top-k routing selects the k largest entries of s; the token's output is the weighted sum of those k experts' outputs, weighted by their (renormalised) softmax scores. That is the whole inner loop.

The choice of k matters. Top-1 (Switch Transformer) halves communication cost relative to top-2 in distributed setups and simplifies gradients, but loses the ability to combine expertise. Top-2 (GShard, Mixtral, GLaM) is the most common setting in production: it gets most of the routing flexibility with modest extra cost. Top-8 of 128 (Qwen3) is a fine-grained variant: many small experts, each token still touches a small fraction, but the combinatorics are richer.

Expert choice (Zhou et al., arxiv 2202.09368) inverts the contract. Instead of each token claiming its top-k experts, each expert claims its top-k tokens. By construction, every expert sees the same number of tokens, which removes the need for a load-balancing loss. The price is that some tokens may not be claimed by any expert; the paper handles this and reports more than 2x faster pretraining convergence at matched compute relative to Switch top-1 and GShard top-2.

DeepSeek-V3 contributes a more recent twist: an auxiliary-loss-free strategy in which the router maintains a per-expert bias that is nudged up or down to enforce balance, without adding a loss term that competes with the cross-entropy objective. The reported benefit is that the router can specialise more freely without being penalised by a balance term.
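The token-choice inner loop described above (project, softmax, keep the top k, renormalise, take the weighted sum) can be sketched in plain Python. This is a minimal illustration for a single token, not any model's actual implementation: the names `route_top_k` and `moe_layer`, the toy router weights, and the "experts" that merely scale their input are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(hidden, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for one token.

    hidden:         list of d floats (the token's hidden state)
    router_weights: N rows of d floats (one row per expert)
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    scores = softmax(logits)
    # Indices of the k largest scores, best first.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalise so the selected gate weights sum to 1.
    norm = sum(scores[i] for i in chosen)
    return [(i, scores[i] / norm) for i in chosen]

def moe_layer(hidden, router_weights, experts, k):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    out = [0.0] * len(hidden)
    for idx, gate in route_top_k(hidden, router_weights, k):
        expert_out = experts[idx](hidden)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy setup (hypothetical): d = 2, N = 4 experts, expert i scales its input.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
token = [2.0, 1.0]

routes = route_top_k(token, router_weights, k=2)
output = moe_layer(token, router_weights, experts, k=2)
print(routes)
print(output)
```

With these toy weights the router logits are 2, 1, -2, -1, so the token goes to experts 0 and 1 and the other two experts do no work for it. Setting `k=1` gives Switch-style routing with the same code.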

Training tricks that keep MoE stable

Load-balancing auxiliary loss

arxiv 2101.03961

Introduced in Shazeer 2017, standardised by Switch Transformer. An auxiliary term added to the cross-entropy loss that penalises the router for over- or under-using any expert. Switch Transformer uses a coefficient of 10^-2; the loss is the product of the fraction of tokens routed to each expert and the fraction of routing probability assigned to it, summed and scaled by N.
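As a rough sketch of the formula just described, the loss is alpha × N × Σ f_i × P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean routing probability assigned to it. The function name `switch_balance_loss` and the two toy batches are assumptions for illustration; top-1 dispatch is assumed when computing f_i.

```python
def switch_balance_loss(router_probs, alpha=1e-2):
    """router_probs: one row per token, each row a softmax over N experts."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]

    # P_i: mean routing probability assigned to expert i across the batch.
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Hypothetical batches: one spreads tokens evenly, one sends everything to expert 0.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = switch_balance_loss(balanced)
loss_collapsed = switch_balance_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

Under uniform routing the unscaled term equals 1, so the balanced batch costs alpha itself; the collapsed batch costs 2.8 times as much, which is the pressure that pushes the router back toward even usage.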

Router z-loss

arxiv 2202.08906

Introduced in ST-MoE (Zoph et al. 2022). Penalises the log-sum-exp of router logits to stop them growing unboundedly and blowing up softmax in bfloat16. It has since become a common stabiliser in large MoE training; whether a given vendor model uses it, or an equivalent, should be confirmed against that model's technical report.
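A minimal sketch of the penalty, assuming the ST-MoE form: the squared log-sum-exp of each token's router logits, averaged over the batch. The function names and toy logits are illustrative, and the scaling coefficient applied before adding this to the training loss is left out.

```python
import math

def logsumexp(logits):
    """Stable log(sum(exp(x))) for one token's router logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def router_z_loss(router_logits):
    """Mean over tokens of (log-sum-exp of the token's router logits) squared."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)

small_logits = [[0.0, 0.0, 0.0, 0.0]]     # well-behaved router
large_logits = [[20.0, 0.0, 0.0, 0.0]]    # one logit has drifted far out

z_small = router_z_loss(small_logits)
z_large = router_z_loss(large_logits)
print(z_small, z_large)
```

The penalty grows roughly quadratically with the largest logit, so it discourages exactly the drift that makes low-precision softmax unstable while leaving small logits almost untouched.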

Capacity factor and overflow

Switch Transformer §2.2

Each expert gets a fixed per-batch token budget. If more tokens want it than the budget allows, the excess is dropped (the token passes through on the residual connection unchanged) or rerouted. The Switch Transformer paper reports results at capacity factors of 1.0, 1.25, and 2.0. Setting it too low drops too many tokens; setting it too high wastes memory and compute on padding.
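The budget and overflow rule can be sketched as follows, using the glossary formula capacity = (tokens / num_experts) × capacity_factor. Rounding up with `math.ceil` and filling each expert first-come-first-served are assumptions of this sketch, and `dispatch_top1` is a hypothetical helper name.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Per-expert token budget; rounding up is an assumption of this sketch."""
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch_top1(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen for token t.

    Returns (kept, dropped): kept is a list of (token, expert) pairs,
    dropped is a list of token indices that overflowed their expert.
    """
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token, expert))
        else:
            dropped.append(token)  # skips the expert, rides the residual path
    return kept, dropped

# Eight tokens, four experts, and a router that over-favours expert 0.
assignments = [0, 0, 0, 0, 1, 2, 3, 0]
kept_tight, dropped_tight = dispatch_top1(assignments, 4, capacity_factor=1.0)
kept_loose, dropped_loose = dispatch_top1(assignments, 4, capacity_factor=1.5)
print(dropped_tight, dropped_loose)
```

At a factor of 1.0 each expert may take two tokens, so three of the five tokens bound for expert 0 are dropped; at 1.5 the budget is three and only two are dropped, at the price of reserving more buffer space for every expert.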

Fine-grained experts

arxiv 2401.06066

DeepSeekMoE's contribution. Instead of N large experts, use 4N or 8N small experts (smaller intermediate FFN dim) and route to more of them. Empirically improves expert specialisation at the same active-parameter budget. Granularity is a knob, not a free win — see arxiv 2505.06839 for a recent analysis.
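One way to see why finer granularity can help specialisation is to count the distinct expert subsets a token can be routed to while the active-parameter budget stays fixed. The numbers below (16 experts with top-2, versus each expert split into 4 to give 64 experts with top-8) are illustrative choices, not a description of any specific checkpoint.

```python
import math

def routing_combinations(num_experts, top_k):
    """Number of distinct expert subsets a single token can be sent to."""
    return math.comb(num_experts, top_k)

# Coarse layout: 16 experts of FFN width m, route to 2 -> 2 * m active width.
coarse = routing_combinations(16, 2)

# Fine-grained layout: split each expert in 4 -> 64 experts of width m / 4,
# route to 8 -> 8 * (m / 4) = 2 * m active width, the same budget as above.
fine = routing_combinations(64, 8)

print(coarse)
print(fine)
print(fine // coarse)
```

The active compute is identical in both layouts, but the router has vastly more combinations to choose from in the fine-grained one. Whether that extra flexibility pays for the added routing overhead is the open question the text raises.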

Shared experts

DeepSeekMoE / Qwen3

One or more experts that run for every token, alongside the routed experts. The idea is that some computation is genuinely common to all tokens (basic syntax, generic semantics) and should not have to be re-learned in every routed expert. Used in DeepSeekMoE and DeepSeek-V2/V3. Qwen3 dropped shared experts in favour of more routed experts plus global-batch balancing.

Auxiliary-loss-free balance

arxiv 2412.19437

DeepSeek-V3's contribution. Instead of an auxiliary loss term that competes with cross-entropy, maintain a per-expert bias that is adjusted online to push the router toward balanced usage. The model still uses a small sequence-wise balance loss as a safety net.
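The mechanism can be sketched in a few lines, assuming the bias affects only which experts are selected while the gate weights still come from the unbiased scores, and assuming a fixed update step. The function names, the step size of 0.1, and the toy scores are hypothetical; the actual report should be consulted for the exact scoring function and update schedule.

```python
def select_with_bias(scores, bias, k):
    """Pick top-k by (score + bias); gate weights use the unbiased scores."""
    chosen = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    norm = sum(scores[i] for i in chosen)
    return [(i, scores[i] / norm) for i in chosen]

def update_bias(bias, loads, step):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.50, 0.30, 0.15, 0.05]   # one token's router scores over 4 experts
bias = [0.0, 0.0, 0.0, 0.0]

before = select_with_bias(scores, bias, k=2)   # experts 0 and 1 win
loads = [4, 4, 0, 0]                           # e.g. four such tokens in a batch
bias = update_bias(bias, loads, step=0.1)
after = select_with_bias(scores, bias, k=2)    # expert 2 now displaces expert 1
print(before, bias, after)
```

No gradient flows through the bias and no term is added to the loss; balance is steered purely by shifting the selection threshold between batches.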

Why MoE matters — the inference-cost argument

The headline reason MoE is interesting is that it decouples two numbers that are coupled in dense transformers: parameters and FLOPs per token. In a dense model, every parameter participates in every forward pass. A 70B-parameter dense model costs roughly 140 GFLOPs per token per forward pass (two FLOPs per parameter, ignoring attention). Doubling the parameters doubles the inference cost. There is no escape route at serving time.

In an MoE model, the active-parameter budget per token can be held constant while total parameters grow. Mixtral 8x22B has 141B parameters but does the per-token work of a ~39B-active model. DeepSeek-V3 has 671B but does the per-token work of a ~37B-active model. The costs surface in two places: (1) HBM and storage scale with total parameters, because all those weights still have to be held in memory somewhere, and (2) routing introduces an extra all-to-all communication step in distributed serving, which is non-trivial at high batch sizes. The Switch Transformer paper put substantial engineering effort into making that all-to-all cheap.

What MoE does not give you is free memory. A 671B-parameter MoE checkpoint is still 671B parameters of weights on disk. The inference benefit shows up in FLOPs, latency, and (with care) cost per token, not in VRAM requirements. As of June 2026, serving DeepSeek-V3 still needs roughly the same HBM as a 671B dense model of equivalent precision, typically in a multi-node, expert-parallel deployment, but the per-token compute is in the tens-of-billions-of-active-parameters range, not the hundreds. That is the trade the architecture is built around.
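The arithmetic behind that trade is short enough to write out. This sketch uses the same rule of thumb as the text (two FLOPs per active parameter, attention ignored) and assumes 2 bytes per parameter for weight storage, which is a bf16-style assumption rather than a statement about how any of these models is actually served.

```python
def flops_per_token(active_params):
    """Rough forward cost: two FLOPs per active parameter, attention ignored."""
    return 2 * active_params

def weight_bytes(total_params, bytes_per_param=2):
    """Memory for the weights alone; 2 bytes per parameter is an assumption."""
    return total_params * bytes_per_param

B = 10**9
models = {  # name: (total parameters, active parameters per token)
    "dense 70B": (70 * B, 70 * B),
    "Mixtral 8x22B": (141 * B, 39 * B),
    "DeepSeek-V3": (671 * B, 37 * B),
}

summary = {
    name: (flops_per_token(active) // B, weight_bytes(total) // B)
    for name, (total, active) in models.items()
}
for name, (gflops, gigabytes) in summary.items():
    print(f"{name:14s} ~{gflops:4d} GFLOPs/token  ~{gigabytes:5d} GB of weights")
```

Under these assumptions DeepSeek-V3 needs about half the per-token compute of the 70B dense model while holding nearly ten times the weight memory, which is the decoupling in one line.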

Active vs total — the FLOPs picture

Model	Total params	Active / token	Active / total ratio	Equivalent dense FLOPs class
Mixtral 8x7B	47B	~13B	~28%	~13B dense
Mixtral 8x22B	141B	~39B	~28%	~39B dense
GLaM (full)	1.2T	~96.6B	~8%	~96.6B dense
DeepSeek-V3	671B	37B	~5.5%	~37B dense
Qwen3-235B-A22B	235B	22B	~9%	~22B dense
Qwen3-30B-A3B	30B	3B	~10%	~3B dense
Granite 3.0-3B-A800M	3B	0.8B	~27%	~0.8B dense

Honest caveats and known failure modes

Three things to remember when reading MoE marketing copy.

First, total-parameter counts are not directly comparable to dense-model parameter counts. A 671B MoE is not '10x GPT-3' in any operational sense. The fair comparison is active parameters versus dense parameters at matched training tokens.

Second, MoE training is more brittle. Router z-loss, fine-grained experts, and auxiliary-loss-free balance are all responses to real training collapses that happened to real teams. The OpenMoE paper (arxiv 2402.01739) found that routing decisions get locked in extremely early and barely change afterwards, and that context-independent specialisation by token ID is the dominant regime. That is a finding worth absorbing before claiming an MoE has 'learned' something semantically sophisticated about its expert layout.

Third, MoE inference at low batch sizes is bandwidth-bound in unintuitive ways. Each token may need to pull weights for a different subset of experts from HBM; the FLOPs win is real, but the memory-bandwidth win is smaller and sometimes absent. Provider pricing reflects this, so check provider docs for current per-token pricing rather than reading it off active-parameter counts.

As of June 2026 this is a best-effort summary. New MoEs ship weekly; specific routing-mechanism details for vendor models should be confirmed against the official technical report or model card before relying on them in production.

Open questions, as of mid-2026

Is fine-grained expert segmentation a strictly better default, or does it just trade load-balance ease for routing-overhead cost? The 2025 arxiv 2505.06839 analysis says granularity boosts expressivity but flattens out; the question is open.
Does the OpenMoE finding (routing locks in early, mostly by token ID) generalise to large auxiliary-loss-free models like DeepSeek-V3? No public replication yet.
Are shared experts a permanent design feature or a transitional crutch? Qwen3 abandoned them; DeepSeek-V3 kept them. Neither paper presents a clean ablation against a matched baseline.
What is the right load-balance mechanism — auxiliary loss, expert choice, or bias-adjusted routing? DeepSeek-V3 argues the loss-free approach is cleaner; nobody has run a head-to-head at matched scale.
When does MoE stop helping? At very small total-parameter counts the routing overhead dominates. At very large counts, communication costs can dominate. The middle band where MoE wins is wider than dense-model advocates claim and narrower than MoE marketing implies.

Sources

[01]
Shazeer et al. 2017 introduce the sparsely-gated mixture-of-experts layer and demonstrate greater than 1000x capacity gains over dense baselines.
arxiv.org/abs/1701.06538
[02]
Fedus, Zoph, Shazeer 2021 introduce Switch Transformer with top-1 routing, capacity factor, and a 1.6 trillion-parameter scaling demonstration.
arxiv.org/abs/2101.03961
[03]
Du et al. 2022 GLaM paper reports a 1.2T-parameter MoE activating roughly 96.6B parameters per token, with 64 experts per MoE layer and top-2 routing.
arxiv.org/abs/2112.06905
[04]
Zoph et al. 2022 ST-MoE paper introduces router z-loss as the standard mechanism for stabilising MoE training against router-logit blowup.
arxiv.org/abs/2202.08906
[05]
Zhou et al. 2022 introduce expert-choice routing in which experts pick top-k tokens, guaranteeing perfect load balance and reporting >2x convergence speedup over Switch top-1 and GShard top-2.
arxiv.org/abs/2202.09368
[06]
Mistral's Mixtral of Experts paper specifies 8 experts per layer, top-2 routing, ~47B total / ~13B active parameters for Mixtral 8x7B.
arxiv.org/abs/2401.04088
[07]
DeepSeekMoE paper introduces fine-grained expert segmentation and shared-expert isolation as MoE architectural primitives.
arxiv.org/abs/2401.06066
[08]
OpenMoE paper releases 650M-34B open MoE checkpoints and reports that routing decisions are predominantly context-independent and locked in early in training.
arxiv.org/abs/2402.01739
[09]
DeepSeek-V3 technical report specifies 671B total parameters, 37B active per token, auxiliary-loss-free load balancing, multi-token prediction, and 2.788M H800 GPU-hours full training.
arxiv.org/abs/2412.19437
[10]
Qwen3 technical report specifies Qwen3-235B-A22B and Qwen3-30B-A3B MoE checkpoints with 128 experts per layer, top-8 routing, no shared expert, global-batch balance loss.
arxiv.org/abs/2505.09388
[11]
IBM Granite 3.0 announcement specifies Granite-3.0-3B-A800M and Granite-3.0-1B-A400M as the MoE entries with 800M and 400M active parameters respectively, trained on 10T tokens.
ibm.com/new/announcements/ibm-granite-3-0-open-state-of-the-art-enterprise-models
[12]
Mixtral 8x22B reference card confirms 141B total parameters with approximately 39B active per token via top-2 routing across 8 experts.
huggingface.co/mistral-community/Mixtral-8x22B-v0.1
[13]
Google's GLaM announcement reports the model uses one-third of GPT-3's training energy and half its inference FLOPs while exceeding GPT-3 on average across 29 NLP tasks.
research.google/blog/more-efficient-in-context-learning-with-glam/
[14]
A 2025 analysis paper finds that expert granularity boosts MoE expressivity but with diminishing returns past a certain split ratio.
arxiv.org/abs/2505.06839
