

Mixture-of-experts, end to end

A working atlas of the sparse models that decoupled training compute from inference cost.

A mixture-of-experts (MoE) language model is a transformer in which most of the feed-forward layers have been replaced by a bank of parallel sub-networks called experts. A small router network, attached to each MoE layer, picks a small subset of those experts to run for each token. The other experts sit on disk or in HBM and do nothing for that token. The model holds a very large number of total parameters and pays for a much smaller number of active parameters per forward pass.

That single architectural choice is what makes a 671-billion-parameter open-weight model practical to serve. DeepSeek-V3, released December 2024, has 671B total parameters but activates only 37B per token (DeepSeek-AI, arxiv 2412.19437). Mixtral 8x22B has 141B total, 39B active (Mistral, April 2024). Qwen3-235B-A22B has 235B total, 22B active (Qwen Team, May 2025). The pattern is consistent: model capacity scales with total parameters, while serving cost tracks active parameters plus a routing overhead that, when implemented well, is small.

This page collects the lineage. It walks from Shazeer et al.'s 2017 sparsely-gated MoE layer, through the Switch Transformer that made top-1 routing work at scale, through GLaM, Mixtral, OpenMoE, DeepSeekMoE, Qwen3 MoE, and IBM's Granite. For each we note total parameters, active parameters per token, the routing rule, and the training tricks that kept it stable.

Numbers are pulled from primary sources (arxiv preprints, official model cards, vendor announcements) and footnoted. Where a number is provider-reported and not independently verified, we say so. Where a claim is best-effort as of June 2026, we say so. The aim is the working knowledge a practitioner needs to read MoE papers, size hardware, and tell honest scaling from marketing copy.

A short glossary appears in the first section, followed by the timeline, the model table, the routing mechanics, the training stabilizers, the inference cost picture, and a sober list of what is still hard.

Vocabulary you need before reading any MoE paper

  • Expert — one of the parallel feed-forward sub-networks inside an MoE layer. In dense transformers, the FFN is a single block; in MoE, it is replaced by N experts of similar shape.
  • Router (or gate) — a small learned network that, given a token's hidden state, outputs a score for each expert. Usually a linear projection followed by softmax.
  • Top-k routing — the routing rule. Top-1 sends each token to exactly one expert (Switch Transformer); top-2 sends it to two (GShard, Mixtral); fine-grained designs use a higher k over many smaller experts (Qwen3 routes top-8 of 128).
  • Total parameters — the size of the saved checkpoint, dominated by expert weights.
  • Active parameters per token — what is actually multiplied during one forward pass: shared layers (attention, embedding, norms) plus the k experts the router chose. This sets serving FLOPs and latency.
  • Capacity factor — a per-expert token budget. If too many tokens want the same expert, the overflow gets dropped or routed to a fallback. Capacity = (tokens / num_experts) × capacity_factor.
  • Load-balancing loss — an auxiliary training loss that pushes the router to spread tokens evenly across experts, so no expert starves and none gets overworked.
  • Router z-loss — a separate auxiliary loss, introduced in ST-MoE, that penalises large router logits to keep softmax numerically stable at scale.
  • Expert choice — an alternative routing rule (Zhou et al. 2022) where experts pick their top-k tokens instead of tokens picking their top-k experts. Guarantees perfect load balance by construction.
  • Shared expert — an expert that runs for every token, used alongside sparsely-routed experts to capture common knowledge (DeepSeekMoE convention).
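
The total versus active distinction can be checked with a little arithmetic. The sketch below counts parameters from the hyperparameters listed in Table 1 of the Mixtral paper (arxiv 2401.04088): dimension 4096, 32 layers, FFN hidden dimension 14336, 32 attention heads with 8 key/value heads of head dimension 128, vocabulary 32000, 8 experts, top-2. The function name `moe_param_counts`, the assumption of three weight matrices per expert, untied embeddings, and the decision to ignore norm weights are choices of this sketch, not something the source page specifies.

```python
def moe_param_counts(dim, n_layers, ffn_hidden, n_heads, n_kv_heads,
                     head_dim, vocab, n_experts, top_k):
    """Return (total, active) parameter counts for a Mixtral-style decoder.

    Assumptions of this sketch: gated FFN experts with three weight
    matrices each, grouped-query attention, untied input and output
    embeddings, norm weights ignored.
    """
    expert = 3 * dim * ffn_hidden                  # gate, up, down projections
    attn = (dim * n_heads * head_dim               # query projection
            + 2 * dim * n_kv_heads * head_dim      # key and value projections
            + n_heads * head_dim * dim)            # output projection
    router = dim * n_experts                       # one linear layer per MoE block
    embed = 2 * vocab * dim                        # input embedding + LM head
    shared = n_layers * (attn + router) + embed    # runs for every token
    total = shared + n_layers * n_experts * expert
    active = shared + n_layers * top_k * expert
    return total, active


total, active = moe_param_counts(4096, 32, 14336, 32, 8, 128, 32000, 8, 2)
print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Under these assumptions the counts land on the roughly 47B total and 13B active figures quoted for Mixtral 8x7B, and they show why total parameters are dominated by expert weights while the shared layers are a small fixed cost.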

Timeline

  1. Jan 2017

    Sparsely-gated MoE layer (Shazeer et al.)

    arxiv 1701.06538. Up to 137B parameters in an LSTM language model, top-k softmax gating with k between 2 and 4, noisy gating for exploration, importance + load auxiliary losses. The first paper to show that sparse activation could deliver greater than 1000x capacity gains with manageable compute overhead on GPU clusters.

  2. Jan 2021

    Switch Transformer (Fedus, Zoph, Shazeer)

    arxiv 2101.03961. Simplified routing to top-1 (one expert per token), introduced capacity factor and overflow handling, and scaled to 1.6 trillion parameters on the C4 corpus. Reported a 4x pretraining speedup over a tuned T5-XXL dense baseline at matched compute.

  3. Dec 2021

    GLaM (Du et al., Google)

    arxiv 2112.06905. 1.2T total parameters, 64 experts per MoE layer, top-2 routing, roughly 97B activated per token. Reported using one-third the training energy of GPT-3 and half its inference FLOPs while beating it on average across 29 NLP tasks zero- and one-shot.

  4. Feb 2022

    Expert choice routing (Zhou et al.)

    arxiv 2202.09368. Flips the routing direction: each expert picks its top-k tokens instead of each token picking its top-k experts. Guarantees uniform expert load without an auxiliary balance loss, at the cost of giving up the contract that every token gets routed.

  5. Feb 2022

    ST-MoE and router z-loss (Zoph et al.)

    arxiv 2202.08906. Diagnosed the training-instability mode where router logits grow unboundedly and the softmax becomes numerically unstable. Added an auxiliary penalty on the squared log-sum-exp of the router logits (the z-loss), which has since been widely adopted in large MoE training stacks; whether a specific model uses it should be checked against its technical report.

  6. Dec 2023 / Jan 2024

    Mixtral 8x7B (Mistral)

    arxiv 2401.04088. Eight experts per layer, top-2 routing, 47B total parameters with about 13B active per token. The first widely-deployed open-weights MoE, distributed under Apache 2.0. Made MoE inference a practical concern for the open-source community.

  7. Jan 2024

    DeepSeekMoE (Dai et al.)

    arxiv 2401.06066. Introduced fine-grained expert segmentation (splitting each expert's FFN into smaller pieces and increasing N) and shared-expert isolation (a few always-on experts for common knowledge). Established the design template that DeepSeek-V2 and V3 would later scale.

  8. Apr 2024

    Mixtral 8x22B (Mistral)

    Eight experts, top-2 routing, 141B total parameters with about 39B active. Released under Apache 2.0 with a 64k context window. There is no separate paper: the architecture follows the Mixtral 8x7B paper (arxiv 2401.04088), and the details are in the Mistral release notes.

  9. Oct 2024

    IBM Granite 3.0 MoE

    Small enterprise-grade MoE: Granite-3.0-1B-A400M and Granite-3.0-3B-A800M, sized for CPU servers and on-device inference. The 'A400M' / 'A800M' suffix is the active parameter count. Trained on 10T tokens. Released under Apache 2.0.

  10. Dec 2024

    DeepSeek-V3 (DeepSeek-AI)

    arxiv 2412.19437. 671B total parameters, 37B active per token. Pioneered an auxiliary-loss-free load-balancing strategy (bias-adjusted routing) and trains with a multi-token prediction objective. Reported 2.788M H800 GPU-hours for full training (provider-reported), a figure the report presents as economical for a model of this scale.

  11. May 2025

    Qwen3 MoE (Alibaba Qwen Team)

    arxiv 2505.09388. Two MoE checkpoints: Qwen3-235B-A22B (235B total, 22B active) and Qwen3-30B-A3B (30B total, 3B active). 128 experts per layer, top-8 routing, no shared expert. Adopted a global-batch load balancing loss for expert specialisation.

The model index

| Model | Released | Total params | Active / token | Routing | Notes |
|---|---|---|---|---|---|
| Shazeer et al. MoE LSTM | Jan 2017 | up to ~137B | varies (top-k softmax, k=2-4) | Noisy top-k | First production-scale sparse MoE. arxiv 1701.06538. |
| Switch Transformer | Jan 2021 | up to 1.6T | shared + 1 expert | Top-1 | Capacity factor, overflow drop. arxiv 2101.03961. |
| GLaM | Dec 2021 | 1.2T | ~96.6B | Top-2 | 64 experts/layer, 32 MoE layers. arxiv 2112.06905. |
| Mixtral 8x7B | Dec 2023 | ~47B | ~13B | Top-2 | 8 experts/layer. arxiv 2401.04088. Apache 2.0. |
| DeepSeekMoE 16B | Jan 2024 | 16B | ~2.8B | Top-k, fine-grained + shared | Architecture template for V2/V3. arxiv 2401.06066. |
| Mixtral 8x22B | Apr 2024 | 141B | ~39B | Top-2 | 64k context. Apache 2.0. |
| OpenMoE | Feb 2024 | 650M to 34B | varies by size | Top-k | Fully open recipe + data + checkpoints. arxiv 2402.01739. |
| Granite 3.0-3B-A800M | Oct 2024 | 3B | 800M | Top-k | IBM. CPU / edge target. Apache 2.0. |
| Granite 3.0-1B-A400M | Oct 2024 | 1B | 400M | Top-k | IBM. Same family, smaller. |
| DeepSeek-V3 | Dec 2024 | 671B | 37B | Top-k, aux-loss-free + bias | Multi-token prediction. arxiv 2412.19437. |
| Qwen3-235B-A22B | May 2025 | 235B | 22B | Top-8 of 128 | Global-batch load balance loss. arxiv 2505.09388. |
| Qwen3-30B-A3B | May 2025 | 30B | 3B | Top-8 of 128 | Smaller sibling of the 235B model. |

How routing actually works

At each MoE layer, the router is a linear projection from the token's hidden state (of dimension d) to a vector of length N (the number of experts), followed by softmax. Call that score vector s. Top-k routing selects the k largest entries of s; the token's output is the weighted sum of those k experts' outputs, weighted by their (renormalised) softmax scores. That is the whole inner loop.

The choice of k matters. Top-1 (Switch Transformer) roughly halves routing communication relative to top-2 in distributed setups and simplifies gradients, but loses the ability to combine several experts for one token. Top-2 (GShard, Mixtral, GLaM) is the most common setting in production: it gets most of the routing flexibility with modest extra cost. Top-8 of 128 (Qwen3) is a fine-grained variant: many small experts, each token still touches a small fraction, but the combinatorics are richer.

Expert choice (Zhou et al., arxiv 2202.09368) inverts the contract. Instead of each token claiming its top-k experts, each expert claims its top-k tokens. By construction, every expert sees the same number of tokens, which removes the need for a load-balancing loss. The price is that some tokens may not be claimed by any expert; the paper handles this and reports more than 2x faster pretraining convergence at matched compute relative to Switch top-1 and GShard top-2.

DeepSeek-V3 contributes a more recent twist: an auxiliary-loss-free strategy in which the router maintains a per-expert bias that is nudged up or down to enforce balance, without adding a loss term that competes with the cross-entropy objective. The reported benefit is that the router can specialise more freely without being penalised by a balance term.
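
The two routing directions described above can be sketched in a few lines of plain Python. This is a toy, single-token illustration under stated assumptions, not any model's actual implementation: the function names (`top_k_route`, `moe_output`, `expert_choice`) and the scaling "experts" are hypothetical, and real systems do this with batched tensor operations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_route(logits, k):
    """Token choice: return [(expert_index, weight)] for the k best experts.

    Weights are the chosen experts' softmax scores, renormalised to sum to 1.
    """
    scores = softmax(logits)
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in chosen)
    return [(i, scores[i] / norm) for i in chosen]


def moe_output(x, logits, experts, k):
    """Weighted sum of the chosen experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for i, w in top_k_route(logits, k):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


def expert_choice(score_matrix, capacity):
    """Expert choice: each expert takes its `capacity` highest-scoring tokens.

    score_matrix[t][e] is token t's score for expert e. Returns
    {expert: sorted token indices}; a token may appear in no list at all.
    """
    n_tokens, n_experts = len(score_matrix), len(score_matrix[0])
    picks = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: score_matrix[t][e], reverse=True)
        picks[e] = sorted(ranked[:capacity])
    return picks


# Toy usage: four hypothetical experts that scale the input by 1, 2, 3, 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
route = top_k_route([0.0, 2.0, 1.0, -1.0], k=2)          # picks experts 1 and 2
y = moe_output([1.0, 1.0], [0.0, 2.0, 1.0, -1.0], experts, k=2)

# Expert choice on 4 tokens and 2 experts: token 3 is claimed by nobody.
picks = expert_choice([[0.9, 0.1], [0.8, 0.7], [0.7, 0.6], [0.6, 0.05]], capacity=2)
```

In the expert-choice call every expert ends up with exactly two tokens, which is the load-balance guarantee, while one token goes unrouted, which is the cost the text describes.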

Training tricks that keep MoE stable

Load-balancing auxiliary loss

arxiv 2101.03961

Introduced in Shazeer 2017, standardised by Switch Transformer. An auxiliary term added to the cross-entropy loss that penalises the router for over- or under-using any expert. Switch Transformer uses a coefficient of 10^-2; the loss is the product of the fraction of tokens routed to each expert and the fraction of routing probability assigned to it, summed and scaled by N.
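
The loss described above can be written out directly. The following is a minimal sketch for top-1 routing, assuming `probs[t][e]` holds the router's softmax probability of expert e for token t; the function name and the toy inputs are illustrative, not taken from any codebase.

```python
def switch_balance_loss(probs, alpha=0.01):
    """Auxiliary balance loss: alpha * N * sum_i f_i * P_i.

    f_i is the fraction of tokens whose top-1 expert is i, and
    P_i is the mean router probability assigned to expert i.
    """
    n_tokens, n_experts = len(probs), len(probs[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for row in probs:
        top1 = max(range(n_experts), key=lambda e: row[e])
        f[top1] += 1.0 / n_tokens
        for e in range(n_experts):
            p[e] += row[e] / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = switch_balance_loss([[0.6, 0.4], [0.4, 0.6]])    # tokens split evenly
collapsed = switch_balance_loss([[0.9, 0.1], [0.9, 0.1]])   # everything goes to expert 0
```

With perfectly even routing the sum equals 1/N, so the loss bottoms out at alpha; any concentration on one expert pushes it higher, which is the gradient signal the router receives.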

Router z-loss

arxiv 2202.08906

Introduced in ST-MoE (Zoph et al. 2022). Penalises the squared log-sum-exp of the router logits to stop them growing unboundedly and destabilising the softmax in bfloat16. It has since been widely adopted in large MoE training stacks, though whether a given model (Mixtral, DeepSeek, Qwen, Granite) uses it or an equivalent should be confirmed against that model's technical report.
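
As a rough illustration, ST-MoE defines the z-loss as the mean over tokens of the squared log-sum-exp of the router logits, scaled by a small coefficient. The sketch below assumes `logits[t][e]` holds raw (pre-softmax) router logits; the function name is hypothetical and the default coefficient of 1e-3 should be treated as a tunable hyperparameter.

```python
import math


def router_z_loss(logits, coeff=1e-3):
    """Mean squared log-sum-exp of router logits, times coeff."""
    total = 0.0
    for row in logits:
        m = max(row)                                    # subtract max for stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return coeff * total / len(logits)


small = router_z_loss([[0.0, 0.0]])     # logits near zero: small penalty
large = router_z_loss([[10.0, 10.0]])   # same routing decision, larger logits
```

Both inputs produce the same softmax distribution, yet the second is penalised far more, which is how the term discourages logit growth without dictating which expert is chosen.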

Capacity factor and overflow

Switch Transformer §2.2

Each expert gets a fixed per-batch token budget. If more tokens want it than the budget allows, the excess is dropped (the residual stream is passed through unchanged) or rerouted. The Switch Transformer paper compares capacity factors of 1.0, 1.25, and 2.0 and finds that the lower values work well. Setting it too low drops too many tokens; setting it too high wastes memory and compute.
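
A small simulation makes the trade-off concrete. This sketch applies the capacity formula from the glossary to top-1 assignments and drops overflow tokens in arrival order; rounding the capacity up and the first-come-first-served policy are assumptions of the example, and the function names are hypothetical.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Capacity = (tokens / num_experts) * capacity_factor, rounded up here."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch_top1(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen by token t.

    Returns (kept, dropped) lists of token indices. Tokens are served in
    order; once an expert is full, later tokens that want it are dropped.
    """
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if load[e] < cap:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped


# Eight tokens, four experts; five of the tokens want expert 0.
wants = [0, 0, 0, 1, 0, 2, 3, 0]
_, dropped_tight = dispatch_top1(wants, 4, 1.0)    # capacity 2 per expert
_, dropped_mid = dispatch_top1(wants, 4, 1.25)     # capacity 3 per expert
_, dropped_loose = dispatch_top1(wants, 4, 2.0)    # capacity 4 per expert
```

Raising the factor from 1.0 to 2.0 cuts the dropped tokens from three to one in this toy batch, at the price of reserving twice the buffer space for every expert.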

Fine-grained experts

arxiv 2401.06066

DeepSeekMoE's contribution. Instead of N large experts, use 4N or 8N small experts (smaller intermediate FFN dim) and route to more of them. Empirically improves expert specialisation at the same active-parameter budget. Granularity is a knob, not a free win — see arxiv 2505.06839 for a recent analysis.
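
The combinatorial argument behind fine-grained experts is easy to reproduce. The sketch below uses the example given in the DeepSeekMoE paper (16 experts with top-2 versus each expert split into 4, giving 64 experts with top-8); the FFN width of 8192 in `active_ffn_width` is a hypothetical value chosen only to show that the active width stays constant.

```python
import math


def routing_combinations(num_experts, top_k):
    """Number of distinct expert subsets a token can be routed to."""
    return math.comb(num_experts, top_k)


def active_ffn_width(d_ff, top_k, split=1):
    """Total hidden width activated per token when each expert is split.

    Each expert shrinks to d_ff / split, and split times as many are chosen.
    """
    return (top_k * split) * (d_ff // split)


coarse = routing_combinations(16, 2)            # 120
fine = routing_combinations(16 * 4, 2 * 4)      # 4,426,165,368

same_budget = active_ffn_width(8192, 2) == active_ffn_width(8192, 2, split=4)
```

The number of possible expert combinations grows by seven orders of magnitude while the activated width is unchanged, which is the intuition for improved specialisation at a fixed active-parameter budget.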

Shared experts

DeepSeekMoE / Qwen3

One or more experts that run for every token, alongside the routed experts. The idea is that some computation is genuinely common to all tokens (basic syntax, generic semantics) and should not have to be re-learned in every routed expert. Used in DeepSeekMoE and DeepSeek-V2/V3. Qwen3 dropped shared experts in favour of more routed experts plus global-batch balancing.
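
A layer with shared experts can be sketched as a residual connection plus two sums: one over the always-on shared experts and one over the gated routed experts, following the form used in the DeepSeekMoE paper. The function name, the dictionary of gate weights, and the toy scaling experts below are all hypothetical.

```python
def moe_with_shared(x, shared_experts, routed_experts, gates):
    """Combine shared and routed experts for one token vector x.

    gates maps a routed expert index to its gate weight; experts that
    were not selected simply do not appear in the mapping.
    Output = x (residual) + sum of shared experts + gated routed experts.
    """
    out = list(x)
    for expert in shared_experts:
        out = [o + y for o, y in zip(out, expert(x))]
    for idx, w in gates.items():
        out = [o + w * y for o, y in zip(out, routed_experts[idx](x))]
    return out


# Toy experts: one shared expert (x0.5) and two routed experts (x2, x3).
shared = [lambda v: [0.5 * a for a in v]]
routed = [lambda v: [2.0 * a for a in v], lambda v: [3.0 * a for a in v]]
h = moe_with_shared([1.0, 2.0], shared, routed, gates={1: 0.25})
```

The shared expert contributes to every token regardless of the gates, so the router only has to decide about the specialised part of the computation.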

Auxiliary-loss-free balance

arxiv 2412.19437

DeepSeek-V3's contribution. Instead of an auxiliary loss term that competes with cross-entropy, maintain a per-expert bias that is adjusted online to push the router toward balanced usage. The model still uses a small sequence-wise balance loss as a safety net.
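
The mechanism can be sketched as two small functions: selection uses score plus bias, gate weights use the unbiased scores, and the bias is nudged after each step according to observed load. This is a simplified illustration that assumes positive affinity scores; the function names are hypothetical and the step size `gamma` is a hyperparameter whose value here is only a placeholder.

```python
def select_with_bias(scores, bias, k):
    """Choose top-k experts by (score + bias); weight them by score alone.

    The bias influences which experts are selected but never enters the
    gate weights, so it steers load without altering the mixture values.
    """
    n = len(scores)
    chosen = sorted(range(n), key=lambda e: scores[e] + bias[e], reverse=True)[:k]
    norm = sum(scores[e] for e in chosen)
    return {e: scores[e] / norm for e in chosen}


def update_bias(bias, load, gamma=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean = sum(load) / len(load)
    new_bias = []
    for b, count in zip(bias, load):
        if count > mean:
            new_bias.append(b - gamma)
        elif count < mean:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.9, 0.8, 0.1]
before = select_with_bias(scores, [0.0, 0.0, 0.0], k=1)    # expert 0 wins
after = select_with_bias(scores, [-0.2, 0.0, 0.0], k=1)    # bias tips it to expert 1
stepped = update_bias([0.0, 0.0, 0.0], load=[10, 2, 6], gamma=0.1)
```

Because no gradient flows through the bias, balance is enforced by a control loop outside the loss, which is the sense in which the strategy does not compete with cross-entropy.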

Why MoE matters — the inference-cost argument

The headline reason MoE is interesting is that it decouples two numbers that are coupled in dense transformers: parameters and FLOPs per token. In a dense model, every parameter participates in every forward pass. A 70B-parameter dense model performs roughly 140 GFLOPs per token per forward pass (two FLOPs per parameter, ignoring attention). Doubling the parameters doubles the inference cost. There is no escape route at serving time.

In an MoE model, the active-parameter budget per token can be held constant while total parameters grow. Mixtral 8x22B has 141B parameters but does the per-token work of a ~39B-active model. DeepSeek-V3 has 671B but does the per-token work of a ~37B-active model. The cost surfaces are: (1) HBM and storage scale with total parameters, because all of those weights still have to be held in memory somewhere, and (2) routing introduces an extra all-to-all communication step in distributed serving, which is non-trivial at high batch sizes. The Switch Transformer paper put substantial engineering effort into making that all-to-all cheap.

What MoE does not give you is free memory. A 671B-parameter MoE checkpoint is still 671B parameters of weights on disk. The inference benefit shows up in FLOPs, latency, and (with care) cost per token, not in VRAM requirements. As of June 2026, serving DeepSeek-V3 still needs roughly the same HBM as a 671B dense model of equivalent precision (multi-node, expert-parallel), but the per-token compute is in the tens-of-billions-of-active-parameters range, not the hundreds. That is the trade the architecture is built around.
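
The arithmetic behind this argument fits in one function. The sketch uses the rule of thumb from the text (two FLOPs per active parameter, attention ignored) and counts weight memory only; the function name and the choice of 2 bytes per parameter are assumptions for illustration, and KV cache and activations are left out.

```python
def serving_estimate(total_params, active_params, bytes_per_param):
    """Return (GFLOPs per token, weight memory in GB).

    Compute tracks active parameters; memory tracks total parameters.
    """
    gflops_per_token = 2 * active_params / 1e9
    weight_gb = total_params * bytes_per_param / 1e9
    return gflops_per_token, weight_gb


dense_70b = serving_estimate(70e9, 70e9, 2)      # (140.0, 140.0)
deepseek_v3 = serving_estimate(671e9, 37e9, 2)   # (74.0, 1342.0)
dense_671b = serving_estimate(671e9, 671e9, 2)   # (1342.0, 1342.0)
```

At the same precision, the MoE and the hypothetical 671B dense model need identical weight memory, while the MoE's per-token compute is about half that of the 70B dense model.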

Active vs total — the FLOPs picture

| Model | Total params | Active / token | Active / total ratio | Equivalent dense FLOPs class |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | ~13B | ~28% | ~13B dense |
| Mixtral 8x22B | 141B | ~39B | ~28% | ~39B dense |
| GLaM (full) | 1.2T | ~96.6B | ~8% | ~96.6B dense |
| DeepSeek-V3 | 671B | 37B | ~5.5% | ~37B dense |
| Qwen3-235B-A22B | 235B | 22B | ~9% | ~22B dense |
| Qwen3-30B-A3B | 30B | 3B | ~10% | ~3B dense |
| Granite 3.0-3B-A800M | 3B | 0.8B | ~27% | ~0.8B dense |

Honest caveats and known failure modes

Three things to remember when reading MoE marketing copy.

First, total-parameter counts are not directly comparable to dense-model parameter counts. A 671B MoE is not 'nearly 4x GPT-3' in any operational sense. The fair comparison is active parameters versus dense parameters at matched training tokens.

Second, MoE training is more brittle. Router z-loss and auxiliary-loss-free balance are responses to real instability and load-imbalance problems that real teams hit, and fine-grained experts add another knob to tune. The OpenMoE paper (arxiv 2402.01739) found that routing decisions get locked in extremely early and barely change afterwards: context-independent specialisation by token ID is the dominant regime. That is a finding worth absorbing before claiming an MoE has 'learned' something semantically sophisticated about its expert layout.

Third, MoE inference at low batch sizes is bandwidth-bound in unintuitive ways. Each token may need to pull weights for a different subset of experts from HBM; the FLOPs win is real, but the memory-bandwidth win is smaller and sometimes absent. Provider pricing reflects this, so check provider docs for current per-token pricing rather than reading it off active-parameter counts.

As of June 2026 this is a best-effort summary. New MoEs ship weekly; specific routing-mechanism details for vendor models should be confirmed against the official technical report or model card before relying on them in production.

Open questions, as of mid-2026

  • Is fine-grained expert segmentation a strictly better default, or does it just trade load-balance ease for routing-overhead cost? The 2025 arxiv 2505.06839 analysis says granularity boosts expressivity but flattens out; the question is open.
  • Does the OpenMoE finding (routing locks in early, mostly by token ID) generalise to large auxiliary-loss-free models like DeepSeek-V3? No public replication yet.
  • Are shared experts a permanent design feature or a transitional crutch? Qwen3 abandoned them; DeepSeek-V3 kept them. Neither paper presents a clean ablation against a matched baseline.
  • What is the right load-balance mechanism — auxiliary loss, expert choice, or bias-adjusted routing? DeepSeek-V3 argues the loss-free approach is cleaner; nobody has run a head-to-head at matched scale.
  • When does MoE stop helping? At very small total-parameter counts the routing overhead dominates. At very large counts, communication costs can dominate. The middle band where MoE wins is wider than dense-model advocates claim and narrower than MoE marketing implies.

Sources

  1. Shazeer et al. 2017 introduce the sparsely-gated mixture-of-experts layer and demonstrate greater than 1000x capacity gains over dense baselines.
     arxiv.org/abs/1701.06538

  2. Fedus, Zoph, Shazeer 2021 introduce Switch Transformer with top-1 routing, capacity factor, and a 1.6 trillion-parameter scaling demonstration.
     arxiv.org/abs/2101.03961

  3. Du et al. 2022 GLaM paper reports a 1.2T-parameter MoE activating roughly 96.6B parameters per token, with 64 experts per MoE layer and top-2 routing.
     arxiv.org/abs/2112.06905

  4. Zoph et al. 2022 ST-MoE paper introduces router z-loss as a mechanism for stabilising MoE training against router-logit blowup.
     arxiv.org/abs/2202.08906

  5. Zhou et al. 2022 introduce expert-choice routing in which experts pick top-k tokens, guaranteeing perfect load balance and reporting >2x convergence speedup over Switch top-1 and GShard top-2.
     arxiv.org/abs/2202.09368

  6. Mistral's Mixtral of Experts paper specifies 8 experts per layer, top-2 routing, ~47B total / ~13B active parameters for Mixtral 8x7B.
     arxiv.org/abs/2401.04088

  7. DeepSeekMoE paper introduces fine-grained expert segmentation and shared-expert isolation as MoE architectural primitives.
     arxiv.org/abs/2401.06066

  8. OpenMoE paper releases 650M-34B open MoE checkpoints and reports that routing decisions are predominantly context-independent and locked in early in training.
     arxiv.org/abs/2402.01739

  9. DeepSeek-V3 technical report specifies 671B total parameters, 37B active per token, auxiliary-loss-free load balancing, multi-token prediction, and 2.788M H800 GPU-hours full training.
     arxiv.org/abs/2412.19437

  10. Qwen3 technical report specifies Qwen3-235B-A22B and Qwen3-30B-A3B MoE checkpoints with 128 experts per layer, top-8 routing, no shared expert, global-batch balance loss.
     arxiv.org/abs/2505.09388

  11. IBM Granite 3.0 announcement specifies Granite-3.0-3B-A800M and Granite-3.0-1B-A400M as the MoE entries with 800M and 400M active parameters respectively, trained on 10T tokens.
     ibm.com/new/announcements/ibm-granite-3-0-open-state-of-the-art-enterprise-models

  12. Mixtral 8x22B reference card confirms 141B total parameters with approximately 39B active per token via top-2 routing across 8 experts.
     huggingface.co/mistral-community/Mixtral-8x22B-v0.1

  13. Google's GLaM announcement reports the model uses one-third of GPT-3's training energy and half its inference FLOPs while exceeding GPT-3 on average across 29 NLP tasks.
     research.google/blog/more-efficient-in-context-learning-with-glam/

  14. A 2025 analysis paper finds that expert granularity boosts MoE expressivity but with diminishing returns past a certain split ratio.
     arxiv.org/abs/2505.06839

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0ORANGEBOX v1.0.0-beta · TURBO-OPTIMIZE CLAUDE · SHIPPED 2026-05-30B00KMAKR v3.2.0 · AI PUBLISHING COCKPIT · MAC + WINDOWSFREE LAUNCH WEEK · ENDS JUNE 6 · §4A NO-SAAS LOCKFOUNDER'S VIEW · NEXT BROADCAST IN ...CITE THE WORK · FORWARD THE LINK · NO ALGORITHM