What is a mixture-of-experts model?

The short answer

A mixture-of-experts (MoE) model is a neural network that routes each input token to a small subset of specialized sub-networks called “experts,” instead of running every parameter for every token. This lets models like Mixtral 8x7B and DeepSeek-V3 hold tens to hundreds of billions of total parameters while activating only a fraction of them per forward pass, cutting compute cost without shrinking capacity.

The longer answer

A mixture-of-experts (MoE) model is a conditional-computation architecture in which a learned gating network decides, per input, which of several parallel expert sub-networks should process that input. The idea predates modern deep learning: Jacobs, Jordan, Nowlan, and Hinton introduced “Adaptive Mixtures of Local Experts” in Neural Computation in 1991. It became a central scaling strategy for large language models after Shazeer et al.’s “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (arXiv:1701.06538, 2017), which showed that MoE layers with up to 137 billion parameters, placed between stacked LSTM layers, could be trained on GPU clusters by activating only a few of their thousands of experts per example.

The mechanism is simple. A standard Transformer feed-forward layer is replaced by N independent feed-forward “experts,” typically 8 to 256 of them. A small gating function (usually a learned linear projection followed by softmax) scores the experts for each token, and a top-k router selects the k highest-scoring experts (commonly k=1 or k=2). Only those k experts run; their outputs are weighted by the gate scores and summed. Because k is much smaller than N, the active parameter count per token is far below the total parameter count. Mistral AI’s Mixtral 8x7B (arXiv:2401.04088, January 2024), for example, has 46.7B total parameters but only 12.9B active per token because k=2 of 8 experts fire per layer.
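The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (`top_k_route`, `moe_layer`), the toy experts, and the gate weights are all hypothetical, and renormalizing the gate scores over the selected k experts is one common choice that is assumed here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts so they sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, gate_weights, experts, k=2):
    """Run only the k selected experts and sum their gate-weighted outputs."""
    # The gate is a linear projection: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    out = [0.0] * len(token)
    for idx, weight in top_k_route(logits, k):
        y = experts[idx](token)  # experts that were not selected never run
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy setup (hypothetical): 4 "experts" that just scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
token = [1.0, 0.0]

routes = top_k_route([0.0, 1.0, 2.0, -1.0], k=2)
output = moe_layer(token, gate_weights, experts, k=2)
print(routes)
print(output)
```

With k=2 of 4 experts, only two of the four expert functions are called for this token, which is the source of the compute savings.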

This sparsity is the entire point. Dense models pay for every parameter on every token; MoE models pay only for the experts the router picks, so they accept some routing overhead and a larger memory footprint in exchange for a much lower FLOPs-per-token cost. Google’s GShard (arXiv:2006.16668, 2020) scaled this to a 600B-parameter translation model, and Switch Transformer (arXiv:2101.03961, 2021) showed that even k=1 routing, the simplest possible MoE, could pretrain up to 7x faster than a dense T5-Base baseline with the same computational resources.
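The gap between total and active parameters can be worked out from the Mixtral 8x7B figures quoted earlier (46.7B total, 12.9B active, top-2 of 8 experts). The sketch below assumes a simplified model in which every parameter is either shared by all tokens or belongs to exactly one of N equally sized experts; the function name is hypothetical.

```python
def split_shared_and_expert(total_b, active_b, n_experts, k):
    """Solve the two-equation system (all values in billions of parameters):

        total  = shared + n_experts * expert
        active = shared + k * expert
    """
    expert_b = (total_b - active_b) / (n_experts - k)
    shared_b = active_b - k * expert_b
    return shared_b, expert_b

shared, expert = split_shared_and_expert(46.7, 12.9, n_experts=8, k=2)
active_fraction = 12.9 / 46.7

print(f"per-expert parameters: {expert:.2f}B")   # 5.63B
print(f"shared parameters:     {shared:.2f}B")   # 1.63B
print(f"active fraction:       {active_fraction:.1%}")  # 27.6%
```

Under these assumptions each token touches a little over a quarter of the weights, even though all of them must be stored.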

The hard problems in MoE are load balancing and routing instability. If the gate sends most tokens to a handful of popular experts, the others starve and never learn; if the routing decision is too noisy, training diverges. Shazeer et al. and GShard already used auxiliary balancing losses, and Switch Transformer simplified them into a single auxiliary load-balancing loss that penalizes uneven expert utilization. Expert Choice Routing (arXiv:2202.09368, 2022) inverted the formulation: experts pick tokens rather than tokens picking experts, which guarantees balance by construction. DeepSeek-V3 (arXiv:2412.19437, December 2024) went further with an auxiliary-loss-free balancing scheme and 671B total / 37B active parameters across 256 routed experts plus 1 shared expert per layer.

MoE has costs too. Total parameter count, not active count, drives memory and inter-GPU communication, so MoE models need high-bandwidth interconnects (NVLink, InfiniBand) and expert parallelism (sharding experts across devices) to train and serve efficiently. Mixtral 8x7B requires roughly 90GB of VRAM in bf16 despite its 13B active footprint. The router is also a single point of failure: a poorly trained gate produces “dead experts” that never fire, wasting capacity.
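The memory figure follows from simple arithmetic on the parameter counts. The sketch below counts weights only, assuming 2 bytes per parameter for bf16 and decimal gigabytes; it deliberately ignores the KV cache, activations, and framework overhead, so real deployments need more. The function name is hypothetical.

```python
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    """Weights-only memory in decimal GB: (billions of params) * (bytes per param)."""
    return n_params_billion * bytes_per_param

total_gb = weight_memory_gb(46.7)   # every expert must be resident in memory
active_gb = weight_memory_gb(12.9)  # what a dense model of the active size would need

print(f"Mixtral 8x7B weights in bf16: {total_gb:.1f} GB")   # 93.4 GB
print(f"Dense 12.9B model in bf16:    {active_gb:.1f} GB")  # 25.8 GB
```

The model computes like a roughly 13B network but has to be stored like a roughly 47B one, which is why serving typically spreads experts across several devices.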

As of 2025–2026, MoE is the default for frontier open-weight models. Mixtral 8x22B, DeepSeek-V3, Qwen3-235B-A22B, Llama 4 Maverick (17B active / 400B total, 128 experts), and Grok-1 (314B total, 2 of 8 experts per token) are all sparse MoE Transformers. Closed models are widely reported to use MoE as well: GPT-4 has been described as a 16-expert MoE in several unofficial analyses, though OpenAI has not confirmed the architecture.

Key facts

  • The original MoE formulation appeared in Jacobs, Jordan, Nowlan, and Hinton, "Adaptive Mixtures of Local Experts," Neural Computation 3(1):79–87, 1991.
  • The sparsely-gated MoE layer that enabled modern LLM scaling is Shazeer et al., 2017 (arXiv:1701.06538).
  • Mixtral 8x7B has 46.7B total parameters and 12.9B active parameters per token via top-2 routing over 8 experts (arXiv:2401.04088).
  • Switch Transformer uses top-1 routing and achieved up to a 7x pretraining speedup over T5-Base with the same computational resources (arXiv:2101.03961).
  • GShard scaled MoE to 600B parameters across 2048 TPU cores for multilingual translation (arXiv:2006.16668).
  • DeepSeek-V3 is a 671B-parameter MoE with 37B active per token, 256 routed experts, and an auxiliary-loss-free load balancing strategy (arXiv:2412.19437).
  • The auxiliary load-balancing loss that prevents expert starvation, building on earlier balancing losses in Shazeer et al. (2017) and GShard, was simplified in Switch Transformer (Fedus, Zoph, Shazeer, JMLR 2022).
  • Expert Choice Routing inverts the gating direction so experts select tokens, guaranteeing balanced utilization (Zhou et al., arXiv:2202.09368).
  • Llama 4 Maverick is a 400B-total / 17B-active MoE with 128 experts plus a shared expert, released April 2025 (Meta AI model card).
  • MoE serving requires expert parallelism because total parameter count, not active count, determines memory footprint — Mixtral 8x7B needs ~90GB VRAM in bf16 (Mistral AI release notes, December 2023).
