What is a mixture-of-experts model?

The short answer

A mixture-of-experts (MoE) model is a neural network that routes each input token to a small subset of specialized sub-networks called “experts,” instead of running every parameter for every token. This lets models like Mixtral 8x7B and DeepSeek-V3 hold tens to hundreds of billions of total parameters while only activating a fraction per forward pass, cutting compute cost without shrinking capacity.

The longer answer

A mixture-of-experts (MoE) model is a conditional-computation architecture in which a learned gating network decides, per input, which of several parallel expert sub-networks should process that input. The idea predates modern deep learning: Jacobs, Jordan, Nowlan, and Hinton introduced “Adaptive Mixtures of Local Experts” in Neural Computation in 1991. It became the dominant scaling strategy for large language models after Shazeer et al.’s “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (arXiv:1701.06538, 2017), which showed that LSTM-based models with up to 137 billion parameters could be trained on GPU clusters by activating only a few of their thousands of experts per example.

The mechanism is simple. A standard Transformer feed-forward layer is replaced by N independent feed-forward “experts,” typically 8 to 256 of them. A small gating function (usually a learned linear projection followed by softmax) scores the experts for each token, and a top-k router selects the k highest-scoring experts (commonly k=1 or k=2). Only those k experts run; their outputs are weighted by the gate scores and summed. Because k is much smaller than N, the active parameter count per token is far below the total parameter count. Mistral AI’s Mixtral 8x7B (arXiv:2401.04088, January 2024), for example, has 46.7B total parameters but only 12.9B active per token because k=2 of 8 experts fire per layer.
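The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, not any particular model's implementation: the names gate_scores, top_k_route, and moe_layer are hypothetical, the experts are toy functions standing in for feed-forward networks, and renormalizing the gate weights over the selected k experts is an assumption (implementations differ on this detail).

```python
import math

def gate_scores(x, w_gate):
    """Linear gate: one logit per expert (w_gate has one row per expert)."""
    return [sum(w * v for w, v in zip(row, x)) for row in w_gate]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the top k
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, w_gate, experts, k=2):
    """Run only the k selected experts and sum their gate-weighted outputs."""
    out = [0.0] * len(x)
    for i, weight in top_k_route(gate_scores(x, w_gate), k):
        y = experts[i](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): 4 experts that scale their input by 1, 2, 3, 4.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
w_gate = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
x = [1.0, 2.0]

routes = top_k_route(gate_scores(x, w_gate), k=2)
output = moe_layer(x, w_gate, experts, k=2)
```

For this token the gate logits are 0, 2, 1, and -1, so only experts 1 and 2 run; experts 0 and 3 contribute no compute, which is where the savings come from.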

This sparsity is the entire point. Dense models pay for every parameter on every token; MoE models pay only for the experts the router picks, so they accept some routing overhead and a larger memory footprint in exchange for a dramatically lower FLOPs-per-token cost. Google’s GShard (arXiv:2006.16668, 2020) scaled this to a 600B-parameter translation model, and Switch Transformer (arXiv:2101.03961, 2021) showed that even k=1 routing, the simplest possible MoE, could pretrain up to 7x faster than a dense T5 baseline at matched compute.
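As a rough illustration of why sparsity lowers per-token cost, Mixtral's two published figures (46.7B total, 12.9B active, top-2 of 8) can be split into a shared part and a per-expert part. The sketch assumes every parameter is either shared (attention, embeddings, router) or belongs to exactly one expert, and that all experts are the same size. split_params is a hypothetical helper, and the results are back-of-envelope estimates, not figures from the paper.

```python
def split_params(total_b, active_b, n_experts, k):
    """Solve  total  = shared + n_experts * per_expert
              active = shared + k * per_expert
    for (shared, per_expert), all in billions of parameters."""
    per_expert = (total_b - active_b) / (n_experts - k)
    shared = active_b - k * per_expert
    return shared, per_expert

# Figures quoted in the text for Mixtral 8x7B.
shared_b, per_expert_b = split_params(46.7, 12.9, n_experts=8, k=2)
active_fraction = 12.9 / 46.7

print(f"shared ~{shared_b:.1f}B, per expert ~{per_expert_b:.1f}B")
print(f"active fraction per token ~{active_fraction:.0%}")
```

Under these assumptions each token touches a little over a quarter of the weights, which is why the name “8x7B” overstates both the total (the experts share attention layers) and the per-token cost.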

The hard problems in MoE are load balancing and routing instability. If the gate sends most tokens to a handful of popular experts, the others starve and never learn; if the routing decision is too noisy, training diverges. Switch Transformer simplified the auxiliary balancing losses of Shazeer et al. into a single load-balancing loss that penalizes uneven expert utilization. Expert Choice Routing (arXiv:2202.09368, 2022) inverted the formulation so that experts pick tokens rather than tokens picking experts, which guarantees balance by construction. DeepSeek-V3 (arXiv:2412.19437, December 2024) pushed this further with an auxiliary-loss-free balancing scheme; it has 671B total / 37B active parameters across 256 routed experts plus 1 shared expert per layer.
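The two balancing approaches can be sketched on a toy batch. The first function follows the form of the load-balancing loss described in the Switch Transformer paper, alpha * N * sum(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. The second is a simplified take on expert-choice selection in which each expert keeps its highest-scoring tokens up to a fixed capacity. Function names, the default alpha, and the toy inputs are assumptions for illustration, and neither function is taken from a reference implementation.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """router_probs[t][i] = router probability of expert i for token t."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        counts[max(range(n_experts), key=lambda i: probs[i])] += 1  # top-1 pick
    f = [c / n_tokens for c in counts]
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

def expert_choice_route(router_probs, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.
    A token may be picked by several experts or by none."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    return [
        sorted(range(n_tokens), key=lambda t: router_probs[t][e],
               reverse=True)[:capacity]
        for e in range(n_experts)
    ]

balanced = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
skewed = [[0.9, 0.1]] * 4  # every token prefers expert 0

loss_balanced = switch_aux_loss(balanced)
loss_skewed = switch_aux_loss(skewed)
assignments = expert_choice_route(skewed, capacity=2)
```

The loss is smallest when tokens are spread evenly and grows as routing collapses onto one expert. With expert choice, every expert processes exactly `capacity` tokens even on the skewed batch, at the price that some tokens may receive no expert at all.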

MoE has costs too. Total parameter count drives memory and inter-GPU communication, not just compute, so MoE models need high-bandwidth interconnects (NVLink, InfiniBand) and expert parallelism (sharding experts across devices) to train and serve efficiently. Mixtral 8x7B requires roughly 90GB of VRAM in bf16 for its weights alone, despite its 13B active footprint. The router is also a single point of failure: a poorly trained gate produces “dead experts” that never fire, wasting capacity.
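The memory figure follows from simple arithmetic: bf16 stores each parameter in 2 bytes, and every expert must be resident even though only some run per token. The sketch below counts weights only and ignores the KV cache, activations, and framework overhead; weight_memory_gb is a hypothetical helper, and the parameter counts are the ones quoted above.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Weights-only memory in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes/param."""
    return params_billions * 1e9 * bytes_per_param / 1e9

total_gb = weight_memory_gb(46.7)   # all experts must be loaded
active_gb = weight_memory_gb(12.9)  # weights actually used per token

print(f"resident weights: ~{total_gb:.1f} GB, touched per token: ~{active_gb:.1f} GB")
```

The gap between the two numbers is the core serving trade-off: compute scales with the active parameters, while memory scales with the total.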

As of 2025–2026, MoE is the default for frontier open-weight models. Mixtral 8x22B, DeepSeek-V3, Qwen3-235B-A22B, Llama 4 Maverick (17B active / 400B total, 128 experts), and Grok-1 (314B total, 2 of 8 experts per token) are all sparse MoE Transformers. Closed models are widely reported to use MoE as well: GPT-4 has been described as a 16-expert MoE in several unofficial reports, though OpenAI has not confirmed the architecture.

Key facts

The original MoE formulation appeared in Jacobs, Jordan, Nowlan, and Hinton, "Adaptive Mixtures of Local Experts," Neural Computation 3(1):79–87, 1991.
The sparsely-gated MoE layer that enabled modern LLM scaling is Shazeer et al., 2017 (arXiv:1701.06538).
Mixtral 8x7B has 46.7B total parameters and 12.9B active parameters per token via top-2 routing over 8 experts (arXiv:2401.04088).
Switch Transformer uses top-1 routing and achieved up to a 7x pretraining speedup over T5-Base with the same computational resources (arXiv:2101.03961).
GShard scaled MoE to 600B parameters across 2048 TPU cores for multilingual translation (arXiv:2006.16668).
DeepSeek-V3 is a 671B-parameter MoE with 37B active per token, 256 routed experts, and an auxiliary-loss-free load balancing strategy (arXiv:2412.19437).
The single-term auxiliary load-balancing loss that prevents expert starvation was introduced in Switch Transformer, simplifying the earlier balancing losses of Shazeer et al. (Fedus, Zoph, Shazeer, JMLR 2022).
Expert Choice Routing inverts the gating direction so experts select tokens, guaranteeing balanced utilization (Zhou et al., arXiv:2202.09368).
Llama 4 Maverick is a 400B-total / 17B-active MoE with 128 experts plus a shared expert, released April 2025 (Meta AI model card).
MoE serving requires expert parallelism because total parameter count, not active count, determines memory footprint — Mixtral 8x7B needs ~90GB VRAM in bf16 (Mistral AI release notes, December 2023).
