What is MoE (Mixture of Experts)? Definition and business implications

Mixture of Experts (MoE) is an AI model architecture that splits the network into specialised sub-networks, called experts. For each token processed, a router dynamically selects a few experts and leaves the others inactive. The result is a model with the parameter capacity of a large network but a per-token compute cost closer to that of a much smaller one.

The mixture-of-experts concept was formalised for transformers by Fedus, Zoph, and Shazeer in 2021 (Switch Transformer, Google). The idea: instead of activating the entire network for every token, a lightweight router directs each token to a small subset of specialised experts, typically 1 to 8 out of the dozens or hundreds available. Consequence: one can train and deploy models with hundreds of billions of parameters while only a fraction of them, from roughly 5% to 30% in the examples below, is used for each token.

In 2026, nearly all frontier models use an MoE architecture, with the notable exception of Claude Opus 4.7. GPT-4 reportedly totals 1.76 trillion parameters split across 16 experts (a figure OpenAI has not confirmed). DeepSeek V3: 671 billion total parameters, 37 billion active per token. Mixtral 8x22B (Mistral): 141 billion total, 39 billion active. The competitive advantage of MoE is clear enough that it has become an architectural standard for very-large-scale models.
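The routing step described above can be sketched in a few lines of Python. This is a toy illustration, not the implementation of any model named in this entry: the function names (route_token, moe_layer, make_expert), the random router weights, and the "experts" that simply scale their input are all hypothetical. It also assumes that gate weights are renormalised over the selected experts, which is one common choice among several.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, top_k):
    """Return [(expert_index, gate), ...] for the top_k highest-scoring experts."""
    # One score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Assumption: renormalise the gates over the selected experts only.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_weights, experts, top_k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = route_token(token, router_weights, top_k)
    output = [0.0] * len(token)
    for index, gate in routes:
        expert_output = experts[index](token)  # unselected experts are never called
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output, [index for index, _ in routes]


def make_expert(scale):
    """Toy expert: scales its input. Real experts are feed-forward networks."""
    return lambda token: [scale * x for x in token]


rng = random.Random(0)
NUM_EXPERTS, DIM = 8, 4
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, active = moe_layer(token, router_weights, experts, top_k=2)
print(f"active experts: {active} of {NUM_EXPERTS}")
```

Only two of the eight toy experts run for this token; the other six cost nothing at inference, which is the source of the compute savings. In a trained model the router weights are learned jointly with the experts rather than drawn at random.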

Concrete example

DeepSeek V3 illustrates the economic advantage of MoE. The model has 671 billion parameters in total but activates only 37 billion per token, about 5.5% of the whole. Direct consequence: it achieves performance comparable to GPT-4 on most public benchmarks while costing about 5.6 million dollars to train (the figure DeepSeek reports for the final training run, excluding prior research and experiments), against an estimated 78 to 100 million for GPT-4. At inference, cost per token is about half that of GPT-4 according to public pricing. This efficiency explains the broad adoption of MoE in 2025-2026: almost all frontier models have adopted it, except Anthropic, which maintains a dense architecture for Claude Opus 4.7.
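The share of parameters active per token is a simple division, and it can be checked for the models whose total and active counts are cited in this entry. The sketch below is only that arithmetic; the MODELS table and the active_fraction helper are illustrative names, and the ratio is a rough proxy for per-token compute, not a measured cost.

```python
# (total parameters, active parameters per token), in billions, as cited in this entry.
MODELS = {
    "DeepSeek V3": (671, 37),
    "Mixtral 8x22B": (141, 39),
}


def active_fraction(total_billions, active_billions):
    """Fraction of the model's parameters used for a single token."""
    return active_billions / total_billions


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")

# DeepSeek V3: 5.5% of parameters active per token
# Mixtral 8x22B: 27.7% of parameters active per token
```

The spread between the two models shows that the active share depends heavily on design choices such as the number of experts and how many are selected per token.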

Further reading

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Fedus et al., 2021 (external resource)

Sources

  1. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Fedus, Zoph & Shazeer, arXiv:2101.03961, 2021. https://arxiv.org/abs/2101.03961 (accessed 2026-05-24)
  2. DeepSeek-V3 Technical Report, DeepSeek-AI, arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437 (accessed 2026-05-24)
