What is MoE (Mixture of Experts)? Definition and business implications
Mixture of Experts (MoE) is an AI model architecture that splits part of the network into specialised sub-networks, called experts. For each token processed, a router dynamically selects a few experts and leaves the others inactive. The result is a model with the total parameter count of a very large model but a per-token compute cost closer to that of a much smaller one.
The mixture-of-experts concept predates transformers; it was scaled to trillion-parameter transformers by Fedus, Zoph, and Shazeer in 2021 (Switch Transformer, Google). The idea: instead of activating the entire network for every token, a lightweight router directs each token to a small subset of specialised experts, typically 1 to 8 out of the dozens or hundreds available. Consequence: one can train and deploy models with hundreds of billions of parameters while using only a fraction of them on each token, from roughly 5% to 30% in the examples that follow. In 2026, nearly all frontier models are reported to use an MoE architecture, with Claude Opus 4.7 cited as the notable exception. GPT-4 reportedly totals 1.76 trillion parameters split across 16 experts, a figure OpenAI has not confirmed. DeepSeek V3: 671 billion total parameters, 37 billion active per token. Mixtral 8x22B (Mistral): 141 billion total, 39 billion active. The efficiency advantage of MoE is clear enough that it has become an architectural standard for very-large-scale models.
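The routing step described above can be sketched in a few lines of Python. This is a toy illustration, not the implementation of any model named here: the function names (`route_token`, `moe_layer`, `make_expert`), the random router weights, and the "experts" that merely scale their input are all assumptions made for the example. Real systems use learned feed-forward experts and add load-balancing terms that are omitted here.

```python
import math
import random

def softmax(scores):
    """Turn raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def make_expert(scale):
    """Hypothetical expert: just scales the token vector."""
    return lambda x: [scale * v for v in x]

def moe_layer(token, experts, router_weights, k=2):
    """Run only the k selected experts and mix their outputs."""
    # Router: one linear score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    selected = route_token(logits, k)
    output = [0.0] * len(token)
    for idx, weight in selected:
        expert_out = experts[idx](token)  # the other experts are never called
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [idx for idx, _ in selected]

random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(s + 1) for s in range(num_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_layer(token, experts, router_weights, k=2)
print(f"experts used: {used} out of {num_experts}")
```

With `k=2` and 8 experts, only a quarter of the expert parameters take part in processing this token, which is where the compute saving comes from.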
Concrete example
DeepSeek V3 illustrates the economic advantage of MoE. The model has 671 billion parameters in total, but activates only 37 billion per token, that is, about 5.5% of the whole. Direct consequence: it achieves performance comparable to GPT-4 on most public benchmarks, while its final training run cost about 5.6 million dollars according to the DeepSeek-V3 technical report (a figure that excludes prior research and ablation experiments), against an estimated 78 to 100 million dollars for GPT-4. At inference, cost per token is about half that of GPT-4 according to public pricing. This efficiency explains the massive adoption of MoE in 2025-2026: almost all frontier models are reported to have adopted it, except Anthropic, which is said to maintain a dense architecture for Claude Opus 4.7.
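The arithmetic behind these figures is easy to check. The short sketch below recomputes the share of active parameters from the totals quoted in this article, and the training cost from the GPU-hour count and the assumed rental rate of 2 dollars per H800 GPU-hour given in the DeepSeek-V3 technical report. The helper name `active_fraction` is illustrative only.

```python
# (total parameters, active parameters per token), as quoted in this article
models = {
    "DeepSeek V3": (671e9, 37e9),
    "Mixtral 8x22B": (141e9, 39e9),
}

def active_fraction(total, active):
    """Share of the model's parameters used for a single token."""
    return active / total

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} active per token")
# DeepSeek V3: 5.5% active per token
# Mixtral 8x22B: 27.7% active per token

# Training cost as computed in the DeepSeek-V3 technical report:
# 2.788 million H800 GPU-hours at an assumed 2 dollars per GPU-hour.
gpu_hours = 2.788e6
dollars_per_gpu_hour = 2.0
training_cost = gpu_hours * dollars_per_gpu_hour
print(f"Estimated training cost: ${training_cost / 1e6:.3f}M")
# Estimated training cost: $5.576M
```

The gap between the two ratios shows that "MoE" covers quite different trade-offs: DeepSeek V3 activates about one parameter in eighteen, Mixtral 8x22B more than one in four.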
Further reading
Switch Transformers: Scaling to Trillion Parameter Models, Fedus et al., 2021
Sources
- Switch Transformers: Scaling to Trillion Parameter Models, Fedus, Zoph & Shazeer, arXiv:2101.03961, 2021. https://arxiv.org/abs/2101.03961
- DeepSeek-V3 Technical Report, DeepSeek-AI, arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437