Last reviewed: May 24, 2026

What is MoE (Mixture of Experts)? Definition and business implications

Mixture of Experts (MoE) is an AI model architecture that splits the network into specialised sub-models, called experts. For each token processed, a router dynamically selects a few experts and leaves the others inactive. The result is a model with the capacity of a large network but the per-token compute cost of a much smaller one.

Mixture of experts was popularised for transformers by Fedus, Zoph, and Shazeer in 2021 (Switch Transformer, Google). The idea: instead of activating the entire network for every token, a lightweight router directs each token to a subset of specialised experts, typically 1 to 8 out of the dozens available. Consequence: one can train and deploy models with hundreds of billions of parameters while mobilising only a fraction of them on each token, from roughly 5% to 30% in the examples below. In 2026, nearly all frontier models are reported to use an MoE architecture, with the notable reported exception of Claude Opus 4.7. GPT-4 reportedly totals 1.76 trillion parameters split across 16 experts, figures that OpenAI has not confirmed. DeepSeek V3 has 671 billion total parameters, of which 37 billion are active per token. Mixtral 8x22B (Mistral) has 141 billion total parameters, of which 39 billion are active. The competitive advantage of MoE is clear enough that it has become an architectural standard for very-large-scale models.
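The routing step described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any model named here: the function names, the toy scalar "experts", and the choice to renormalise the weights of the selected experts are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # assumption: renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_scores, top_k=2):
    """Weighted sum of the selected experts' outputs; the other experts are never called."""
    return sum(w * experts[i](token) for i, w in route_token(router_scores, top_k))

# Hypothetical toy setup: 8 "experts" that each scale a scalar token differently.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in experts]  # stand-in for a learned router's output

selected = route_token(scores, top_k=2)
output = moe_layer(1.0, experts, scores, top_k=2)
print(selected, output)
```

Only 2 of the 8 toy experts run for this token, which is the source of the compute saving: the parameters of the other 6 exist but cost nothing for this token.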

Concrete example

DeepSeek V3 illustrates the economic advantage of MoE. The model has 671 billion parameters in total, but activates only 37 billion per token, that is, about 5.5% of the whole. Result: it reaches performance comparable to GPT-4 on most public benchmarks, at a reported cost of about 5.6 million dollars for the final training run (GPU time only, according to the technical report), against an estimated 78 to 100 million for GPT-4. At inference, cost per token is about half that of GPT-4 according to public pricing. This efficiency explains the massive adoption of MoE in 2025-2026: almost all frontier models are reported to have adopted it, except at Anthropic, which reportedly maintains a dense architecture for Claude Opus 4.7.
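The share of parameters active per token can be checked with simple arithmetic. The sketch below uses only the parameter counts quoted in this entry; the function name and dictionary are illustrative.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_params_b / total_params_b

# (total, active) in billions of parameters, as quoted in this entry.
models = {"DeepSeek V3": (671, 37), "Mixtral 8x22B": (141, 39)}

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token")
# DeepSeek V3: 5.5% of parameters active per token
# Mixtral 8x22B: 27.7% of parameters active per token
```

The ratio varies widely from one MoE model to another, so it is worth computing for each model rather than assuming a single typical figure.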

To ask your IT director

“Is the model we use dense or MoE, and do we know why it matters?” This question, put to your IT director or AI provider, often reveals a blind spot in the technical decision. Three follow-up questions are worth asking. First, does the quoted inference cost per token reflect the parameters actually activated (for example, about 5% of them in an MoE model) or the total size? Without this distinction, price comparisons are misleading. Second, has the chosen MoE model been load-tested for stability? A well-documented failure mode, routing collapse, can concentrate most of the traffic on a few experts and degrade average quality. Third, does our deployment have the VRAM the model needs? All experts must stay loaded even though only a few are active per token, so an MoE model with 671 billion parameters needs several hundred gigabytes of memory for its weights alone (about 671 GB at one byte per parameter).
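The memory question lends itself to a back-of-the-envelope estimate. The sketch below counts weights only and assumes common storage precisions; it ignores activations, the KV cache, and runtime overhead, so real requirements are higher. The names and the precision table are illustrative assumptions.

```python
# Assumed storage cost per parameter for common numeric precisions.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8/int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params_billions, bytes_per_param):
    """Memory for weights alone, in GB (1 GB = 1e9 bytes).

    Uses TOTAL parameters, not active ones: every expert must be loaded,
    whichever experts the router ends up selecting.
    """
    return total_params_billions * bytes_per_param

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"671B parameters at {precision}: ~{weight_memory_gb(671, nbytes):.0f} GB")
```

The point for a deployment discussion is that sparse activation reduces compute per token, not the memory needed to hold the model.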

Sources

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Fedus, Zoph & Shazeer, arXiv:2101.03961, 2021. https://arxiv.org/abs/2101.03961 (accessed 2026-05-24)
DeepSeek-V3 Technical Report, DeepSeek-AI, arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437 (accessed 2026-05-24)
