Mixture of Experts


How sparse models achieve massive capacity with fixed compute cost

8 Experts · Top-2 Routing · 4x Capacity · Same FLOPs

Dense vs Sparse — The Key Insight

A dense model activates every parameter for every token. An MoE model activates only a fraction, achieving more capacity with the same compute.

Dense Feed-Forward Network

[Diagram] Input token → single large FFN (N parameters, ALL active, 100% utilization per token) → output.
N params total = N params active

Mixture of Experts FFN

[Diagram] Input token → router (gating network) → eight expert FFNs (E0–E7) → top-2 selected → weighted sum → output.
8N params total, only 2N active
4x more parameters than a compute-matched dense FFN, at the same per-token cost
Key Insight: MoE gives the model a larger "brain" (more total parameters to store knowledge) while keeping per-token computation fixed. Each token consults only 2 of the 8 expert sub-networks, so per-token compute, and hence throughput, matches that of a dense model with just 2N parameters.
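The parameter-versus-compute accounting above can be sketched numerically. The layer sizes below (d_model, d_ff) are illustrative, not taken from any specific model:

```python
# Back-of-the-envelope accounting for dense vs MoE FFN.
# d_model and d_ff are illustrative sizes, not from a specific model.
d_model, d_ff = 4096, 14336
ffn_params = 2 * d_model * d_ff          # up-projection + down-projection

n_experts, top_k = 8, 2
moe_total = n_experts * ffn_params       # 8N parameters stored
moe_active = top_k * ffn_params          # only 2N touched per token

# Per-token compute matches a dense FFN of moe_active parameters,
# while the model stores moe_total parameters' worth of knowledge.
print(moe_total // moe_active)           # → 4
```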

Inside the MoE Layer

Walk through each step of the Mixture of Experts forward pass, from token arrival to weighted output.

Step 1: Token Arrives

A token embedding vector enters the MoE layer, ready to be routed to the best experts.

[Diagram] Token embedding x = [0.3, -0.7, 1.2, 0.1, -0.4, 0.8, ...] (d_model = 4096) enters the layer; the eight expert FFNs E0–E7 sit downstream, waiting for the router's decision.
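The router itself is just a small linear layer followed by a softmax and a top-k selection. A minimal sketch in plain Python; the toy token and router weight matrix are invented for illustration (a real d_model would be e.g. 4096):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_top2(token, w_router):
    # One logit per expert: the dot product of the token with that
    # expert's row of the router weight matrix.
    logits = [sum(t * w for t, w in zip(token, row)) for row in w_router]
    probs = softmax(logits)
    # Keep the two highest-probability experts...
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # ...and renormalize their gate values so they sum to 1.
    z = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / z) for i in top2]

# Toy example: 4-dim token, 8 experts (weights are made up).
token = [1.0, 0.0, -1.0, 0.5]
w_router = [[(i + 1) * 0.1] * 4 for i in range(8)]
print(route_top2(token, w_router))  # two (expert index, gate weight) pairs
```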

Token Routing in Action

Watch how a sequence of tokens is distributed across experts. Each token picks its top-2 experts via the router.

[Animation] Tokens stream in one at a time; a bar chart shows the expert load distribution, i.e. how many tokens each expert receives.
Load Balancing Problem: Notice how some experts receive many tokens while others receive few. This imbalance wastes capacity and can bottleneck training throughput.
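The collapse is easy to reproduce: give one expert even a modest head start in its routing logits and it soaks up most of the traffic. A toy simulation; the Gaussian logits and the bias value are invented for illustration:

```python
import random
random.seed(0)

N_EXPERTS, TOP_K, N_TOKENS = 8, 2, 100
loads = [0] * N_EXPERTS
for _ in range(N_TOKENS):
    logits = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
    logits[0] += 2.0   # expert 0 starts slightly "better" than the rest
    top2 = sorted(range(N_EXPERTS), key=lambda i: logits[i],
                  reverse=True)[:TOP_K]
    for e in top2:
        loads[e] += 1

print(loads)  # expert 0's count dwarfs the others
```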

Load Balancing & Auxiliary Loss

Without intervention, the router collapses to a few "favorite" experts. An auxiliary loss term encourages uniform routing.

Without Auxiliary Loss

Router collapses to 2-3 experts. Most experts are underutilized.

With Auxiliary Loss

Aux loss encourages balanced routing across all experts.

Auxiliary load-balancing loss:

L_aux = α · N · Σ_{i=1..N} f_i · P_i

f_i = fraction of tokens routed to expert i
P_i = average routing probability the router assigns to expert i
N = number of experts

Why it works: f_i is non-differentiable, but it tracks P_i, so the loss behaves like Σ_i P_i², which is minimized when routing is perfectly uniform (f_i = P_i = 1/N for all i). Any imbalance increases the loss.
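The loss is cheap to compute from routing statistics. A minimal sketch, using top-1 assignments for simplicity; α = 0.01 is a typical but arbitrary choice, and the inputs are hypothetical:

```python
def aux_loss(assignments, probs, n_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i.

    assignments: chosen expert index per token (top-1 for simplicity)
    probs:       per-token softmax distribution over the experts
    """
    T = len(probs)
    f = [assignments.count(i) / len(assignments) for i in range(n_experts)]
    P = [sum(p[i] for p in probs) / T for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Perfectly uniform routing: each expert gets 1/8 of tokens and 1/8 probability.
uniform = aux_loss(list(range(8)), [[1 / 8] * 8] * 8, 8)
# Total collapse: every token goes to expert 0 with probability 1.
collapsed = aux_loss([0] * 8, [[1.0] + [0.0] * 7] * 8, 8)
print(uniform, collapsed)  # collapse is penalized 8x more than uniform
```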

Where MoE Layers Go in the Transformer

MoE replaces the standard FFN in transformer blocks. Different architectures choose different placement strategies.

Standard Transformer Block with MoE

[Diagram] Input + positional encoding → multi-head self-attention → add & norm → MoE feed-forward (8 experts, top-2; replaces the standard FFN) → add & norm → output.

Every Layer: All FFN layers replaced with MoE

Design tradeoff: "Every Layer" (used by Mixtral and Switch Transformer) maximizes capacity but increases memory and communication overhead. "Alternating" (used by GShard and GLaM) balances capacity and efficiency: MoE only on every other layer, dense FFN on the rest.
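The two placement strategies can be expressed as a tiny layout sketch; the block labels and function name here are illustrative, not a real library API:

```python
def build_ffn_layout(n_layers=8, strategy="alternating"):
    # "every": MoE in every block; "alternating": MoE on odd blocks only.
    layout = []
    for layer in range(n_layers):
        if strategy == "every" or layer % 2 == 1:
            layout.append("MoE(8 experts, top-2)")
        else:
            layout.append("DenseFFN")
    return layout

print(build_ffn_layout())                  # dense and MoE blocks interleaved
print(build_ffn_layout(strategy="every"))  # MoE in every block
```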