Mixture of Experts


How sparse models achieve massive capacity with fixed compute cost

8 Experts · Top-2 Routing · 4x Capacity · Same FLOPs

Dense vs Sparse — The Key Insight

A dense model activates every parameter for every token. An MoE model activates only a fraction, achieving more capacity with the same compute.

Dense Feed-Forward Network

[Diagram] Input token → single large FFN (N parameters, ALL active, 100% utilization per token) → output.
N params total = N params active

Mixture of Experts FFN

[Diagram] Input token → router (gating network) → eight expert FFNs (E0–E7) → top-2 selected → weighted sum → output.
8N params total, only 2N active
4x more parameters than a compute-matched dense FFN, at the same per-token cost
Key Insight: MoE gives the model a larger "brain" (more total parameters to store knowledge) while keeping per-token computation fixed. Each token consults only 2 of the 8 expert sub-networks, so per-token compute, and hence throughput, matches that of a dense model with just 2N parameters.
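The parameter-versus-compute accounting above can be sketched numerically. The layer sizes below (d_model, d_ff) are illustrative, not taken from any specific model:

```python
# Back-of-the-envelope accounting for dense vs MoE FFN.
# d_model and d_ff are illustrative sizes, not from a specific model.
d_model, d_ff = 4096, 14336
ffn_params = 2 * d_model * d_ff          # up-projection + down-projection

n_experts, top_k = 8, 2
moe_total = n_experts * ffn_params       # 8N parameters stored
moe_active = top_k * ffn_params          # only 2N touched per token

# Per-token compute matches a dense FFN of moe_active parameters,
# while the model stores moe_total parameters' worth of knowledge.
print(moe_total // moe_active)           # → 4
```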

Inside the MoE Layer

Walk through each step of the Mixture of Experts forward pass, from token arrival to weighted output.

Step 1: Token Arrives

A token embedding vector enters the MoE layer, ready to be routed to the best experts.

[Diagram] Token embedding x = [0.3, -0.7, 1.2, 0.1, -0.4, 0.8, ...] (d_model = 4096) enters the layer; the eight expert FFNs E0–E7 sit downstream, waiting for the router's decision.
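The router itself is just a small linear layer followed by a softmax and a top-k selection. A minimal sketch in plain Python; the toy token and router weight matrix are invented for illustration (a real d_model would be e.g. 4096):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_top2(token, w_router):
    # One logit per expert: the dot product of the token with that
    # expert's row of the router weight matrix.
    logits = [sum(t * w for t, w in zip(token, row)) for row in w_router]
    probs = softmax(logits)
    # Keep the two highest-probability experts...
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # ...and renormalize their gate values so they sum to 1.
    z = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / z) for i in top2]

# Toy example: 4-dim token, 8 experts (weights are made up).
token = [1.0, 0.0, -1.0, 0.5]
w_router = [[(i + 1) * 0.1] * 4 for i in range(8)]
print(route_top2(token, w_router))  # two (expert index, gate weight) pairs
```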

Token Routing in Action

Watch how a sequence of tokens is distributed across experts. Each token picks its top-2 experts via the router.

[Animation] Tokens stream in one at a time; a bar chart shows the expert load distribution, i.e. how many tokens each expert receives.
Load Balancing Problem: Notice how some experts receive many tokens while others receive few. This imbalance wastes capacity and can bottleneck training throughput.
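The collapse is easy to reproduce: give one expert even a modest head start in its routing logits and it soaks up most of the traffic. A toy simulation; the Gaussian logits and the bias value are invented for illustration:

```python
import random
random.seed(0)

N_EXPERTS, TOP_K, N_TOKENS = 8, 2, 100
loads = [0] * N_EXPERTS
for _ in range(N_TOKENS):
    logits = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
    logits[0] += 2.0   # expert 0 starts slightly "better" than the rest
    top2 = sorted(range(N_EXPERTS), key=lambda i: logits[i],
                  reverse=True)[:TOP_K]
    for e in top2:
        loads[e] += 1

print(loads)  # expert 0's count dwarfs the others
```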

Load Balancing & Auxiliary Loss

Without intervention, the router collapses to a few "favorite" experts. An auxiliary loss term encourages uniform routing.

Without Auxiliary Loss

Router collapses to 2-3 experts. Most experts are underutilized.

With Auxiliary Loss

Aux loss encourages balanced routing across all experts.

Auxiliary load-balancing loss:

L_aux = α · N · Σ_{i=1..N} f_i · P_i

f_i = fraction of tokens routed to expert i
P_i = average routing probability the router assigns to expert i
N = number of experts

Why it works: f_i is non-differentiable, but it tracks P_i, so the loss behaves like Σ_i P_i², which is minimized when routing is perfectly uniform (f_i = P_i = 1/N for all i). Any imbalance increases the loss.
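The loss is cheap to compute from routing statistics. A minimal sketch, using top-1 assignments for simplicity; α = 0.01 is a typical but arbitrary choice, and the inputs are hypothetical:

```python
def aux_loss(assignments, probs, n_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i.

    assignments: chosen expert index per token (top-1 for simplicity)
    probs:       per-token softmax distribution over the experts
    """
    T = len(probs)
    f = [assignments.count(i) / len(assignments) for i in range(n_experts)]
    P = [sum(p[i] for p in probs) / T for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Perfectly uniform routing: each expert gets 1/8 of tokens and 1/8 probability.
uniform = aux_loss(list(range(8)), [[1 / 8] * 8] * 8, 8)
# Total collapse: every token goes to expert 0 with probability 1.
collapsed = aux_loss([0] * 8, [[1.0] + [0.0] * 7] * 8, 8)
print(uniform, collapsed)  # collapse is penalized 8x more than uniform
```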

Where MoE Layers Go in the Transformer

MoE replaces the standard FFN in transformer blocks. Different architectures choose different placement strategies.

Standard Transformer Block with MoE

[Diagram] Input + positional encoding → multi-head self-attention → add & norm → MoE feed-forward (8 experts, top-2; replaces the standard FFN) → add & norm → output.

Every Layer: All FFN layers replaced with MoE

Design tradeoff: "Every Layer" (used by Mixtral and Switch Transformer) maximizes capacity but increases memory and communication overhead. "Alternating" (used by GShard and GLaM) balances capacity and efficiency: MoE only on every other layer, dense FFN on the rest.
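The two placement strategies can be expressed as a tiny layout sketch; the block labels and function name here are illustrative, not a real library API:

```python
def build_ffn_layout(n_layers=8, strategy="alternating"):
    # "every": MoE in every block; "alternating": MoE on odd blocks only.
    layout = []
    for layer in range(n_layers):
        if strategy == "every" or layer % 2 == 1:
            layout.append("MoE(8 experts, top-2)")
        else:
            layout.append("DenseFFN")
    return layout

print(build_ffn_layout())                  # dense and MoE blocks interleaved
print(build_ffn_layout(strategy="every"))  # MoE in every block
```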