Expert Parallelism

[Figure: Expert Parallelism Visualized — how MoE experts are distributed across GPUs using All-to-All (A2A) communication. 8 experts, 4 GPUs, 2 experts per GPU.]

Too Many Experts for One GPU

In a Mixture-of-Experts model, each expert is a full FFN block. With 8 experts, a single GPU simply runs out of memory.

[Figure: GPU 0, single-GPU attempt. Attention layers + embeddings take ~15% of VRAM, but the 8 expert FFN blocks (E0–E7) would need ~120% of VRAM: out of memory. Each expert is a full FFN, the same size as the original dense FFN.]

The Solution: Expert Parallelism

Distribute the 8 experts across 4 GPUs — each GPU holds only 2 experts while keeping the full attention layers.

Distributing Experts Across GPUs

Each GPU holds a full copy of the non-expert parameters (attention, embeddings, LayerNorm) but only a subset of the expert FFN blocks.

GPU 0: Expert 0, Expert 1
GPU 1: Expert 2, Expert 3
GPU 2: Expert 4, Expert 5
GPU 3: Expert 6, Expert 7

Attention + Embeddings + LayerNorm are replicated in full on every GPU.
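The placement above can be sketched as a simple mapping. The contiguous block assignment (experts 0–1 on GPU 0, and so on) is one common choice; the names below are illustrative, not a specific library's API:

```python
# Minimal sketch of contiguous expert-to-GPU placement (names are illustrative).
NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 2, as in the figure

def owner_gpu(expert_id: int) -> int:
    """GPU rank that holds a given expert under contiguous placement."""
    return expert_id // EXPERTS_PER_GPU

placement = {
    gpu: [e for e in range(NUM_EXPERTS) if owner_gpu(e) == gpu]
    for gpu in range(NUM_GPUS)
}
print(placement)  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```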

All-to-All Communication Step by Step

The All-to-All dispatch and combine is the heart of expert parallelism. The following steps trace tokens as they flow between GPUs to reach their target experts.

Step 1 / 5 — Before All-to-All

Each GPU has a batch of tokens, and the router has assigned each token to a target expert.

[Figure: Source GPUs — each has local tokens. GPU 0 holds experts E0–E1, GPU 1 holds E2–E3, GPU 2 holds E4–E5, GPU 3 holds E6–E7.]
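The dispatch and combine round trip can be simulated in a single process. This is a sketch only: the token IDs, the batch of 4 tokens per GPU, and the random router are all assumptions, and a real implementation would use an All-to-All collective over expert-parallel ranks rather than Python dictionaries:

```python
import random

NUM_GPUS, NUM_EXPERTS = 4, 8
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 2 experts per GPU, as in the figure

random.seed(0)
# Each GPU starts with 4 local tokens; the router assigns each token an expert.
local_tokens = {
    g: [(f"g{g}t{i}", random.randrange(NUM_EXPERTS)) for i in range(4)]
    for g in range(NUM_GPUS)
}

def dispatch(tokens_by_gpu):
    """All-to-All dispatch: route each token to the GPU that owns its expert."""
    inbox = {g: [] for g in range(NUM_GPUS)}
    for src, toks in tokens_by_gpu.items():
        for tok, expert in toks:
            dst = expert // EXPERTS_PER_GPU        # owner GPU of that expert
            inbox[dst].append((tok, expert, src))  # keep source for the combine
    return inbox

def combine(inbox):
    """All-to-All combine: return each (processed) token to its source GPU."""
    outbox = {g: [] for g in range(NUM_GPUS)}
    for received in inbox.values():
        for tok, expert, src in received:
            outbox[src].append(tok)
    return outbox

inbox = dispatch(local_tokens)  # tokens now sit on their expert's GPU
returned = combine(inbox)       # expert outputs travel back to the source
# After dispatch + combine, every GPU has all of its original tokens back.
assert all(
    sorted(returned[g]) == sorted(t for t, _ in local_tokens[g])
    for g in range(NUM_GPUS)
)
```

The round-trip assertion at the end is the key invariant: dispatch and combine are inverse permutations of the same token set.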

EP Communication Analysis

Every MoE layer requires two All-to-All operations: dispatch (send tokens) and combine (receive results).

Comm per MoE layer = 2 × B × S × H

Two All-to-All collective operations per Mixture-of-Experts layer (dispatch + combine), where B = batch size, S = sequence length, and H = hidden dimension.
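Plugging assumed dimensions into the formula gives a feel for the volume. The values below (B = 8, S = 2048, H = 4096, bf16 activations at 2 bytes per element) are hypothetical:

```python
# Hypothetical dimensions: B=8 sequences, S=2048 tokens each, H=4096 hidden,
# bf16 activations (2 bytes per element). All values are assumptions.
B, S, H, BYTES_PER_ELEM = 8, 2048, 4096, 2

elements = 2 * B * S * H  # dispatch + combine, per MoE layer
gib = elements * BYTES_PER_ELEM / 2**30
print(f"{elements:,} elements = {gib:.2f} GiB per MoE layer")  # 0.25 GiB
```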

Expert Parallelism (All-to-All)
Tokens are dispatched to the target expert's GPU and the results are sent back, once per MoE layer.
Volume: 2 × B × S × H. Scales with tokens, not model size.

Tensor Parallelism (AllReduce)
Partial activations are reduced across the TP group, once per attention and FFN layer.
Volume: 2 × B × S × H × L. Scales with the number of layers (L).

Data Parallelism (AllReduce)
Gradients are synchronized across all DP ranks, once per training step.
Volume: 2 × P (all parameters). Scales with total model size.

Key Insight

EP communication scales with the number of tokens (B × S) and hidden size (H), not with model size. This makes EP particularly efficient for large MoE models with many experts — adding more experts doesn't increase communication, only adding more tokens does. In contrast, DP gradient sync scales with total parameter count, making it increasingly expensive as models grow.
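This scaling contrast can be sketched numerically. The dimensions and the rough per-expert parameter count below are assumptions chosen only to show the shapes of the two curves:

```python
# Assumed dimensions and a rough expert-FFN size (hypothetical numbers).
B, S, H = 8, 2048, 4096
EXPERT_PARAMS = 8 * H * H  # e.g. two projection matrices of width 4H per expert

def ep_elements(num_experts: int) -> int:
    """EP All-to-All volume per MoE layer: independent of the expert count."""
    return 2 * B * S * H

def dp_elements(num_experts: int) -> int:
    """DP gradient-sync volume per step (~2 x expert params): grows linearly."""
    return 2 * num_experts * EXPERT_PARAMS

for n in (8, 64):
    print(f"{n:2d} experts: EP {ep_elements(n):>13,}  DP {dp_elements(n):>15,}")
```

Going from 8 to 64 experts leaves the EP volume flat while the DP gradient sync grows 8×.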

EP + TP + DP Together

Expert Parallelism is another axis of parallelism. It can be combined with Tensor and Data Parallelism for maximum efficiency.

[Figure: three axes of parallelism — Tensor Parallel (TP), Expert Parallel (EP), Data Parallel (DP).]

Key Insight: EP is a third dimension of parallelism. Given a fixed GPU budget, it takes over GPUs that would otherwise go to TP or DP. Within a node (fast NVLink), you typically combine TP + EP; across nodes (slower network), you use DP. The total GPU count = TP × EP × DP.
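One way to sketch the resulting rank layout is a simple coordinate mapping. The degrees (TP = 2, EP = 4, DP = 2) and the TP-innermost ordering are illustrative assumptions, not a prescribed scheme:

```python
# Illustrative 3D layout (TP innermost, then EP, then DP); the degrees and
# the ordering are assumptions, not a prescribed scheme.
TP, EP, DP = 2, 4, 2  # total GPU count = TP * EP * DP = 16

def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to its (tp, ep, dp) coordinates."""
    tp = rank % TP
    ep = (rank // TP) % EP
    dp = rank // (TP * EP)
    return tp, ep, dp

# Every rank lands on a unique coordinate, covering the full 2 x 4 x 2 grid.
assert len({coords(r) for r in range(TP * EP * DP)}) == TP * EP * DP
print(coords(11))  # (1, 1, 1)
```

Putting TP innermost keeps TP (and EP) groups on consecutive ranks, which typically map to GPUs inside one node with fast NVLink, matching the placement advice above.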