Expert Parallelism

[Figure: Expert Parallelism Visualized — how MoE experts are distributed across GPUs using All-to-All (A2A) communication. 8 experts, 4 GPUs, 2 experts per GPU.]

Too Many Experts for One GPU

In a Mixture-of-Experts model, each expert is a full FFN block. With 8 experts, a single GPU simply runs out of memory.

[Figure: GPU 0, single-GPU attempt. Attention layers + embeddings take ~15% of VRAM, but the 8 expert FFN blocks (E0–E7) would need ~120% of VRAM: out of memory. Each expert is a full FFN, the same size as the original dense FFN.]

The Solution: Expert Parallelism

Distribute the 8 experts across 4 GPUs — each GPU holds only 2 experts while keeping the full attention layers.

Distributing Experts Across GPUs

Each GPU holds a full copy of the non-expert parameters (attention, embeddings, LayerNorm) but only a subset of the expert FFN blocks.

GPU 0: Expert 0, Expert 1
GPU 1: Expert 2, Expert 3
GPU 2: Expert 4, Expert 5
GPU 3: Expert 6, Expert 7

Attention + Embeddings + LayerNorm are replicated in full on every GPU.
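The placement above can be sketched as a simple mapping. The contiguous block assignment (experts 0–1 on GPU 0, and so on) is one common choice; the names below are illustrative, not a specific library's API:

```python
# Minimal sketch of contiguous expert-to-GPU placement (names are illustrative).
NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 2, as in the figure

def owner_gpu(expert_id: int) -> int:
    """GPU rank that holds a given expert under contiguous placement."""
    return expert_id // EXPERTS_PER_GPU

placement = {
    gpu: [e for e in range(NUM_EXPERTS) if owner_gpu(e) == gpu]
    for gpu in range(NUM_GPUS)
}
print(placement)  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```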

All-to-All Communication Step by Step

The All-to-All dispatch and combine is the heart of expert parallelism. The following steps trace tokens as they flow between GPUs to reach their target experts.

Step 1 / 5 — Before All-to-All

Each GPU has a batch of tokens, and the router has assigned each token to a target expert.

[Figure: Source GPUs — each has local tokens. GPU 0 holds experts E0–E1, GPU 1 holds E2–E3, GPU 2 holds E4–E5, GPU 3 holds E6–E7.]
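The dispatch and combine round trip can be simulated in a single process. This is a sketch only: the token IDs, the batch of 4 tokens per GPU, and the random router are all assumptions, and a real implementation would use an All-to-All collective over expert-parallel ranks rather than Python dictionaries:

```python
import random

NUM_GPUS, NUM_EXPERTS = 4, 8
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 2 experts per GPU, as in the figure

random.seed(0)
# Each GPU starts with 4 local tokens; the router assigns each token an expert.
local_tokens = {
    g: [(f"g{g}t{i}", random.randrange(NUM_EXPERTS)) for i in range(4)]
    for g in range(NUM_GPUS)
}

def dispatch(tokens_by_gpu):
    """All-to-All dispatch: route each token to the GPU that owns its expert."""
    inbox = {g: [] for g in range(NUM_GPUS)}
    for src, toks in tokens_by_gpu.items():
        for tok, expert in toks:
            dst = expert // EXPERTS_PER_GPU        # owner GPU of that expert
            inbox[dst].append((tok, expert, src))  # keep source for the combine
    return inbox

def combine(inbox):
    """All-to-All combine: return each (processed) token to its source GPU."""
    outbox = {g: [] for g in range(NUM_GPUS)}
    for received in inbox.values():
        for tok, expert, src in received:
            outbox[src].append(tok)
    return outbox

inbox = dispatch(local_tokens)  # tokens now sit on their expert's GPU
returned = combine(inbox)       # expert outputs travel back to the source
# After dispatch + combine, every GPU has all of its original tokens back.
assert all(
    sorted(returned[g]) == sorted(t for t, _ in local_tokens[g])
    for g in range(NUM_GPUS)
)
```

The round-trip assertion at the end is the key invariant: dispatch and combine are inverse permutations of the same token set.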

EP Communication Analysis

Every MoE layer requires two All-to-All operations: dispatch (send tokens) and combine (receive results).

Comm per MoE layer = 2 × B × S × H

Two All-to-All collective operations per Mixture-of-Experts layer (dispatch + combine), where B = batch size, S = sequence length, and H = hidden dimension.
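Plugging assumed dimensions into the formula gives a feel for the volume. The values below (B = 8, S = 2048, H = 4096, bf16 activations at 2 bytes per element) are hypothetical:

```python
# Hypothetical dimensions: B=8 sequences, S=2048 tokens each, H=4096 hidden,
# bf16 activations (2 bytes per element). All values are assumptions.
B, S, H, BYTES_PER_ELEM = 8, 2048, 4096, 2

elements = 2 * B * S * H  # dispatch + combine, per MoE layer
gib = elements * BYTES_PER_ELEM / 2**30
print(f"{elements:,} elements = {gib:.2f} GiB per MoE layer")  # 0.25 GiB
```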

Expert Parallelism (All-to-All)
Tokens are dispatched to the target expert's GPU and the results are sent back, once per MoE layer.
Volume: 2 × B × S × H. Scales with tokens, not model size.

Tensor Parallelism (AllReduce)
Partial activations are reduced across the TP group, once per attention and FFN layer.
Volume: 2 × B × S × H × L. Scales with the number of layers (L).

Data Parallelism (AllReduce)
Gradients are synchronized across all DP ranks, once per training step.
Volume: 2 × P (all parameters). Scales with total model size.

Key Insight

EP communication scales with the number of tokens (B × S) and hidden size (H), not with model size. This makes EP particularly efficient for large MoE models with many experts — adding more experts doesn't increase communication, only adding more tokens does. In contrast, DP gradient sync scales with total parameter count, making it increasingly expensive as models grow.
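This scaling contrast can be sketched numerically. The dimensions and the rough per-expert parameter count below are assumptions chosen only to show the shapes of the two curves:

```python
# Assumed dimensions and a rough expert-FFN size (hypothetical numbers).
B, S, H = 8, 2048, 4096
EXPERT_PARAMS = 8 * H * H  # e.g. two projection matrices of width 4H per expert

def ep_elements(num_experts: int) -> int:
    """EP All-to-All volume per MoE layer: independent of the expert count."""
    return 2 * B * S * H

def dp_elements(num_experts: int) -> int:
    """DP gradient-sync volume per step (~2 x expert params): grows linearly."""
    return 2 * num_experts * EXPERT_PARAMS

for n in (8, 64):
    print(f"{n:2d} experts: EP {ep_elements(n):>13,}  DP {dp_elements(n):>15,}")
```

Going from 8 to 64 experts leaves the EP volume flat while the DP gradient sync grows 8×.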

EP + TP + DP Together

Expert Parallelism is another axis of parallelism. It can be combined with Tensor and Data Parallelism for maximum efficiency.

[Figure: three axes of parallelism — Tensor Parallel (TP), Expert Parallel (EP), Data Parallel (DP).]

Key Insight: EP is a third dimension of parallelism. Given a fixed GPU budget, it takes over GPUs that would otherwise go to TP or DP. Within a node (fast NVLink), you typically combine TP + EP; across nodes (slower network), you use DP. The total GPU count = TP × EP × DP.
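One way to sketch the resulting rank layout is a simple coordinate mapping. The degrees (TP = 2, EP = 4, DP = 2) and the TP-innermost ordering are illustrative assumptions, not a prescribed scheme:

```python
# Illustrative 3D layout (TP innermost, then EP, then DP); the degrees and
# the ordering are assumptions, not a prescribed scheme.
TP, EP, DP = 2, 4, 2  # total GPU count = TP * EP * DP = 16

def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to its (tp, ep, dp) coordinates."""
    tp = rank % TP
    ep = (rank // TP) % EP
    dp = rank // (TP * EP)
    return tp, ep, dp

# Every rank lands on a unique coordinate, covering the full 2 x 4 x 2 grid.
assert len({coords(r) for r in range(TP * EP * DP)}) == TP * EP * DP
print(coords(11))  # (1, 1, 1)
```

Putting TP innermost keeps TP (and EP) groups on consecutive ranks, which typically map to GPUs inside one node with fast NVLink, matching the placement advice above.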