How MoE experts are distributed across GPUs using All-to-All communication
In a Mixture-of-Experts model, each expert is a full FFN block. With 8 experts, the expert parameters alone can exceed a single GPU's memory.
Distribute the 8 experts across 4 GPUs: each GPU holds only 2 experts while keeping the full attention layers.
Each GPU holds all the non-expert layers (attention, embeddings) but only a subset of the expert FFN blocks.
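This placement can be sketched in a few lines. The round-robin-by-block scheme and the helper names below are assumptions for illustration; real MoE frameworks have their own placement logic.

```python
# Minimal sketch: assign 8 experts to 4 GPUs in contiguous blocks.
# The helper names are hypothetical, chosen for this illustration.
NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 2 experts per GPU

def experts_on_gpu(gpu_rank: int) -> list[int]:
    """Expert ids hosted on a given GPU."""
    start = gpu_rank * EXPERTS_PER_GPU
    return list(range(start, start + EXPERTS_PER_GPU))

def gpu_for_expert(expert_id: int) -> int:
    """Inverse mapping: which GPU hosts a given expert."""
    return expert_id // EXPERTS_PER_GPU
```

With this mapping, GPU 0 hosts experts 0 and 1, GPU 3 hosts experts 6 and 7, and the router on any GPU can compute where a token must be sent.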
The All-to-All dispatch and combine are the heart of expert parallelism. Watch tokens flow between GPUs to reach their target experts.
Every MoE layer requires two All-to-All operations: dispatch (send each token to the GPU hosting its expert) and combine (return the results to each token's source GPU).
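The two-step flow can be simulated in a single process. The sketch below replaces the real collective (e.g. `torch.distributed.all_to_all_single` in PyTorch) with plain nested lists so the data movement is visible; buffer shapes, token format, and function names are assumptions for illustration.

```python
# Single-process sketch of an MoE layer's two All-to-All steps.
NUM_GPUS = 4
EXPERTS_PER_GPU = 2

def gpu_for_expert(expert_id):
    return expert_id // EXPERTS_PER_GPU

def all_to_all(send):
    """Simulated collective: send[src][dst] arrives as recv[dst][src]."""
    return [[send[src][dst] for src in range(NUM_GPUS)]
            for dst in range(NUM_GPUS)]

def moe_layer(tokens_per_gpu, expert_fn):
    """tokens_per_gpu[g] is a list of (token, expert_id) pairs on GPU g."""
    # Dispatch: each GPU buckets its tokens by destination GPU.
    send = [[[] for _ in range(NUM_GPUS)] for _ in range(NUM_GPUS)]
    for src, tokens in enumerate(tokens_per_gpu):
        for tok, expert in tokens:
            send[src][gpu_for_expert(expert)].append((tok, expert))
    recv = all_to_all(send)  # first All-to-All (dispatch)

    # Each GPU runs its local experts on the tokens it received,
    # keeping results bucketed by the GPU they came from.
    out = [[[expert_fn(tok, expert) for tok, expert in recv[gpu][src]]
            for src in range(NUM_GPUS)]
           for gpu in range(NUM_GPUS)]
    return all_to_all(out)  # second All-to-All (combine)
```

For example, a token on GPU 0 routed to expert 5 travels to GPU 2 (which hosts experts 4 and 5), is processed there, and the combine step delivers the result back to GPU 0.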
EP communication scales with the number of tokens (B × S) and hidden size (H), not with model size. This makes EP particularly efficient for large MoE models with many experts: adding more experts doesn't increase communication, only adding more tokens does. In contrast, DP gradient sync scales with the total parameter count, making it increasingly expensive as models grow.
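A back-of-envelope calculation makes the scaling difference concrete. The batch size, sequence length, hidden size, parameter count, and bf16 precision below are illustrative assumptions, not numbers from the text.

```python
# Illustrative comparison of per-step communication volume.
# EP All-to-All moves activations (~ B * S * H values per dispatch);
# DP gradient sync moves every parameter, regardless of batch size.
B, S, H = 8, 2048, 4096      # batch, sequence length, hidden size (assumed)
BYTES = 2                    # bf16 (assumed)

ep_dispatch = B * S * H * BYTES        # one dispatch; combine costs the same
num_params = 10 * 10**9                # a 10B-parameter model (assumed)
dp_grad_sync = num_params * BYTES

print(f"EP dispatch per MoE layer: {ep_dispatch / 1e6:.0f} MB")   # ~134 MB
print(f"DP gradient sync per step: {dp_grad_sync / 1e9:.1f} GB")  # 20.0 GB
# Doubling the expert count changes num_params but leaves ep_dispatch fixed.
```

The asymmetry is the point: growing the model by adding experts inflates the DP sync cost but leaves the EP All-to-All volume untouched.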
Expert Parallelism adds another axis of parallelism. It can be combined with Tensor and Data Parallelism for maximum efficiency.
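One common way to combine the axes is to view the GPUs as a 3D mesh of (data, expert, tensor) groups. The group sizes and rank-ordering convention below are assumptions for illustration, not a prescribed layout.

```python
# Hypothetical sketch: 16 GPUs arranged as a (DP=2, EP=4, TP=2) mesh,
# with the tensor-parallel dimension varying fastest across ranks.
DP, EP, TP = 2, 4, 2  # 2 * 4 * 2 = 16 GPUs

def coords(rank: int) -> tuple[int, int, int]:
    """Map a flat GPU rank to its (dp, ep, tp) mesh coordinates."""
    tp = rank % TP
    ep = (rank // TP) % EP
    dp = rank // (TP * EP)
    return dp, ep, tp
```

Under this layout, ranks sharing an `ep` coordinate sync gradients as a data-parallel group, ranks sharing `dp` and `tp` exchange tokens via All-to-All, and adjacent `tp` ranks shard each expert's weight matrices.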