A prompt comes in. Tokens must be generated one by one. Watch what happens inside the GPUs under each strategy, and why one moves roughly 6,400× less data per token.
A user sends a prompt. The model must generate a response, one token at a time. Each new token requires a complete forward pass through all 80 layers.
Before any token is generated, the 70B model's weights must be distributed across the 8 GPUs. ZeRO and TP distribute them very differently.
Each GPU holds 1/8 of the parameters — but these are row shards, so no GPU has a complete layer. Before it can compute anything, a GPU must first gather shards from all the other GPUs.
Each GPU holds its column slice of EVERY layer — it can immediately compute a partial result for any layer without fetching anything.
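The two layouts can be sketched with a few lines of arithmetic. This is an illustrative sketch, not either library's actual code; the model shape (70B params, 80 layers, d = 8192, d_ff = 28672, fp16) is an assumed Llama-70B-like configuration.

```python
# Sketch of how each strategy splits a 70B fp16 model across 8 GPUs.
# All sizes are illustrative assumptions (Llama-70B-like shape).
N_GPUS = 8
PARAMS = 70e9
BYTES = 2  # fp16

# Both strategies store the same total: 1/8 of the weights per GPU.
per_gpu_gb = PARAMS * BYTES / N_GPUS / 1e9
print(f"weights per GPU: {per_gpu_gb:.1f} GB under either strategy")

# ZeRO-3: each GPU owns a flat 1/8 slice of the parameter list -- no
# layer is complete locally, so every layer needs an all-gather first.
# TP: each weight matrix W (d x d_ff) is split by columns, so every GPU
# holds a full-height slice of EVERY layer and can matmul immediately.
d, d_ff = 8192, 28672
tp_slice_shape = (d, d_ff // N_GPUS)
print(f"TP column slice of one FFN matrix: {tp_slice_shape}")
```

The point of the sketch: memory per GPU is identical; only *which* bytes each GPU holds differs, and that determines what must move over the network at compute time.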
Watch what happens inside the GPUs as each token is generated. Click play to step through the forward pass for each token.
Let's see exactly what each GPU does during a single transformer layer's forward pass, for a single token being decoded.
A typical response is 200+ tokens. Each one pays the full communication cost. Drag the slider to see how total data transferred grows with response length.
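The slider's growth curve is just multiplication. A minimal sketch, using the per-token figures quoted later in this article (~16 GB for ZeRO-3, ~2.5 MB for TP):

```python
# Total network traffic vs. response length, per the article's
# per-token figures (~16 GB for ZeRO-3, ~2.5 MB for TP).
ZERO3_PER_TOKEN_GB = 16.0
TP_PER_TOKEN_GB = 2.5 / 1024  # 2.5 MB expressed in GB

for tokens in (1, 50, 200, 1000):
    zero3_gb = tokens * ZERO3_PER_TOKEN_GB
    tp_gb = tokens * TP_PER_TOKEN_GB
    print(f"{tokens:>5} tokens: ZeRO-3 {zero3_gb:>7.0f} GB | TP {tp_gb:6.2f} GB")
```

At 200 tokens, ZeRO-3 has moved about 3.2 TB of weights over the network; TP has moved about half a gigabyte of activations.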
ZeRO isn't a bad idea — the same all-gather that kills inference is barely noticeable during training. Here's why.
With batch = 32 and seq = 2048, each layer's FFN matmul alone does 2×32×2048×d×d_ff FLOPs.
The ~200 MB all-gather becomes a tiny fraction of total time.
During training you also run backward passes. ZeRO can overlap the next layer's weight gather with the current layer's backward computation, hiding the latency.
ZeRO's real value is memory savings: each GPU stores only 1/N of parameters, gradients, and optimizer states. This lets you train models that can't fit on one GPU.
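The memory math explains why. A common mixed-precision Adam accounting (following the ZeRO paper) is roughly 16 bytes per parameter: fp16 weights and gradients plus fp32 master weights, momentum, and variance. A minimal sketch under that assumption:

```python
# ZeRO-3 memory per GPU for a 70B model with mixed-precision Adam.
# 16 B/param = 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 optimizer
# states: master weights, momentum, variance) -- the ZeRO paper's
# accounting; actual frameworks vary slightly.
PARAMS = 70e9
BYTES_PER_PARAM = 16

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
for n_gpus in (1, 8, 64):
    print(f"{n_gpus:>3} GPUs: {total_gb / n_gpus:>7.1f} GB per GPU")
```

Unsharded, that is over a terabyte of state — far beyond any single GPU — while sharding across 64 GPUs brings it down to tens of gigabytes each.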
Adjust model parameters to see how communication costs change for each strategy.
| Metric | ZeRO-3 | Tensor Parallelism | Ratio |
|---|---|---|---|
| What moves per layer | ~200 MB weight all-gather | ~32 KB activation all-reduce | ~6,400× |
| Network traffic per token (80 layers) | ~16 GB | ~2.5 MB | ~6,400× |
| Weights resident on GPU | No (gathered, used, discarded) | Yes (always) | n/a |
During autoregressive decoding, each new token triggers a complete walk through all layers. The parallelism strategy's communication cost is paid per layer, per token.
ZeRO-3 must all-gather full weight matrices before each layer can compute. For a 70B model, that's ~16 GB of network traffic per token.
TP only all-reduces tiny activation vectors. Weights are always resident. Communication per token is ~2.5 MB — roughly 6,400× less than ZeRO-3.
With batch=1, seq=1, the actual matmul per layer takes microseconds. ZeRO-3's all-gather takes milliseconds. Communication completely dominates.
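The microseconds-vs-milliseconds gap can be checked in a few lines. As above, the throughput and bandwidth figures (~300 TFLOP/s, ~300 GB/s) are assumed ballparks, and the shape is the assumed Llama-70B-like configuration:

```python
# Decode step (batch=1, one token): the matmul is tiny, the all-gather
# is not. Hardware numbers are assumed ballparks, not measurements.
d, d_ff = 8192, 28672
flops = 2 * 1 * 1 * d * d_ff            # single-token FFN matmul
compute_us = flops / 300e12 * 1e6       # microseconds of compute
allgather_ms = 200e6 / 300e9 * 1e3      # 200 MB gather, milliseconds

print(f"matmul ~{compute_us:.1f} us, all-gather ~{allgather_ms:.2f} ms")
```

Under these assumptions the matmul finishes in a couple of microseconds while the gather takes hundreds of microseconds — the GPU spends nearly all its time waiting on the network.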
The same all-gather that's 83% of inference time is only 5% during training, because large batches create 65,000× more compute per layer.
vLLM, TGI, and TensorRT-LLM all use Tensor Parallelism within a node. ZeRO is reserved for training, where the memory savings justify the communication.