Follow every number through a complete training step. See exactly how optimizer state partitioning saves memory with zero communication overhead.
A deliberately tiny model so every matrix fits on screen. Every number is real and traceable.
Hidden dim d = 4
Heads h = 2 → d_k = 2
FFN inner = 16 (4× expansion)
Vocab size = 8
Sequence length T = 3 tokens
GPUs: 2 (GPU-0 & GPU-1)
Optimizer: Adam
Mixed precision: BF16
LN1 → Attention → residual → LN2 → FFN → residual → W_vocab → softmax
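The flow above can be sketched end to end in NumPy. This is a minimal illustration, not the demo's actual code: the weights here are random stand-ins (the real traced values live in the tables below), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_heads, d_ff, V, T = 4, 2, 16, 8, 3   # dims from the spec above
d_k = d // n_heads

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random stand-ins for the 260 real parameters
g1, b1_ln = np.ones(d), np.zeros(d)
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.5 for _ in range(4))
g2, b2_ln = np.ones(d), np.zeros(d)
W1, b1 = rng.standard_normal((d, d_ff)) * 0.5, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)) * 0.5, np.zeros(d)
W_vocab = rng.standard_normal((d, V)) * 0.5

x = np.array([[ 0.5, -0.3,  0.8,  0.1],
              [ 0.2,  0.7, -0.1,  0.4],
              [-0.6,  0.3,  0.5, -0.2]])   # (T=3, d=4)

# LN1 -> attention -> residual
h = layer_norm(x, g1, b1_ln)
q, k, v = h @ Wq, h @ Wk, h @ Wv
heads = []
for i in range(n_heads):
    sl = slice(i * d_k, (i + 1) * d_k)
    att = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(d_k))
    heads.append(att @ v[:, sl])
x = x + np.concatenate(heads, axis=1) @ Wo

# LN2 -> FFN (ReLU) -> residual
h = layer_norm(x, g2, b2_ln)
x = x + np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

# Output head -> softmax over the vocab
probs = softmax(x @ W_vocab)
print(probs.shape)  # (3, 8): one distribution over the 8-token vocab per position
```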
| Layer | Name | Shape | #Elements |
|---|---|---|---|
| LayerNorm 1 | γ₁ | (4,) | 4 |
| LayerNorm 1 | β₁ | (4,) | 4 |
| Attention | W_q | (4, 4) | 16 |
| Attention | W_k | (4, 4) | 16 |
| Attention | W_v | (4, 4) | 16 |
| Attention | W_o | (4, 4) | 16 |
| LayerNorm 2 | γ₂ | (4,) | 4 |
| LayerNorm 2 | β₂ | (4,) | 4 |
| FFN | W₁ | (4, 16) | 64 |
| FFN | b₁ | (16,) | 16 |
| FFN | W₂ | (16, 4) | 64 |
| FFN | b₂ | (4,) | 4 |
| Output Head | W_vocab | (4, 8) | 32 |
| **TOTAL** | | | **260** |
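The total can be checked directly from the shapes in the table. The dictionary keys below are just labels for this sketch:

```python
import math

# Parameter shapes from the table above
shapes = {
    "gamma1": (4,), "beta1": (4,),
    "W_q": (4, 4), "W_k": (4, 4), "W_v": (4, 4), "W_o": (4, 4),
    "gamma2": (4,), "beta2": (4,),
    "W1": (4, 16), "b1": (16,), "W2": (16, 4), "b2": (4,),
    "W_vocab": (4, 8),
}
total = sum(math.prod(s) for s in shapes.values())
print(total)  # 260
```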
Mixed-precision Adam training keeps 16 bytes of state per parameter, and without ZeRO most of it is redundantly replicated on every GPU.
| What | Precision | Bytes/Element |
|---|---|---|
| Parameter (fwd/bwd) | BF16 | 2 |
| Gradient (after backward) | BF16 | 2 |
| Master copy of parameter | FP32 | 4 |
| First moment m (Adam) | FP32 | 4 |
| Second moment v (Adam) | FP32 | 4 |
| **TOTAL per element** | | **16** |
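Summing the table confirms both the 16 bytes per parameter and the fraction that ZeRO-1 can partition; a quick sanity check:

```python
bytes_per_element = {
    "param_bf16": 2, "grad_bf16": 2,
    "master_fp32": 4, "adam_m_fp32": 4, "adam_v_fp32": 4,
}
per_param = sum(bytes_per_element.values())
print(per_param)            # 16 bytes per parameter
print(260 * per_param)      # 4160 bytes of training state for the toy model

# The FP32 master copy + Adam moments are what ZeRO-1 partitions:
optimizer_fraction = (4 + 4 + 4) / per_param
print(optimizer_fraction)   # 0.75
```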
All 260 parameters are flattened into a single vector and split evenly between the GPUs; each GPU owns the optimizer state (master copy, m, and v) for its 130-element slice only.
Hover over each segment to see which parameters it contains.
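The partitioning itself is a simple even split of the flattened vector. A sketch of the 130/130 split (the values are stand-ins, not the real parameters):

```python
import numpy as np

total_params, n_gpus = 260, 2
flat = np.arange(total_params, dtype=np.float32)  # stand-in for the flat param vector
slices = np.array_split(flat, n_gpus)             # even 130/130 split
for rank, s in enumerate(slices):
    print(f"GPU-{rank}: elements {s[0]:.0f}..{s[-1]:.0f} ({len(s)} params)")
```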
Click through each phase to see exactly what happens on each GPU, with real numbers.
Each GPU independently runs the full model on its own micro-batch. GPU-0 processes batch A, GPU-1 processes batch B.
x (T = 3 tokens × d = 4, one row per token):
[ 0.5, -0.3,  0.8,  0.1 ]
[ 0.2,  0.7, -0.1,  0.4 ]
[-0.6,  0.3,  0.5, -0.2 ]
Instead of all-reduce (giving everyone the full average), reduce-scatter gives each GPU only the averaged gradient for its assigned slice.
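Simulated in-process with random stand-in gradients, the difference looks like this: after reduce-scatter, each rank holds only its averaged 130-element chunk instead of the full 260-element average.

```python
import numpy as np

n_gpus, n = 2, 260
rng = np.random.default_rng(0)
# Each GPU holds a full local gradient from its own micro-batch.
local_grads = [rng.standard_normal(n) for _ in range(n_gpus)]

avg = sum(local_grads) / n_gpus            # what all-reduce would give *everyone*
chunks = np.array_split(avg, n_gpus)
# reduce-scatter: rank r ends up with only the averaged chunk r
owned = {rank: chunks[rank] for rank in range(n_gpus)}
print(owned[0].shape, owned[1].shape)      # (130,) (130,)
```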
Each GPU runs Adam on only its 130-element slice. Let's trace one element: W_q[0,0].
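The per-element update follows the standard Adam recurrence. The numbers below are illustrative stand-ins, not the demo's actual W_q[0,0] values:

```python
# One Adam step for a single element (illustrative values)
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3
w, g = 0.42, 0.10        # FP32 master weight and its averaged gradient
m = v = 0.0              # fresh first/second moments
t = 1                    # step count

m = beta1 * m + (1 - beta1) * g          # 0.01
v = beta2 * v + (1 - beta2) * g * g      # 1e-5
m_hat = m / (1 - beta1 ** t)             # bias-corrected: 0.1
v_hat = v / (1 - beta2 ** t)             # 0.01
w -= lr * m_hat / (v_hat ** 0.5 + eps)   # w ≈ 0.419
print(round(w, 6))
```

The updated FP32 master value is then cast back to BF16 for the next forward pass.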
In the all-gather phase, each GPU sends its freshly updated parameter slice to every other GPU, so all ranks end the step with identical, fully updated parameters.
ZeRO-1 doesn't add any communication — a standard all-reduce is already reduce-scatter + all-gather internally.
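That identity is easy to verify numerically: reducing per-chunk and then gathering the chunks reproduces the all-reduce result exactly (ZeRO-1 simply runs the Adam step on each owned chunk between the two phases).

```python
import numpy as np

n_gpus, n = 2, 260
rng = np.random.default_rng(1)
grads = [rng.standard_normal(n) for _ in range(n_gpus)]

# Path A: plain all-reduce -- every rank gets the full average.
all_reduced = sum(grads) / n_gpus

# Path B: reduce-scatter (each rank averages only its chunk), then all-gather.
owned = [sum(np.array_split(g, n_gpus)[r] for g in grads) / n_gpus
         for r in range(n_gpus)]
gathered = np.concatenate(owned)

print(np.allclose(all_reduced, gathered))  # True: same result, same bytes moved
```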
Drag the slider to see how ZeRO-1 scales with GPU count.
| Component | No ZeRO | ZeRO-1 (8 GPUs) |
|---|---|---|
| Params (BF16) | 14.0 GB | 14.0 GB |
| Gradients (BF16) | 14.0 GB | 14.0 GB |
| Optimizer m (FP32) | 28.0 GB | 3.5 GB (÷8) |
| Optimizer v (FP32) | 28.0 GB | 3.5 GB (÷8) |
| Master params (FP32) | 28.0 GB | 3.5 GB (÷8) |
| TOTAL per GPU | 112.0 GB | 38.5 GB |
| Saving | — | 65.6% |
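The table follows from a one-line formula: params and grads stay replicated at 2 bytes each, while the 12 bytes of optimizer state are sharded N ways. A 7B-parameter model matches the 14 GB BF16 figure above.

```python
def zero1_bytes_per_param(n_gpus: int) -> float:
    # 2 B params + 2 B grads replicated; 12 B optimizer state sharded N ways
    return 4 + 12 / n_gpus

params = 7e9  # a 7B-parameter model (14 GB of BF16 params)
for n in (1, 8, 64):
    gb = params * zero1_bytes_per_param(n) / 1e9
    print(f"N={n:>2}: {zero1_bytes_per_param(n):.4g} B/param -> {gb:.1f} GB/GPU")
```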
The complete memory layout on each GPU after training step 1.
Adam stores m, v, and master params in FP32 — that's 12 of the 16 bytes per parameter (75%). ZeRO-1 partitions exactly this.
All-reduce = reduce-scatter + all-gather. ZeRO-1 just inserts the optimizer step between the two phases. No extra bytes moved.
Memory per GPU: (4 + 12/N) bytes per param. With 8 GPUs: 5.5 B/param vs 16 B/param — a 65.6% saving. With 64 GPUs: 4.2 B/param — 73.8%.
Unlike ZeRO-2/3, every GPU keeps the full parameters and gradients. This means no extra communication during forward/backward passes.
ZeRO-1 is pure upside: memory savings at essentially zero performance cost. That makes it the natural first ZeRO stage to enable in DeepSpeed before reaching for ZeRO-2/3.
Per GPU: full params (2B) + full grads (2B) + 1/N×optimizer (12B/N). As N→∞, approaches 4 bytes/param — just params + grads.