ZeRO Optimization · Stage 1

ZeRO-1: A Concrete Walkthrough

Follow every number through a complete training step. See exactly how optimizer state partitioning saves memory with zero communication overhead.

260 total parameters · 37.5% memory saved (2 GPUs) · 0 extra communication · 4 steps per iteration

Our Tiny Transformer

A deliberately tiny model so every matrix fits on screen. Every number is real and traceable.

Dimensions

Hidden dim d = 4
Heads h = 2, head dim d_k = 2
FFN inner = 16 (4× expansion)
Vocab size = 8

Training Setup

Sequence length T = 3 tokens
GPUs: 2 (GPU-0 & GPU-1)
Optimizer: Adam
Mixed precision: BF16

Architecture Flow

LN1 → Attention → residual → LN2 → FFN → residual → W_vocab → softmax

All 260 Parameters — Itemized

Layer        | Name    | Shape   | # Elements
LayerNorm 1  | γ₁      | (4,)    | 4
             | β₁      | (4,)    | 4
Attention    | W_q     | (4, 4)  | 16
             | W_k     | (4, 4)  | 16
             | W_v     | (4, 4)  | 16
             | W_o     | (4, 4)  | 16
LayerNorm 2  | γ₂      | (4,)    | 4
             | β₂      | (4,)    | 4
FFN          | W₁      | (4, 16) | 64
             | b₁      | (16,)   | 16
             | W₂      | (16, 4) | 64
             | b₂      | (4,)    | 4
Output Head  | W_vocab | (4, 8)  | 32
TOTAL        |         |         | 260
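The total can be checked directly from the shapes (a quick standalone sketch, not part of any framework):

```python
# Recompute the 260-parameter total from the shapes in the table above.
shapes = {
    "gamma1": (4,), "beta1": (4,),
    "W_q": (4, 4), "W_k": (4, 4), "W_v": (4, 4), "W_o": (4, 4),
    "gamma2": (4,), "beta2": (4,),
    "W1": (4, 16), "b1": (16,), "W2": (16, 4), "b2": (4,),
    "W_vocab": (4, 8),
}

def numel(shape):
    n = 1
    for dim in shape:
        n *= dim
    return n

total = sum(numel(s) for s in shapes.values())
print(total)  # 260
```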

Concrete Parameter Values

W_q (Query weights, 4×4)

 0.12   0.34  -0.21   0.05
-0.15   0.22   0.11  -0.08
 0.30  -0.10   0.18   0.27
-0.05   0.14  -0.33   0.09

W_k (Key weights, 4×4)

 0.20  -0.11   0.07   0.15
 0.03   0.28  -0.14   0.10
-0.22   0.06   0.31  -0.05
 0.17  -0.09   0.02   0.24

W_v (Value weights, 4×4)

 0.08   0.19  -0.12   0.25
-0.07   0.33   0.04  -0.16
 0.21  -0.03   0.15   0.11
-0.14   0.08   0.26  -0.02

W_vocab (Output head, 4×8)

 0.11  -0.05   0.18   0.03  -0.14   0.22  -0.09   0.07
-0.06   0.13   0.02  -0.17   0.08   0.04   0.20  -0.11
 0.15  -0.02   0.09   0.21  -0.08   0.12  -0.03   0.16
-0.10   0.07  -0.13   0.05   0.19  -0.06   0.14   0.01

Why ZeRO-1 Matters

For each parameter, Adam stores 16 bytes. Most of that is redundantly copied across GPUs.

Memory per Parameter Element

What                      | Precision | Bytes/Element
Parameter (fwd/bwd)       | BF16      | 2
Gradient (after backward) | BF16      | 2
Master copy of parameter  | FP32      | 4
First moment m (Adam)     | FP32      | 4
Second moment v (Adam)    | FP32      | 4
TOTAL per element         |           | 16
Key insight: Optimizer states (master params + m + v) account for 12 out of 16 bytes — that's 75% of per-parameter memory. This is what ZeRO-1 targets.

Interactive Memory Comparison

[Interactive chart: per-GPU memory breakdown for GPU-0 and GPU-1: params (BF16), gradients (BF16), optimizer states (FP32), and memory saved.]

Memory per GPU without ZeRO: 260 × 16 = 4,160 bytes

Who Owns What

All 260 parameters are flattened into a single vector and split evenly between GPUs.

Flattened Parameter Vector (260 elements)

Hover over each segment to see which parameters it contains.

GPU-0 owns optimizer — indices 0–129
GPU-1 owns optimizer — indices 130–259
GPU-0 Optimizer Slice (130 elements)
γ₁, β₁: 8 elements
W_q: 16 elements
W_k: 16 elements
W_v: 16 elements
W_o: 16 elements
γ₂, β₂: 8 elements
W₁ (first 50 of 64 elements): 50 elements
Optimizer states stored: m[130], v[130], p[130] = 1,560 bytes
GPU-1 Optimizer Slice (130 elements)
W₁ (last 14 of 64 elements): 14 elements
b₁: 16 elements
W₂: 64 elements
b₂: 4 elements
W_vocab: 32 elements
Optimizer states stored: m[130], v[130], p[130] = 1,560 bytes
Both GPUs still store the full model. All 260 parameters in BF16 (520 bytes) and all 260 gradients in BF16 (520 bytes) are kept on every GPU — they're needed for forward and backward passes. Only the optimizer states (m, v, master params) are partitioned.
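The ownership split can be sketched in a few lines (names here are illustrative, not DeepSpeed's API):

```python
import numpy as np

# Illustrative sketch of ZeRO-1 ownership: flatten every parameter into one
# vector, then split it evenly; GPU i keeps optimizer state only for slice i.
N_GPUS = 2
TOTAL_PARAMS = 260

flat = np.arange(TOTAL_PARAMS)            # stand-in for the flattened parameters
slices = np.array_split(flat, N_GPUS)

print(len(slices[0]), len(slices[1]))     # 130 130
print(slices[0][0], slices[0][-1])        # 0 129   (GPU-0 owns indices 0..129)
print(slices[1][0], slices[1][-1])        # 130 259 (GPU-1 owns indices 130..259)

# Per-GPU optimizer bytes: m, v, and master params, each FP32 (4 bytes)
opt_bytes = 3 * 4 * len(slices[0])
print(opt_bytes)                          # 1560
```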

One Complete Training Step

Click through each phase to see exactly what happens on each GPU, with real numbers.

Step 1: Forward & Backward Pass (Local Compute)

Each GPU independently runs the full model on its own micro-batch. GPU-0 processes batch A, GPU-1 processes batch B.

Input (after embedding): Shape (3, 4) — 3 tokens, hidden dim 4
x = [0.5, -0.3, 0.8, 0.1; 0.2, 0.7, -0.1, 0.4; -0.6, 0.3, 0.5, -0.2]

Forward Pass Trace (GPU-0)

LayerNorm 1
μ = (0.5 - 0.3 + 0.8 + 0.1)/4 = 0.275
σ = √var = √0.1719 ≈ 0.415
x₁[0] = [0.543, -1.387, 1.266, -0.422]
Attention (Q computation)
Q = x₁ · W_q → (3,4)×(4,4) = (3,4)
Q[0] = x₁[0] · W_q
  = [0.674, -0.306, ...]
Split: Head 0 = Q[:,0:2], Head 1 = Q[:,2:4]
FFN
x_ffn = ReLU(x₂ · W₁ + b₁) · W₂ + b₂
  (3,4)×(4,16) = (3,16)
  (3,16)×(16,4) = (3,4)
Output + Loss
logits = x · W_vocab → (3,8)
z = softmax(logits) → (3,8)
Loss = CrossEntropy(z, targets) → scalar
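The LayerNorm arithmetic for the first token can be checked with NumPy, assuming γ = 1, β = 0 and the biased (divide-by-n) variance that LayerNorm uses:

```python
import numpy as np

# Recompute mu, sigma, and the normalized first token of the trace above.
x0 = np.array([0.5, -0.3, 0.8, 0.1])
mu = x0.mean()                  # (0.5 - 0.3 + 0.8 + 0.1) / 4 = 0.275
sigma = x0.std()                # sqrt(0.171875), about 0.415
x1 = (x0 - mu) / sigma          # normalized token (gamma=1, beta=0)

print(round(float(mu), 3))      # 0.275
print(np.round(x1, 3))          # [ 0.543 -1.387  1.266 -0.422]
```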

Backward Pass Results — Concrete Gradients for W_q

GPU-0 (micro-batch A)

g_q(A) = ∂Loss_A / ∂W_q
 0.023  -0.011   0.045  -0.008
-0.031   0.019  -0.007   0.014
 0.012  -0.028   0.033  -0.005
-0.016   0.009  -0.021   0.038

GPU-1 (micro-batch B)

g_q(B) = ∂Loss_B / ∂W_q
 0.017  -0.025   0.031  -0.013
-0.009   0.041  -0.018   0.006
 0.028  -0.014   0.022  -0.035
-0.020   0.016  -0.012   0.027
Both GPUs now hold 260 gradient values (one per parameter) in BF16. But they're from different micro-batches — we need to average them.

Step 2: Reduce-Scatter (Communicate Gradients)

Instead of all-reduce (giving everyone the full average), reduce-scatter gives each GPU only the averaged gradient for its assigned slice.

Reduce-Scatter Operation
GPU-0: has g(A)[0:260]; sends g(A)[130:260]; receives avg_g[0:130]
GPU-1: has g(B)[0:260]; sends g(B)[0:130]; receives avg_g[130:260]

Concrete: Averaged Gradient for W_q (GPU-0 receives this)

avg_g_q = (g_q(A) + g_q(B)) / 2
avg_g_q (GPU-0 only)
 0.020   -0.018    0.038   -0.0105
-0.020    0.030   -0.0125   0.010
 0.020   -0.021    0.0275  -0.020
-0.018    0.0125  -0.0165   0.0325
GPU-1 does NOT have avg_g_q! It only has the averaged gradients for indices 130–259. Similarly, GPU-0 does not have avg_g for W_vocab.
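Reduce-scatter is easy to simulate for two ranks; a minimal sketch (illustrative, not NCCL) using the first W_q row's gradients from above:

```python
import numpy as np

# Each rank starts with a full gradient vector from its own micro-batch;
# afterwards rank i holds only the AVERAGED gradient for its own slice.
local_grads = [
    np.array([0.023, -0.011, 0.045, -0.008]),   # GPU-0: first W_q row, batch A
    np.array([0.017, -0.025, 0.031, -0.013]),   # GPU-1: same elements, batch B
]

world = len(local_grads)
avg = sum(local_grads) / world          # the "reduce" (here: mean) part
received = np.array_split(avg, world)   # the "scatter" part: rank i keeps slice i

print(np.round(received[0], 4))         # GPU-0 gets [ 0.02  -0.018]
print(np.round(received[1], 4))         # GPU-1 gets [ 0.038 -0.0105]
```

The scatter at the end is what distinguishes this from all-reduce: no rank ever materializes the full averaged vector.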

Step 3: Adam Optimizer Step (Local, on Owned Slice Only)

Each GPU runs Adam on only its 130-element slice. Let's trace one element: W_q[0,0].

GPU-0 updates W_q[0,0] — Adam at t=1
Hyperparameters: lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8
Current param:
p = 0.12
Avg gradient:
g = 0.020
Update m:
m = 0.9 × 0.0 + 0.1 × 0.020 = 0.002
Update v:
v = 0.999 × 0.0 + 0.001 × 0.020² = 0.0000004
Bias-correct m̂:
m̂ = 0.002 / (1 - 0.9¹) = 0.002 / 0.1 = 0.02
Bias-correct v̂:
v̂ = 0.0000004 / (1 - 0.999¹) = 0.0000004 / 0.001 = 0.0004
Update param:
p_new = 0.12 - 0.001 × 0.02 / (√0.0004 + 1e-8)
= 0.12 - 0.001 × 0.02 / 0.02 = 0.12 - 0.001 = 0.119
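The same arithmetic as a runnable function, a plain re-implementation of Adam's update rule (not tied to any framework):

```python
import math

# One Adam update with lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
# starting from zero-initialized moments m and v.
def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Trace W_q[0,0] at t=1 exactly as above
p_new, m, v = adam_step(p=0.12, g=0.020, m=0.0, v=0.0, t=1)
print(round(p_new, 4))  # 0.119
```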
GPU-0 After Optimizer Step
Params [0:130]: UPDATED (new)
Params [130:260]: OLD values
m₀[130], v₀[130]: updated
GPU-1 After Optimizer Step
Params [0:130]: OLD values
Params [130:260]: UPDATED (new)
m₁[130], v₁[130]: updated
The model is now inconsistent! Each GPU has half new, half old parameters. We need one more communication step.

Step 4: All-Gather (Sync Updated Parameters)

Each GPU broadcasts its freshly updated parameter slice to all others.

All-Gather Operation
GPU-0: sends new_params[0:130]; receives new_params[130:260]; result: FULL updated model
GPU-1: sends new_params[130:260]; receives new_params[0:130]; result: FULL updated model

W_q After Update (Identical on Both GPUs)

W_q (updated)
 0.119   0.341  -0.211   0.051
-0.149   0.219   0.111  -0.081
 0.299  -0.099   0.179   0.271
-0.049   0.139  -0.329   0.089
Both GPUs now have the identical, fully updated model. Ready for the next training step!

Zero Extra Cost

ZeRO-1 doesn't add any communication — a standard all-reduce is already reduce-scatter + all-gather internally.

Standard All-Reduce (No ZeRO)

Send: 260 values × 2 bytes = 520 B
Recv: 260 values × 2 bytes = 520 B
Total: 1,040 bytes
Internally: reduce-scatter + all-gather (same two phases!)

ZeRO-1 (Our approach)

Reduce-Scatter: 260 × 2 = 520 B
All-Gather: 260 × 2 = 520 B
Total: 1,040 bytes
Identical! We just insert the optimizer step between the two phases.
The key realization:
All-Reduce = Reduce-Scatter + All-Gather
ZeRO-1 = Reduce-Scatter + Optimizer Step + All-Gather
Same bytes moved. Memory saved for free.
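The whole step can be simulated end-to-end on 2 ranks; a minimal sketch (illustrative, not DeepSpeed's API; the Adam helper hardcodes zero-initialized moments since this is step t=1) using the first six parameters and gradients from above:

```python
import numpy as np

# ZeRO-1 step: reduce-scatter the gradients, run Adam locally on the owned
# slice, then all-gather the updated slices so every rank has the full model.
def adam(p, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = (1 - b1) * g                     # m, v start at zero (t = 1)
    v = (1 - b2) * g * g
    return p - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)

world = 2
params = np.array([0.12, 0.34, -0.21, 0.05, -0.15, 0.22])           # first 6 of W_q
grads = [np.array([0.023, -0.011, 0.045, -0.008, -0.031, 0.019]),   # rank 0
         np.array([0.017, -0.025, 0.031, -0.013, -0.009, 0.041])]   # rank 1

avg = sum(grads) / world                         # reduce...
g_slices = np.array_split(avg, world)            # ...scatter
p_slices = np.array_split(params, world)

new_slices = [adam(p_slices[r], g_slices[r], t=1) for r in range(world)]
updated = np.concatenate(new_slices)             # all-gather

print(np.round(updated, 3))  # [ 0.119  0.341 -0.211  0.051 -0.149  0.219]
```

Note the optimizer step runs between the two collective phases; nothing else about the communication pattern changes.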

From Toy Model to 7B Parameters

Drag the slider to see how ZeRO-1 scales with GPU count.

Memory Calculator

Number of GPUs 8
Model Parameters (Billions) 7B

Per-GPU Memory Breakdown

[Interactive chart: stacked bar per GPU of params (BF16), gradients (BF16), and optimizer states (FP32).]

Reference: 7B Parameters on 8 GPUs

Component            | No ZeRO  | ZeRO-1 (8 GPUs)
Params (BF16)        | 14.0 GB  | 14.0 GB
Gradients (BF16)     | 14.0 GB  | 14.0 GB
Optimizer m (FP32)   | 28.0 GB  | 3.5 GB (÷8)
Optimizer v (FP32)   | 28.0 GB  | 3.5 GB (÷8)
Master params (FP32) | 28.0 GB  | 3.5 GB (÷8)
TOTAL per GPU        | 112.0 GB | 38.5 GB
Saving               |          | 65.6%

What Lives Where After One Step

The complete memory layout on each GPU after training step 1.

GPU-0
θ[260] in BF16 (FULL model, updated): 520 B
g[260] in BF16 (can be freed): 520 B
Optimizer (ONLY slice 0:130):
  m₀[130] in FP32 (first moments): 520 B
  v₀[130] in FP32 (second moments): 520 B
  p₀[130] in FP32 (master params): 520 B
TOTAL: 2,600 bytes

GPU-1
θ[260] in BF16 (FULL model, updated): 520 B
g[260] in BF16 (can be freed): 520 B
Optimizer (ONLY slice 130:260):
  m₁[130] in FP32 (first moments): 520 B
  v₁[130] in FP32 (second moments): 520 B
  p₁[130] in FP32 (master params): 520 B
TOTAL: 2,600 bytes
Compare to standard data-parallel (no ZeRO): 4,160 bytes per GPU.
ZeRO-1 saves 37.5% memory with our 2-GPU setup, and the saving grows with more GPUs.

The Key Insights

01

Optimizer States Dominate

Adam stores m, v, and master params in FP32 — that's 12 of the 16 bytes per parameter (75%). ZeRO-1 partitions exactly this.

02

Same Communication, Less Memory

All-reduce = reduce-scatter + all-gather. ZeRO-1 just inserts the optimizer step between the two phases. No extra bytes moved.

03

Scales with GPU Count

Memory per GPU: (4 + 12/N) bytes per param. With 8 GPUs: 5.5 B/param vs 16 B/param — a 65.6% saving. With 64 GPUs: 4.2 B/param — 73.8%.

04

Full Model on Every GPU

Unlike ZeRO-2/3, every GPU keeps the full parameters and gradients. This means no extra communication during forward/backward passes.

05

The Default Choice

ZeRO-1 is pure upside: the memory savings come with no extra communication volume, which makes it the natural first stage to enable in DeepSpeed when memory gets tight.

06

The Formula

Per GPU: full params (2B) + full grads (2B) + 1/N×optimizer (12B/N). As N→∞, approaches 4 bytes/param — just params + grads.