ZeRO Optimization · Stage 1

ZeRO-1: A Concrete Walkthrough

Follow every number through a complete training step. See exactly how optimizer state partitioning saves memory with zero communication overhead.

260 total parameters · 37.5% memory saved (2 GPUs) · 0 extra communication · 4 steps per iteration

Our Tiny Transformer

A deliberately tiny model so every matrix fits on screen. Every number is real and traceable.

Dimensions

Hidden dim d = 4
Heads h = 2, head dim d_k = 2
FFN inner = 16 (4× expansion)
Vocab size = 8

Training Setup

Sequence length T = 3 tokens
GPUs: 2 (GPU-0 & GPU-1)
Optimizer: Adam
Mixed precision: BF16

Architecture Flow

LN1 → Attention → residual → LN2 → FFN → residual → W_vocab → softmax

All 260 Parameters — Itemized

Layer        | Name    | Shape   | # Elements
LayerNorm 1  | γ₁      | (4,)    | 4
             | β₁      | (4,)    | 4
Attention    | W_q     | (4, 4)  | 16
             | W_k     | (4, 4)  | 16
             | W_v     | (4, 4)  | 16
             | W_o     | (4, 4)  | 16
LayerNorm 2  | γ₂      | (4,)    | 4
             | β₂      | (4,)    | 4
FFN          | W₁      | (4, 16) | 64
             | b₁      | (16,)   | 16
             | W₂      | (16, 4) | 64
             | b₂      | (4,)    | 4
Output Head  | W_vocab | (4, 8)  | 32
TOTAL        |         |         | 260
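The total can be checked directly from the shapes (a quick standalone sketch, not part of any framework):

```python
# Recompute the 260-parameter total from the shapes in the table above.
shapes = {
    "gamma1": (4,), "beta1": (4,),
    "W_q": (4, 4), "W_k": (4, 4), "W_v": (4, 4), "W_o": (4, 4),
    "gamma2": (4,), "beta2": (4,),
    "W1": (4, 16), "b1": (16,), "W2": (16, 4), "b2": (4,),
    "W_vocab": (4, 8),
}

def numel(shape):
    n = 1
    for dim in shape:
        n *= dim
    return n

total = sum(numel(s) for s in shapes.values())
print(total)  # 260
```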

Concrete Parameter Values

W_q (Query weights, 4×4)

 0.12   0.34  -0.21   0.05
-0.15   0.22   0.11  -0.08
 0.30  -0.10   0.18   0.27
-0.05   0.14  -0.33   0.09

W_k (Key weights, 4×4)

 0.20  -0.11   0.07   0.15
 0.03   0.28  -0.14   0.10
-0.22   0.06   0.31  -0.05
 0.17  -0.09   0.02   0.24

W_v (Value weights, 4×4)

 0.08   0.19  -0.12   0.25
-0.07   0.33   0.04  -0.16
 0.21  -0.03   0.15   0.11
-0.14   0.08   0.26  -0.02

W_vocab (Output head, 4×8)

 0.11  -0.05   0.18   0.03  -0.14   0.22  -0.09   0.07
-0.06   0.13   0.02  -0.17   0.08   0.04   0.20  -0.11
 0.15  -0.02   0.09   0.21  -0.08   0.12  -0.03   0.16
-0.10   0.07  -0.13   0.05   0.19  -0.06   0.14   0.01

Why ZeRO-1 Matters

For each parameter, Adam stores 16 bytes. Most of that is redundantly copied across GPUs.

Memory per Parameter Element

What                      | Precision | Bytes/Element
Parameter (fwd/bwd)       | BF16      | 2
Gradient (after backward) | BF16      | 2
Master copy of parameter  | FP32      | 4
First moment m (Adam)     | FP32      | 4
Second moment v (Adam)    | FP32      | 4
TOTAL per element         |           | 16
Key insight: Optimizer states (master params + m + v) account for 12 out of 16 bytes — that's 75% of per-parameter memory. This is what ZeRO-1 targets.

Interactive Memory Comparison

[Interactive chart: per-GPU memory breakdown for GPU-0 and GPU-1: params (BF16), gradients (BF16), optimizer states (FP32), and memory saved.]

Memory per GPU without ZeRO: 260 × 16 = 4,160 bytes

Who Owns What

All 260 parameters are flattened into a single vector and split evenly between GPUs.

Flattened Parameter Vector (260 elements)

Hover over each segment to see which parameters it contains.

GPU-0 owns optimizer — indices 0–129
GPU-1 owns optimizer — indices 130–259
GPU-0 Optimizer Slice (130 elements)
γ₁, β₁: 8 elements
W_q: 16 elements
W_k: 16 elements
W_v: 16 elements
W_o: 16 elements
γ₂, β₂: 8 elements
W₁ (first 50 of 64 elements): 50 elements
Optimizer states stored: m[130], v[130], p[130] = 1,560 bytes
GPU-1 Optimizer Slice (130 elements)
W₁ (last 14 of 64 elements): 14 elements
b₁: 16 elements
W₂: 64 elements
b₂: 4 elements
W_vocab: 32 elements
Optimizer states stored: m[130], v[130], p[130] = 1,560 bytes
Both GPUs still store the full model. All 260 parameters in BF16 (520 bytes) and all 260 gradients in BF16 (520 bytes) are kept on every GPU — they're needed for forward and backward passes. Only the optimizer states (m, v, master params) are partitioned.
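The ownership split can be sketched in a few lines (names here are illustrative, not DeepSpeed's API):

```python
import numpy as np

# Illustrative sketch of ZeRO-1 ownership: flatten every parameter into one
# vector, then split it evenly; GPU i keeps optimizer state only for slice i.
N_GPUS = 2
TOTAL_PARAMS = 260

flat = np.arange(TOTAL_PARAMS)            # stand-in for the flattened parameters
slices = np.array_split(flat, N_GPUS)

print(len(slices[0]), len(slices[1]))     # 130 130
print(slices[0][0], slices[0][-1])        # 0 129   (GPU-0 owns indices 0..129)
print(slices[1][0], slices[1][-1])        # 130 259 (GPU-1 owns indices 130..259)

# Per-GPU optimizer bytes: m, v, and master params, each FP32 (4 bytes)
opt_bytes = 3 * 4 * len(slices[0])
print(opt_bytes)                          # 1560
```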

One Complete Training Step

Click through each phase to see exactly what happens on each GPU, with real numbers.

Step 1: Forward & Backward Pass (Local Compute)

Each GPU independently runs the full model on its own micro-batch. GPU-0 processes batch A, GPU-1 processes batch B.

Input (after embedding): Shape (3, 4) — 3 tokens, hidden dim 4
x = [0.5, -0.3, 0.8, 0.1; 0.2, 0.7, -0.1, 0.4; -0.6, 0.3, 0.5, -0.2]

Forward Pass Trace (GPU-0)

LayerNorm 1
μ = (0.5 - 0.3 + 0.8 + 0.1)/4 = 0.275
σ = √var = √0.1719 ≈ 0.415
x₁[0] = [0.543, -1.387, 1.266, -0.422]
Attention (Q computation)
Q = x₁ · W_q → (3,4)×(4,4) = (3,4)
Q[0] = x₁[0] · W_q
  = [0.674, -0.306, ...]
Split: Head 0 = Q[:,0:2], Head 1 = Q[:,2:4]
FFN
x_ffn = ReLU(x₂ · W₁ + b₁) · W₂ + b₂
  (3,4)×(4,16) = (3,16)
  (3,16)×(16,4) = (3,4)
Output + Loss
logits = x · W_vocab → (3,8)
z = softmax(logits) → (3,8)
Loss = CrossEntropy(z, targets) → scalar
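The LayerNorm arithmetic for the first token can be checked with NumPy, assuming γ = 1, β = 0 and the biased (divide-by-n) variance that LayerNorm uses:

```python
import numpy as np

# Recompute mu, sigma, and the normalized first token of the trace above.
x0 = np.array([0.5, -0.3, 0.8, 0.1])
mu = x0.mean()                  # (0.5 - 0.3 + 0.8 + 0.1) / 4 = 0.275
sigma = x0.std()                # sqrt(0.171875), about 0.415
x1 = (x0 - mu) / sigma          # normalized token (gamma=1, beta=0)

print(round(float(mu), 3))      # 0.275
print(np.round(x1, 3))          # [ 0.543 -1.387  1.266 -0.422]
```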

Backward Pass Results — Concrete Gradients for W_q

GPU-0 (micro-batch A)

g_q(A) = ∂Loss_A / ∂W_q
 0.023  -0.011   0.045  -0.008
-0.031   0.019  -0.007   0.014
 0.012  -0.028   0.033  -0.005
-0.016   0.009  -0.021   0.038

GPU-1 (micro-batch B)

g_q(B) = ∂Loss_B / ∂W_q
 0.017  -0.025   0.031  -0.013
-0.009   0.041  -0.018   0.006
 0.028  -0.014   0.022  -0.035
-0.020   0.016  -0.012   0.027
Both GPUs now hold 260 gradient values (one per parameter) in BF16. But they're from different micro-batches — we need to average them.

Step 2: Reduce-Scatter (Communicate Gradients)

Instead of all-reduce (giving everyone the full average), reduce-scatter gives each GPU only the averaged gradient for its assigned slice.

Reduce-Scatter Operation
GPU-0: has g(A)[0:260]; sends g(A)[130:260]; receives avg_g[0:130]
GPU-1: has g(B)[0:260]; sends g(B)[0:130]; receives avg_g[130:260]

Concrete: Averaged Gradient for W_q (GPU-0 receives this)

avg_g_q = (g_q(A) + g_q(B)) / 2
avg_g_q (GPU-0 only)
 0.020   -0.018    0.038   -0.0105
-0.020    0.030   -0.0125   0.010
 0.020   -0.021    0.0275  -0.020
-0.018    0.0125  -0.0165   0.0325
GPU-1 does NOT have avg_g_q! It only has the averaged gradients for indices 130–259. Similarly, GPU-0 does not have avg_g for W_vocab.
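Reduce-scatter is easy to simulate for two ranks; a minimal sketch (illustrative, not NCCL) using the first W_q row's gradients from above:

```python
import numpy as np

# Each rank starts with a full gradient vector from its own micro-batch;
# afterwards rank i holds only the AVERAGED gradient for its own slice.
local_grads = [
    np.array([0.023, -0.011, 0.045, -0.008]),   # GPU-0: first W_q row, batch A
    np.array([0.017, -0.025, 0.031, -0.013]),   # GPU-1: same elements, batch B
]

world = len(local_grads)
avg = sum(local_grads) / world          # the "reduce" (here: mean) part
received = np.array_split(avg, world)   # the "scatter" part: rank i keeps slice i

print(np.round(received[0], 4))         # GPU-0 gets [ 0.02  -0.018]
print(np.round(received[1], 4))         # GPU-1 gets [ 0.038 -0.0105]
```

The scatter at the end is what distinguishes this from all-reduce: no rank ever materializes the full averaged vector.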

Step 3: Adam Optimizer Step (Local, on Owned Slice Only)

Each GPU runs Adam on only its 130-element slice. Let's trace one element: W_q[0,0].

GPU-0 updates W_q[0,0] — Adam at t=1
Hyperparameters: lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8
Current param:
p = 0.12
Avg gradient:
g = 0.020
Update m:
m = 0.9 × 0.0 + 0.1 × 0.020 = 0.002
Update v:
v = 0.999 × 0.0 + 0.001 × 0.020² = 0.0000004
Bias-correct m̂:
m̂ = 0.002 / (1 - 0.9¹) = 0.002 / 0.1 = 0.02
Bias-correct v̂:
v̂ = 0.0000004 / (1 - 0.999¹) = 0.0000004 / 0.001 = 0.0004
Update param:
p_new = 0.12 - 0.001 × 0.02 / (√0.0004 + 1e-8)
= 0.12 - 0.001 × 0.02 / 0.02 = 0.12 - 0.001 = 0.119
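The same arithmetic as a runnable function, a plain re-implementation of Adam's update rule (not tied to any framework):

```python
import math

# One Adam update with lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
# starting from zero-initialized moments m and v.
def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Trace W_q[0,0] at t=1 exactly as above
p_new, m, v = adam_step(p=0.12, g=0.020, m=0.0, v=0.0, t=1)
print(round(p_new, 4))  # 0.119
```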
GPU-0 After Optimizer Step
Params [0:130]: UPDATED (new)
Params [130:260]: OLD values
m₀[130], v₀[130]: updated
GPU-1 After Optimizer Step
Params [0:130]: OLD values
Params [130:260]: UPDATED (new)
m₁[130], v₁[130]: updated
The model is now inconsistent! Each GPU has half new, half old parameters. We need one more communication step.

Step 4: All-Gather (Sync Updated Parameters)

Each GPU broadcasts its freshly updated parameter slice to all others.

All-Gather Operation
GPU-0: sends new_params[0:130]; receives new_params[130:260]; result: FULL updated model
GPU-1: sends new_params[130:260]; receives new_params[0:130]; result: FULL updated model

W_q After Update (Identical on Both GPUs)

W_q (updated)
 0.119   0.341  -0.211   0.051
-0.149   0.219   0.111  -0.081
 0.299  -0.099   0.179   0.271
-0.049   0.139  -0.329   0.089
Both GPUs now have the identical, fully updated model. Ready for the next training step!

Zero Extra Cost

ZeRO-1 doesn't add any communication — a standard all-reduce is already reduce-scatter + all-gather internally.

Standard All-Reduce (No ZeRO)

Send: 260 values × 2 bytes = 520 B
Recv: 260 values × 2 bytes = 520 B
Total: 1,040 bytes
Internally: reduce-scatter + all-gather (same two phases!)

ZeRO-1 (Our approach)

Reduce-Scatter: 260 × 2 = 520 B
All-Gather: 260 × 2 = 520 B
Total: 1,040 bytes
Identical! We just insert the optimizer step between the two phases.
The key realization:
All-Reduce = Reduce-Scatter + All-Gather
ZeRO-1 = Reduce-Scatter + Optimizer Step + All-Gather
Same bytes moved. Memory saved for free.
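The whole step can be simulated end-to-end on 2 ranks; a minimal sketch (illustrative, not DeepSpeed's API; the Adam helper hardcodes zero-initialized moments since this is step t=1) using the first six parameters and gradients from above:

```python
import numpy as np

# ZeRO-1 step: reduce-scatter the gradients, run Adam locally on the owned
# slice, then all-gather the updated slices so every rank has the full model.
def adam(p, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = (1 - b1) * g                     # m, v start at zero (t = 1)
    v = (1 - b2) * g * g
    return p - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)

world = 2
params = np.array([0.12, 0.34, -0.21, 0.05, -0.15, 0.22])           # first 6 of W_q
grads = [np.array([0.023, -0.011, 0.045, -0.008, -0.031, 0.019]),   # rank 0
         np.array([0.017, -0.025, 0.031, -0.013, -0.009, 0.041])]   # rank 1

avg = sum(grads) / world                         # reduce...
g_slices = np.array_split(avg, world)            # ...scatter
p_slices = np.array_split(params, world)

new_slices = [adam(p_slices[r], g_slices[r], t=1) for r in range(world)]
updated = np.concatenate(new_slices)             # all-gather

print(np.round(updated, 3))  # [ 0.119  0.341 -0.211  0.051 -0.149  0.219]
```

Note the optimizer step runs between the two collective phases; nothing else about the communication pattern changes.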

From Toy Model to 7B Parameters

Drag the slider to see how ZeRO-1 scales with GPU count.

Memory Calculator

Number of GPUs 8
Model Parameters (Billions) 7B

Per-GPU Memory Breakdown

[Interactive chart: stacked bar per GPU of params (BF16), gradients (BF16), and optimizer states (FP32).]

Reference: 7B Parameters on 8 GPUs

Component            | No ZeRO  | ZeRO-1 (8 GPUs)
Params (BF16)        | 14.0 GB  | 14.0 GB
Gradients (BF16)     | 14.0 GB  | 14.0 GB
Optimizer m (FP32)   | 28.0 GB  | 3.5 GB (÷8)
Optimizer v (FP32)   | 28.0 GB  | 3.5 GB (÷8)
Master params (FP32) | 28.0 GB  | 3.5 GB (÷8)
TOTAL per GPU        | 112.0 GB | 38.5 GB
Saving               |          | 65.6%

What Lives Where After One Step

The complete memory layout on each GPU after training step 1.

GPU-0
θ[260] in BF16 (FULL model, updated): 520 B
g[260] in BF16 (can be freed): 520 B
Optimizer (ONLY slice 0:130):
  m₀[130] in FP32 (first moments): 520 B
  v₀[130] in FP32 (second moments): 520 B
  p₀[130] in FP32 (master params): 520 B
TOTAL: 2,600 bytes

GPU-1
θ[260] in BF16 (FULL model, updated): 520 B
g[260] in BF16 (can be freed): 520 B
Optimizer (ONLY slice 130:260):
  m₁[130] in FP32 (first moments): 520 B
  v₁[130] in FP32 (second moments): 520 B
  p₁[130] in FP32 (master params): 520 B
TOTAL: 2,600 bytes
Compare to standard data-parallel (no ZeRO): 4,160 bytes per GPU.
ZeRO-1 saves 37.5% memory with our 2-GPU setup, and the saving grows with more GPUs.

The Key Insights

01

Optimizer States Dominate

Adam stores m, v, and master params in FP32 — that's 12 of the 16 bytes per parameter (75%). ZeRO-1 partitions exactly this.

02

Same Communication, Less Memory

All-reduce = reduce-scatter + all-gather. ZeRO-1 just inserts the optimizer step between the two phases. No extra bytes moved.

03

Scales with GPU Count

Memory per GPU: (4 + 12/N) bytes per param. With 8 GPUs: 5.5 B/param vs 16 B/param — a 65.6% saving. With 64 GPUs: 4.2 B/param — 73.8%.

04

Full Model on Every GPU

Unlike ZeRO-2/3, every GPU keeps the full parameters and gradients. This means no extra communication during forward/backward passes.

05

The Default Choice

ZeRO-1 is pure upside: the memory savings come with no extra communication volume, which makes it the natural first stage to enable in DeepSpeed when memory gets tight.

06

The Formula

Per GPU: full params (2B) + full grads (2B) + 1/N×optimizer (12B/N). As N→∞, approaches 4 bytes/param — just params + grads.