GPU Workshop: Training Memory Deep Dive

Finding the Maximum Local Batch Size

How many sequences can one GPU process at once before it runs out of memory? The answer lives in understanding which memory buckets grow with batch size, and which don't.

At a glance:
- 4 memory buckets, only 1 of which scales with batch size
- Fixed cost: 16N bytes (params + grads + optimizer)
- Variable cost: activation memory, proportional to b × S × H

The Four Memory Buckets

During training, GPU memory is split into four distinct categories. Understanding what goes into each bucket is the key to finding your max batch size.

🧱 Parameters: 4N bytes (fixed)
📐 Gradients: 4N bytes (fixed)
⚙️ Optimizer: 8N bytes (fixed)
⚡ Activations: ~L × 34 × S × b × H × 2 bytes (scales with b!)
Total GPU memory: M_total = 16N + f(b, S, H, L)
fixed (doesn't change with batch size) + activations (grows linearly with b)
The central insight: Parameters, gradients, and optimizer states depend only on the number of model parameters N. They are the same whether you process 1 sequence or 100. Only activations grow when you increase batch size b. So finding the max batch size is really asking: "How much memory is left after the fixed costs, and how many sequences' worth of activations can I fit?"
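The four buckets can be sketched as a small helper using the article's symbols. This is a sketch, not a profiler: the activation expression (including the attention term used later in the derivation) is the same approximation the article uses.

```python
def memory_buckets(N, b, S, H, L, n_heads):
    """Training-memory estimate in bytes: fp32 model states, bf16 activations."""
    fixed = {
        "params": 4 * N,     # fp32 weights
        "grads": 4 * N,      # fp32 gradients
        "optimizer": 8 * N,  # Adam m + v, both fp32
    }
    # activations across all L layers, stored in bf16 (2 bytes each)
    activations = L * (34 * S * b * H + 5 * n_heads * S**2 * b) * 2
    return fixed, activations
```

Doubling b doubles only `activations`; `fixed` sums to 16N bytes no matter what.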

Why Do Only Activations Scale with Batch Size?

Let's look at each bucket and understand exactly what's inside and why it does or doesn't depend on how many sequences you feed the model.

Fixed Buckets: Parameters, Gradients, Optimizer

These all depend on the model's weight matrices, which have a fixed shape determined by H (hidden size), L (layers), V (vocab).

Example: one linear layer W of shape [4096, 4096] (fp32)
  Parameters (W itself):      4096 × 4096 × 4 bytes = 64 MB
  Gradients (dW, same shape): 4096 × 4096 × 4 bytes = 64 MB
  Optimizer (Adam m + v):     64 MB + 64 MB = 128 MB
None of these change whether b = 1 or b = 100; the shape is always [4096, 4096].
Why? The weight matrix W has shape [H, H], determined by the architecture, not the data. Whether you multiply [1, H] @ W or [100, H] @ W, W itself doesn't change.
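A quick check of the arithmetic above (fp32 is 4 bytes per element):

```python
# Verify the single-layer example: W is [4096, 4096] in fp32.
H = 4096
params_bytes = H * H * 4       # W itself
grads_bytes = params_bytes     # dW has the same shape as W
adam_bytes = 2 * params_bytes  # Adam keeps m and v, same shape again
print(params_bytes / 2**20, adam_bytes / 2**20)  # 64.0 128.0 (MiB)
```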

Variable Bucket: Activations

Activations are the intermediate outputs at every layer during the forward pass, saved for use in the backward pass.

Example: Y = X @ W, where X has shape [b, S, H] (bf16)
  b = 1: input stored = 1 × S × H × 2 bytes
  b = 4: input stored = 4 × S × H × 2 bytes
  b = 8: input stored = 8 × S × H × 2 bytes
Every activation tensor has b in its shape → memory grows linearly with b.
Why? For Y = XW, the backward pass needs X to compute dW = X^T dY. We must store X, which has shape [b, S, H]. More sequences = bigger X = more memory.
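The linear growth is just the shape of X. A tiny check, assuming S = H = 4096 for concreteness:

```python
# X is [b, S, H] in bf16 (2 bytes/elem); stored memory is linear in b.
S, H = 4096, 4096
x_bytes = {b: b * S * H * 2 for b in (1, 4, 8)}
print(x_bytes[1] / 2**20)  # 32.0 MiB per sequence at these shapes
```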

Interactive: Watch Memory Grow with Batch Size

Drag the slider to increase batch size. Watch how the fixed buckets stay the same while activations eat more and more GPU memory.

[Interactive slider: local batch size from 1 upward; stacked bars for Params, Gradients, Optimizer, Activations, and Free memory against an 80.0 GiB total]

Deriving Maximum Batch Size

We can compute the theoretical max batch size analytically; no trial-and-error needed.

Step-by-Step Derivation

Step 1: Total memory constraint: M_GPU ≥ M_fixed + M_activations(b)
Step 2: Fixed memory (depends only on N): M_fixed = 4N + 4N + 8N = 16N bytes
Step 3: Activation memory across all L layers (bf16): M_act = L × (34 × S × b × H + 5 × n_heads × S² × b) × 2 bytes
Step 4: Solve for b: b_max = ⌊(M_GPU − 16N) / (L × (34 × S × H + 5 × n_heads × S²) × 2)⌋
In plain English: Take your GPU's total memory, subtract the fixed cost (16N bytes for params + grads + optimizer), and divide by how much memory one sequence needs in activations across all layers. The result is how many sequences fit.
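Step 4 translates directly into code. A sketch: all sizes in bytes, bf16 activations, and the same per-layer approximation as Step 3; the model shapes in the usage note are hypothetical.

```python
import math

def b_max(M_gpu, N, L, S, H, n_heads):
    """Theoretical max local batch size from the derivation above."""
    fixed = 16 * N  # params + grads + optimizer, all fp32
    act_per_sample = L * (34 * S * H + 5 * n_heads * S**2) * 2  # bf16
    if M_gpu <= fixed:
        return 0  # model states alone don't fit on this GPU
    return math.floor((M_gpu - fixed) / act_per_sample)
```

For a hypothetical 1B-parameter model (L=16, S=2048, H=2048, 16 heads) on an 80 GiB GPU this gives b_max = 4.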

Worked Example: Llama 3.1 8B on A100 80GB

Component               Value             Memory
GPU memory              A100              80.0 GiB
Parameters (fp32)       8B × 4 bytes      29.8 GiB
Gradients (fp32)        8B × 4 bytes      29.8 GiB
Optimizer (Adam m+v)    8B × 8 bytes      59.6 GiB
Total fixed             16 × 8B bytes     119.2 GiB
The fixed cost alone already exceeds 80 GiB: in fp32 the model states don't fit before a single activation is stored.
This is why ZeRO/FSDP is essential! Even with mixed precision, you still need fp32 master weights for the optimizer update, so fixed memory remains 16N bytes. Mixed precision helps with activation memory (stored in bf16) and compute speed (tensor cores), but does NOT reduce fixed memory. To actually fit the model, you need ZeRO/FSDP to shard the model states across multiple GPUs, reducing the per-GPU fixed cost from 16N to 16N/D (where D = number of GPUs).
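The table's headline numbers check out with two lines of arithmetic:

```python
# 8B fp32 model states vs. an 80 GiB A100, then sharded with ZeRO-3 over 8 GPUs.
N = 8_000_000_000
GiB = 2**30
fixed_gib = 16 * N / GiB     # over budget on a single GPU
sharded_gib = fixed_gib / 8  # per-GPU fixed cost with D = 8
print(round(fixed_gib, 1), round(sharded_gib, 1))  # 119.2 14.9
```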

Batch Size Calculator

Enter your GPU specs and model config. Get the theoretical maximum local batch size computed in real time.

How this calculator works: It computes b_max = floor((M_GPU × 0.9 − M_fixed / D) / M_act_per_sample) where M_fixed = 16N bytes (params + gradients + optimizer), D = number of GPUs (with ZeRO-3, fixed memory is sharded), and M_act_per_sample depends on the model architecture. The 0.9 factor reserves 10% for CUDA overhead. Try changing the values below and watch the result update in real time!
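A minimal sketch of that formula, using the names from the text; M_act_per_sample must come from the architecture (e.g. the Step 3 expression), and the value in the example below is an arbitrary placeholder.

```python
import math

def max_local_batch(M_gpu, N, D, M_act_per_sample):
    """b_max with a 10% CUDA buffer and fixed memory sharded over D GPUs (bytes)."""
    usable = M_gpu * 0.9 - 16 * N / D
    return 0 if usable <= 0 else math.floor(usable / M_act_per_sample)
```

With an 8B model on one 80 GiB GPU this returns 0 (fixed memory doesn't fit); sharding over D = 8 makes it positive.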

[Interactive calculator. Inputs: GPU memory (GiB), number of GPUs (ZeRO-3 sharding), training precision, activation checkpointing, parameters N (billions), hidden size H, layers L, attention heads, sequence length S, plus quick presets. Outputs: maximum local batch size per GPU and a memory-usage breakdown of fixed model states, activations for b_max samples, and the 10% CUDA buffer.]

How to Fit a Larger Batch Size

If your max batch size is too small, here are your options; each reduces either the fixed cost or the per-sample activation cost.

🗜️

Mixed Precision (bf16)

Forward/backward pass uses bf16 (faster tensor core ops). Fixed memory stays at 16N (fp32 master weights are still needed), but activation memory is halved since intermediates are stored in bf16 (2 bytes) instead of fp32 (4 bytes).

Activation memory saved Activations: fp32 → bf16 = 50% less per sample
🔄

Activation Checkpointing

Don't store intermediate activations; recompute them during the backward pass. This directly reduces the per-sample activation cost, letting you fit more sequences.

Activation memory reduction Up to ~90% less activations
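Rough arithmetic behind that figure, reusing the per-layer constant 34 from earlier (attention term omitted for brevity): full checkpointing keeps only each layer's bf16 input and recomputes everything else.

```python
# Full checkpointing stores ~1×S×b×H per layer instead of ~34×S×b×H.
S, b, H, L = 4096, 4, 4096, 32
full_bytes = L * 34 * S * b * H * 2  # everything kept for backward
ckpt_bytes = L * 1 * S * b * H * 2   # only each layer's input kept
savings = 1 - ckpt_bytes / full_bytes
print(round(savings, 2))  # 0.97
```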
🔀

ZeRO / FSDP

Shard parameters, gradients, and optimizer states across multiple GPUs. Each GPU only stores 1/Ngpu of the fixed cost.

Fixed memory per GPU 16N → 16N / Ngpus

But I need a global batch size larger than b_max!

No problem. This is exactly what gradient accumulation is for. You process bmax sequences at a time, accumulate the gradients over multiple steps, and only update weights after reaching your desired global batch size.

Global batch size: B_global = b_local × N_gpus × grad_accum_steps
Example: If b_max = 4, you have 8 GPUs, and you want B_global = 256: set grad_accum_steps = 256 / (4 × 8) = 8.
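The same bookkeeping in code:

```python
# b_max = 4 per GPU, 8 GPUs, target global batch of 256.
b_local, n_gpus, B_global = 4, 8, 256
grad_accum_steps = B_global // (b_local * n_gpus)
print(grad_accum_steps)  # 8
```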

What Affects b_max Most?

See how the maximum batch size changes as you vary sequence length, model size, and GPU memory.

Maximum Batch Size vs. Sequence Length

For different model sizes (1B, 7B, 13B, 70B) on an A100-80GB with mixed precision + activation checkpointing.
[Interactive chart: b_max vs. sequence length for each model size]

Key Takeaways

1

Only activations scale with b

Parameters, gradients, and optimizer states are fixed by the model architecture. Activations are the only memory bucket that grows when you increase batch size.

2

b_max = (M_GPU − M_fixed) / M_act_per_sample

The max batch size is simply: leftover memory after fixed costs, divided by the per-sample activation cost. You can compute this analytically or find it empirically via OOM search.

3

Sequence length is the biggest lever

Activation memory has an S² term from attention. Doubling sequence length more than halves your max batch size. This is why long-context training is so memory-hungry.
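To see the S² effect, plug two sequence lengths into the per-sample activation cost from the derivation (hypothetical 7B-ish shapes: H = 4096, L = 32, 32 heads):

```python
def act_per_sample(S, H=4096, L=32, n_heads=32):
    # bf16 per-sample bytes: linear 34·S·H term + quadratic attention term
    return L * (34 * S * H + 5 * n_heads * S**2) * 2

ratio = act_per_sample(8192) / act_per_sample(4096)
print(round(ratio, 2))  # 3.65: doubling S nearly quadruples per-sample cost
```

Since per-sample cost grows by ~3.65× while free memory is unchanged, b_max drops to well under half.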

4

Three ways to increase b_max

Mixed precision (halve activation memory per sample), activation checkpointing (reduce per-sample cost by ~90%), and ZeRO/FSDP (shard the 16N fixed cost across GPUs).

5

Gradient accumulation bridges the gap

If b_max is smaller than your desired global batch size, gradient accumulation lets you simulate larger batches without needing more GPU memory.

6

Always leave a ~10% memory buffer

CUDA memory fragmentation + temporary buffers mean the practical b_max is slightly less than the theoretical maximum. Always leave headroom.