How many sequences can one GPU process at once before it runs out of memory? The answer lies in understanding which memory buckets grow with batch size and which don't.
During training, GPU memory is split into four distinct categories. Understanding what goes into each bucket is the key to finding your max batch size.
Let's look at each bucket and understand exactly what's inside and why it does or doesn't depend on how many sequences you feed the model.
Parameters, gradients, and optimizer states all depend on the model's weight matrices, which have fixed shapes determined by H (hidden size), L (number of layers), and V (vocabulary size).
Each weight matrix W has a shape like [H, H], determined by the architecture, not the data. Whether you multiply [1, H] @ W or [100, H] @ W, W itself doesn't change.
Activations are the intermediate outputs at every layer during the forward pass, saved for use in the backward pass.
The weight gradient is dW = Xᵀ dY, so we must store the layer input X, which has shape [b, S, H]. More sequences = bigger X = more memory.
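To see the linear scaling concretely, here is a small sketch; the sequence length, hidden size, and precision are assumed values for illustration:

```python
# Memory to store one layer's input X of shape [b, S, H] for the backward pass.
# Assumed illustrative values: S=2048 tokens, H=4096 hidden size, fp32 (4 bytes).
S, H, BYTES = 2048, 4096, 4

def x_memory_gib(b: int) -> float:
    """GiB needed to keep X = [b, S, H] around until the backward pass."""
    return b * S * H * BYTES / 2**30

print(x_memory_gib(1))    # one sequence: 0.03125 GiB
print(x_memory_gib(100))  # 100 sequences: exactly 100x as much
```

And this is just one layer's input; the full activation cost repeats across all L layers.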
Drag the slider to increase batch size. Watch how the fixed buckets stay the same while activations eat more and more GPU memory.
We can compute the theoretical max batch size analytically, with no trial-and-error needed.
| Component | Value | Memory |
|---|---|---|
| GPU Memory | A100 | 80.0 GiB |
| Parameters (fp32) | 8B × 4 bytes | 29.8 GiB |
| Gradients (fp32) | 8B × 4 bytes | 29.8 GiB |
| Optimizer (Adam m+v) | 8B × 8 bytes | 59.6 GiB |
| Total Fixed | 16 bytes × 8B | 119.2 GiB |
The fixed cost alone already exceeds 80 GiB: in fp32, this model can't fit on the GPU even at b = 0.
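The fixed-cost arithmetic in the table can be checked in a few lines of Python:

```python
# Fixed memory for an 8B-parameter model trained in fp32 with Adam.
N = 8e9       # parameter count
GIB = 2**30   # bytes per GiB

params    = N * 4  # fp32 weights
grads     = N * 4  # fp32 gradients
optimizer = N * 8  # Adam first and second moments (m, v), fp32 each
fixed = params + grads + optimizer  # = 16N bytes total

print(f"params {params/GIB:.1f} GiB, grads {grads/GIB:.1f} GiB, "
      f"optimizer {optimizer/GIB:.1f} GiB")
print(f"total fixed: {fixed/GIB:.1f} GiB")  # 119.2 GiB, over an A100's 80 GiB
```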
In practice, you find the max batch size empirically: keep increasing b until the GPU crashes with an Out-Of-Memory error.
Configure your GPU and model, then watch the binary search find the maximum batch size.
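The search itself can be sketched as exponential doubling followed by binary search. Here `fits` is a hypothetical probe that runs one forward/backward step at batch size b and returns False on OOM; in PyTorch you might implement it by catching `torch.cuda.OutOfMemoryError` around a real training step:

```python
def find_max_batch_size(fits, hi: int = 1) -> int:
    """Largest b for which fits(b) is True, assuming fits is monotonic in b."""
    # Double until we find an upper bound that OOMs...
    while fits(hi):
        hi *= 2
    lo = hi // 2  # last size known to fit (0 if even b=1 OOMs)
    # ...then binary-search the boundary between lo (fits) and hi (OOMs).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# e.g. with a fake probe that tolerates up to 12 sequences:
print(find_max_batch_size(lambda b: b <= 12))  # 12
```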
Enter your GPU specs and model config. Get the theoretical maximum local batch size computed in real time.
b_max = floor((M_GPU × 0.9 − M_fixed / D) / M_act_per_sample), where M_fixed = 16N bytes (params + gradients + optimizer), D = number of GPUs (with ZeRO-3, fixed memory is sharded across them), and M_act_per_sample depends on the model architecture. The 0.9 factor reserves 10% for CUDA overhead. Try changing the values below and watch the result update in real time!
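A minimal Python sketch of this formula; the example numbers (activation cost per sample, GPU count) are assumptions for illustration:

```python
import math

def max_batch_size(m_gpu_gib: float, n_params: float,
                   m_act_per_sample_gib: float,
                   n_gpus: int = 1, usable: float = 0.9) -> int:
    """Theoretical max local batch size: leftover memory after fixed costs,
    divided by the per-sample activation cost. Fixed cost is 16 bytes per
    parameter (fp32 params + grads + Adam m, v), sharded under ZeRO-3."""
    m_fixed_gib = 16 * n_params / 2**30
    leftover = m_gpu_gib * usable - m_fixed_gib / n_gpus
    return max(0, math.floor(leftover / m_act_per_sample_gib))

# Assumed example: 80 GiB A100s, 8B params, 2 GiB of activations per sample.
print(max_batch_size(80, 8e9, 2.0, n_gpus=8))  # fits with ZeRO-3 sharding
print(max_batch_size(80, 8e9, 2.0, n_gpus=1))  # 0: fixed cost alone overflows
```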
If your max batch size is too small, here are your options; each reduces either the fixed cost or the per-sample activation cost.
Forward/backward pass uses bf16 (faster tensor core ops). Fixed memory stays at 16N (fp32 master weights are still needed), but activation memory is halved since intermediates are stored in bf16 (2 bytes) instead of fp32 (4 bytes).
Don't store intermediate activations; recompute them during the backward pass instead. This directly reduces the per-sample activation cost, letting you fit more sequences.
Shard parameters, gradients, and optimizer states across multiple GPUs. Each GPU only stores 1/Ngpu of the fixed cost.
No problem. This is exactly what gradient accumulation is for: you process b_max sequences at a time, accumulate the gradients over multiple steps, and only update the weights after reaching your desired global batch size.
For example, with a target global batch of 256 split across 4 GPUs, each running b_max = 8: grad_accum_steps = 256 / (4 × 8) = 8.
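The same arithmetic as a runnable sketch; the training-loop shape is shown in comments using a hypothetical PyTorch-style model/optimizer API:

```python
# Assumed setup: target global batch 256, 4 GPUs, local max batch size 8.
global_batch, n_gpus, b_max = 256, 4, 8
grad_accum_steps = global_batch // (n_gpus * b_max)
print(grad_accum_steps)  # 8 micro-batches per optimizer update

# Hypothetical loop shape (model, optimizer, batches are assumed names):
# for step, batch in enumerate(batches):
#     loss = model(batch) / grad_accum_steps  # scale so summed grads average
#     loss.backward()                         # grads accumulate into param.grad
#     if (step + 1) % grad_accum_steps == 0:
#         optimizer.step()
#         optimizer.zero_grad()
```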
See how the maximum batch size changes as you vary sequence length, model size, and GPU memory.
Parameters, gradients, and optimizer states are fixed by the model architecture. Activations are the only memory bucket that grows when you increase batch size.
The max batch size is simply: leftover memory after fixed costs, divided by the per-sample activation cost. You can compute this analytically or find it empirically via OOM search.
Activation memory has an S² term from attention. Doubling sequence length more than halves your max batch size. This is why long-context training is so memory-hungry.
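A toy model of per-sample activation memory makes the S² effect visible. The coefficients below are rough assumptions for illustration, not an exact accounting:

```python
# Assumed toy model: per layer, ~O(S*H) for projections/MLP intermediates
# plus an O(S^2) attention-score matrix per head, stored in bf16.
H, L, HEADS, BYTES = 4096, 32, 32, 2

def act_per_sample_gib(S: int) -> float:
    linear = 10 * S * H      # S*H-shaped intermediates (coefficient assumed)
    attn   = HEADS * S * S   # attention scores: the S^2 term
    return L * (linear + attn) * BYTES / 2**30

for S in (1024, 2048, 4096):
    print(S, round(act_per_sample_gib(S), 2))
# Each doubling of S more than doubles the cost, so b_max more than halves.
```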
Mixed precision (halve activation memory per sample), activation checkpointing (reduce per-sample cost by ~90%), and ZeRO/FSDP (shard the 16N fixed cost across GPUs).
If b_max is smaller than your desired global batch size, gradient accumulation lets you simulate larger batches without needing more GPU memory.
CUDA memory fragmentation and temporary buffers mean the practical b_max is slightly less than the theoretical maximum. Always leave headroom.