Gradient Accumulation
Train with large effective batch sizes even when your GPU memory can't hold them. A visual, interactive guide to one of the most practical tricks in deep learning.
As batch size grows, activation memory grows with it. At some point your GPU simply runs out of memory — the dreaded OOM error.
Instead of processing the full batch at once, split it into micro-batches. Process each one, accumulate gradients, then update once.
Watch the complete gradient accumulation training loop in action. Each micro-batch contributes gradients that build up before a single optimizer step.
Gradient accumulation is mathematically equivalent to training with one large batch, provided the loss is a mean over the batch and the model has no batch-dependent layers. The key is dividing each micro-batch loss by N, so the summed gradients equal the full-batch average.
Step through the code line-by-line while watching the corresponding animation. Each highlighted line maps to a visual action.
# Gradient Accumulation Training Loop
accumulation_steps = 4
optimizer.zero_grad()
for i, (data, target) in enumerate(dataloader):
    # Forward pass
    output = model(data)
    # Scale the loss so accumulated gradients average over the effective batch
    loss = criterion(output, target) / accumulation_steps
    # Backward pass (gradients accumulate)
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
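The equivalence claimed above can be checked numerically without any framework. A minimal sketch (illustrative data, a hand-derived least-squares gradient standing in for autograd) showing that accumulating gradients of loss / N over N equal-sized micro-batches reproduces the full-batch gradient:

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 w.r.t. scalar w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]

# One pass over the full batch
g_full = grad_mse(w, full_batch)

# Two micro-batches; each contributes grad(loss) / N, with N = 2
micro_batches = [full_batch[:2], full_batch[2:]]
N = len(micro_batches)
g_accum = sum(grad_mse(w, mb) / N for mb in micro_batches)

assert abs(g_full - g_accum) < 1e-12
print(g_full, g_accum)  # → -3.25 -3.25
```

Note the equal micro-batch sizes: with a ragged final micro-batch, dividing by N is only an approximation of the full-batch mean.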
Same effective batch size, dramatically different memory usage. Gradient accumulation lets you trade time for memory.
Gradient accumulation is powerful but not free. Understand the advantages and limitations.
Simulate batch sizes that would never fit in GPU memory. Essential for tasks like contrastive learning and large-model fine-tuning.
Larger batches produce more stable gradient estimates, reducing noise and often leading to smoother training curves.
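The noise-reduction claim follows from the variance of a sample mean scaling as 1/B. A quick simulation (illustrative, seeded for reproducibility; noisy per-sample "gradients" drawn from a unit Gaussian):

```python
import random
import statistics

random.seed(0)

def batch_mean_std(batch_size, trials=2000):
    """Std of the mean of `batch_size` noisy per-sample gradient estimates."""
    means = [
        statistics.mean(random.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

small = batch_mean_std(4)
large = batch_mean_std(64)
print(small, large)  # the batch-64 estimate is markedly less noisy (~1/4 the std)
```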
No special hardware or communication needed. Works on a single GPU, with DDP, or even on CPUs. Pure software trick.
Just 3 extra lines in your training loop. Divide loss, skip zero_grad, step every N iterations. That's it.
N micro-batches are processed sequentially instead of one big parallel batch, so each optimizer step takes ~N forward/backward passes. Compute per epoch is unchanged; you just take fewer, slower steps.
BatchNorm statistics are computed per micro-batch, not the full effective batch. Use GroupNorm or LayerNorm instead.
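The mismatch is easy to see with plain numbers. A small sketch (illustrative values, no framework) comparing the statistics BatchNorm would compute over the full effective batch with what it actually sees per micro-batch:

```python
def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

full = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
micro_a, micro_b = full[:4], full[4:]

print(mean(full), var(full))        # stats over the full effective batch
print(mean(micro_a), var(micro_a))  # what BatchNorm sees on micro-batch 1
print(mean(micro_b), var(micro_b))  # ...and on micro-batch 2
```

Each micro-batch is normalized by its own mean and variance, which can differ wildly from the full-batch statistics, so the normalized activations are not what large-batch training would produce.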
Compared with taking N separate optimizer steps, the parameters stay frozen for the whole cycle: micro-batch N's gradients are computed on weights that never incorporated updates from micro-batches 1 through N−1.
Gradients themselves accumulate in memory (same size as parameters). You save on activations, not gradient storage.
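A back-of-the-envelope sketch of where the savings actually come from. All sizes here are illustrative assumptions (fp32 weights, a hypothetical 100M-parameter model, optimizer state ignored); only the activation term depends on the batch size resident on the GPU:

```python
PARAM_BYTES = 4                     # fp32
params = 100_000_000                # illustrative 100M-parameter model
act_bytes_per_sample = 50_000_000   # illustrative activation footprint per sample

def memory_gb(batch_on_gpu):
    weights = params * PARAM_BYTES
    grads = params * PARAM_BYTES    # same size as parameters, always resident
    activations = batch_on_gpu * act_bytes_per_sample
    return (weights + grads + activations) / 1e9

print(memory_gb(64))  # full batch of 64 at once    → 4.0 GB
print(memory_gb(8))   # micro-batch of 8, ×8 steps  → 1.2 GB
```

The weight and gradient terms are fixed; accumulation only shrinks the activation term, which is exactly why it helps for activation-heavy workloads.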
Configure the training loop and watch it run in real time. Observe how accumulation steps affect the effective batch, memory, and loss.