GPU Workshop — Memory Optimization

Gradient Accumulation

Train with large effective batch sizes even when your GPU memory can't hold them. A visual, interactive guide to one of the most practical tricks in deep learning.


The Problem: GPU Memory Is Finite

As batch size grows, activation memory grows with it. At some point your GPU simply runs out of memory — the dreaded OOM error.

[Interactive chart: total GPU memory vs. batch size, against a 16 GB limit]

Model parameters and optimizer states use a fixed amount of memory regardless of batch size. Activations grow linearly with batch size; past a threshold, total memory exceeds the GPU limit and training fails with an OOM error.
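The chart above boils down to a toy memory model: a fixed cost plus a per-sample activation cost. A minimal sketch in Python; every byte figure below is an illustrative constant, not a measurement from any real model:

```python
# Toy memory model: fixed cost for weights/optimizer state plus an
# activation cost that grows linearly with batch size. All constants
# are illustrative.
GPU_LIMIT_GB = 16.0
FIXED_GB = 6.0                   # parameters + gradients + optimizer states
ACTIVATION_GB_PER_SAMPLE = 0.25  # activation memory per sample

def total_memory_gb(batch_size):
    return FIXED_GB + ACTIVATION_GB_PER_SAMPLE * batch_size

# Largest batch that still fits on the GPU under this model
max_batch = int((GPU_LIMIT_GB - FIXED_GB) / ACTIVATION_GB_PER_SAMPLE)
print(max_batch)                            # 40
print(total_memory_gb(64) > GPU_LIMIT_GB)   # True: batch 64 would OOM
```

Under these made-up constants, a batch of 64 needs 22 GB and OOMs, while a micro-batch of 16 needs only 10 GB: exactly the gap gradient accumulation closes.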

Core Idea: Split, Accumulate, Update

Instead of processing the full batch at once, split it into micro-batches. Process each one, accumulate gradients, then update once.


Full Training Loop

Watch the complete gradient accumulation training loop in action. Each micro-batch contributes gradients that build up before a single optimizer step.


The Mathematics

Gradient accumulation is mathematically equivalent to training with one large batch, provided the accumulated gradient is averaged: the key is dividing by N.

1. Standard training: g = (1/B) ∑ᵢ₌₁ᴮ ∇L(xᵢ, θ), the gradient averaged over the full batch of size B.
2. Accumulate: g_accum = ∑ₖ₌₁ᴺ ∇L(micro_batchₖ, θ), the sum of gradients from N micro-batches, where each micro-batch gradient is itself averaged over its B/N samples.
3. Average: g_final = g_accum / N. Dividing by N recovers the full-batch average; this is the key step!
4. Update: θ ← θ − α · g_final, a single optimizer step with the accumulated, averaged gradient.

effective_batch_size = micro_batch_size × accumulation_steps

The result is mathematically equivalent to training with the full effective batch.
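The equivalence in steps 1–4 can be checked numerically. A minimal sketch in NumPy (the linear model, MSE loss, and helper name `grad_mse` are our own illustrative choices, not from the workshop):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 64, 4, 8            # full batch, micro-batches, feature dim
mb = B // N                   # micro-batch size
X = rng.normal(size=(B, D))
y = rng.normal(size=B)
theta = rng.normal(size=D)

def grad_mse(Xb, yb, theta):
    # Gradient of mean((x·theta - y)^2), averaged over the given batch
    residual = Xb @ theta - yb
    return 2.0 * Xb.T @ residual / len(yb)

# Step 1: full-batch gradient, averaged over all B samples
g_full = grad_mse(X, y, theta)

# Step 2: sum the N micro-batch gradients (each an average over B/N samples)
g_accum = np.zeros(D)
for k in range(N):
    g_accum += grad_mse(X[k * mb:(k + 1) * mb], y[k * mb:(k + 1) * mb], theta)

# Step 3: divide by N to recover the full-batch average
g_final = g_accum / N

print(np.allclose(g_full, g_final))  # True
```

The two gradients agree to floating-point precision, which is exactly why a single optimizer step on g_final matches full-batch training.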

PyTorch Code Walkthrough

Step through the code line-by-line while watching the corresponding animation. Each highlighted line maps to a visual action.

train_with_accumulation.py
# Gradient Accumulation Training Loop
accumulation_steps = 4
optimizer.zero_grad()

for i, (data, target) in enumerate(dataloader):
    # Forward pass; scale the loss so the summed gradients
    # average to the full-batch gradient
    output = model(data)
    loss = criterion(output, target) / accumulation_steps

    # Backward pass (gradients accumulate)
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
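One subtlety: as written, the loop drops the gradients of a trailing partial group whenever the number of batches is not a multiple of accumulation_steps. The usual fix is to also step on the final iteration. A sketch simulated in plain Python (the counts are illustrative) so the control flow is easy to verify:

```python
# Sketch: flush leftover accumulated gradients at the end of the epoch.
# Counters stand in for the real PyTorch calls.
num_batches = 10        # illustrative: not divisible by accumulation_steps
accumulation_steps = 4

optimizer_steps = 0
accumulated = 0
for i in range(num_batches):
    accumulated += 1                         # stands in for loss.backward()
    is_last = (i + 1) == num_batches
    if (i + 1) % accumulation_steps == 0 or is_last:
        optimizer_steps += 1                 # stands in for optimizer.step()
        accumulated = 0                      # stands in for optimizer.zero_grad()

print(optimizer_steps)  # 3: steps after batches 4, 8, and 10
```

Without the `is_last` check, the gradients from batches 9 and 10 would be computed and then discarded.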

Memory Comparison

Same effective batch size, dramatically different memory usage. Gradient accumulation lets you trade time for memory.

[Interactive chart: memory footprint at effective batch size 64, full batch vs. accumulated micro-batches]

Trade-offs

Gradient accumulation is powerful but not free. Understand the advantages and limitations.

Advantages

Larger Effective Batches

Simulate batch sizes that would never fit in GPU memory. Essential for tasks like contrastive learning and large-model fine-tuning.

📊

Better Convergence

Larger batches produce more stable gradient estimates, reducing noise and often leading to smoother training curves.

💻

Works Everywhere

No special hardware or communication needed. Works on a single GPU, with DDP, or even on CPUs. Pure software trick.

🔧

Easy to Implement

Just a few extra lines in your training loop: divide the loss by N, and gate optimizer.step() and optimizer.zero_grad() behind an every-N-iterations check. That's it.

Limitations

Slower Wall-Clock Time

N micro-batches are processed sequentially instead of in one parallel pass, so each optimizer step takes roughly N× longer, and small micro-batches can underutilize the GPU.

📈

BatchNorm Mismatch

BatchNorm statistics are computed per micro-batch, not the full effective batch. Use GroupNorm or LayerNorm instead.
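One common mitigation, sketched below with standard torch.nn modules (the block structure and group count are illustrative choices of ours, not from the workshop): swap nn.BatchNorm2d for nn.GroupNorm, whose statistics are computed per sample and therefore do not depend on micro-batch size.

```python
import torch
import torch.nn as nn

# GroupNorm normalizes over channel groups within each sample, so its
# statistics are identical whether the micro-batch holds 4 or 64 samples.
def make_block(c_in, c_out, groups=8):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.GroupNorm(groups, c_out),  # instead of nn.BatchNorm2d(c_out)
        nn.ReLU(),
    )

block = make_block(3, 16)
out_small = block(torch.randn(4, 3, 8, 8))    # micro-batch
out_large = block(torch.randn(64, 3, 8, 8))   # full effective batch
print(out_small.shape, out_large.shape)
```

With BatchNorm2d here, the normalization statistics would differ between the two calls; with GroupNorm they are computed identically per sample.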

🎯

Stale Gradients

All N micro-batch gradients are computed against the same parameters: compared with stepping after every micro-batch, the model reacts to new information N× less often per epoch.

📐

Memory Not Free

Gradients themselves accumulate in memory (same size as parameters). You save on activations, not gradient storage.

Interactive Playground

Configure the training loop and watch it run in real time. Observe how accumulation steps affect the effective batch, memory, and loss.

[Interactive playground controls: micro-batch size 16, accumulation steps 4]

Effective Batch Size: 64 · Peak Memory: 4.2 GB · Optimizer Steps: 0 · Current Loss: 2.30
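The playground's headline numbers follow from simple arithmetic. A small sketch (batches_per_epoch is an assumed value not shown on the page, and the 4.2 GB figure is the page's own simulated output, not derived here):

```python
# Derive the playground's displayed quantities from its two inputs.
micro_batch_size = 16
accumulation_steps = 4
batches_per_epoch = 400   # illustrative assumption

effective_batch_size = micro_batch_size * accumulation_steps
optimizer_steps_per_epoch = batches_per_epoch // accumulation_steps

print(effective_batch_size)         # 64
print(optimizer_steps_per_epoch)    # 100
```

Raising accumulation_steps grows the effective batch proportionally while cutting the number of optimizer steps per epoch by the same factor.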