How large should your batch size be? Too small and you waste sequential steps. Too large and you waste compute. The gradient noise scale tells you exactly where the sweet spot is.
Understanding the two extremes helps us find the sweet spot in between
B « Bnoise: near-linear speedup, parallelism is essentially free
B ≈ Bnoise: noise and signal are balanced
B » Bnoise: extra computation is largely wasted
Small batches take noisy, zigzag paths. Large batches move smoothly toward the minimum.
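This zigzag-versus-smooth behavior is easy to see in a toy simulation (not from the article): gradient descent on a 2-D quadratic loss L(θ) = ½|θ|², where the batch gradient is the true gradient plus noise whose standard deviation shrinks as 1/√B.

```python
import numpy as np

def final_distance(batch_size, steps=500, lr=0.1, seed=0):
    """Average distance from the minimum over the last 100 steps."""
    rng = np.random.default_rng(seed)
    theta = np.array([5.0, 5.0])
    dists = []
    for t in range(steps):
        # True gradient of L(θ) = ½|θ|² is θ; averaging over a batch
        # of size B shrinks the noise standard deviation by 1/sqrt(B).
        grad = theta + rng.normal(0.0, 1.0 / np.sqrt(batch_size), size=2)
        theta = theta - lr * grad
        if t >= steps - 100:
            dists.append(np.linalg.norm(theta))
    return float(np.mean(dists))

print(final_distance(1))    # noisy path: hovers farther from the minimum
print(final_distance(256))  # near-deterministic path: settles much closer
```

The small-batch run never settles: noise keeps kicking it away from the minimum, while the large-batch run converges almost deterministically.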
From gradient estimation to the critical batch size formula
We approximate the true gradient G(θ) by averaging over a batch of B samples:

Gest(θ) = (1/B) Σᵢ ∇θ Lᵢ(θ)

The expected value equals the true gradient, and the covariance scales as 1/B:

E[Gest] = G,  Cov[Gest] = Σ / B

With a noisy gradient, the optimal learning rate depends on the batch size:

εopt(B) = εmax / (1 + Bnoise/B)

And the best possible loss improvement from one step scales the same way:

ΔLopt(B) = ΔLmax / (1 + Bnoise/B)

The critical batch size, where noise and signal balance, is:

Bnoise = tr(HΣ) / (Gᵀ H G)

where H is the Hessian of the loss.
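The εopt formula is short enough to transcribe directly. A minimal sketch, with hypothetical values for εmax and Bnoise chosen only for illustration:

```python
def eps_opt(B, B_noise, eps_max=1.0):
    """Optimal learning rate at batch size B: eps_max / (1 + B_noise / B)."""
    return eps_max / (1.0 + B_noise / B)

B_noise = 1024
for B in [64, 1024, 16384]:
    print(B, eps_opt(B, B_noise))
# At B = B_noise, the optimal step (and hence per-step progress)
# is exactly half its large-batch maximum of eps_max.
```

Note the asymptotics: for B « Bnoise the optimal step grows linearly in B, and for B » Bnoise it saturates at εmax.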
Figure: training speed, εopt(B) / εmax, as a function of B / Bnoise (log-log scale), with the turning point at B = Bnoise.
Every batch size choice places you on a Pareto frontier between training time and total compute
How total compute and training steps change as batch size increases
B « Bnoise: increasing batch size cuts training time with virtually no extra compute. Each step does more useful work because gradient variance is the bottleneck.
B ≈ Bnoise: you enter the tradeoff zone, where additional compute buys less time savings. This is the efficient operating point for most practitioners.
B » Bnoise: diminishing returns dominate. You spend 2× the compute for under 10% speedup, because the gradient estimate is already nearly exact.
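The tradeoff curve can be sketched from the per-step improvement formula: if each step makes 1/(1 + Bnoise/B) of the maximum progress, then training takes S(B) = Smin·(1 + Bnoise/B) steps and consumes E(B) = B·S(B) = Emin·(1 + B/Bnoise) total examples. The constants below (Smin, Emin, Bnoise) are illustrative values, not measurements.

```python
def steps(B, B_noise, S_min=1000):
    """Steps to reach a fixed loss: S_min * (1 + B_noise / B)."""
    return S_min * (1 + B_noise / B)

def examples(B, B_noise, E_min=1_000_000):
    """Total examples (compute) consumed: E_min * (1 + B / B_noise)."""
    return E_min * (1 + B / B_noise)

B_noise = 1024
for B in [128, 1024, 8192]:
    print(f"B={B:5d}  steps={steps(B, B_noise):8.0f}  "
          f"examples={examples(B, B_noise):10.0f}")
```

Reading off the numbers: going from B = 128 to B = 1024 cuts steps by 4.5× for only 1.8× the compute, while going from 1024 to 8192 cuts steps by just 1.8× at 4.5× the compute, which is the Pareto frontier described above.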
Bsimple = tr(Σ) / |G|² = (variance of gradients) / (squared mean gradient)

The noise scale is not static: it evolves as training progresses
Bnoise typically grows as the loss decreases, suggesting batch size should increase over time
Early in training, the gradient signal is strong (large |G|) because the model is far from optimal.
As training progresses and the loss decreases, the gradient magnitude shrinks while per-example variance stays relatively high.
The ratio tr(Σ)/|G|² therefore increases.
Start training with a smaller batch size (within the linear scaling regime). As training progresses and Bnoise grows, increase the batch size to maintain efficiency. This is exactly what adaptive batch size schedules do, and the noise scale tells you when.
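One way to act on this is a simple growth rule: whenever the current batch size falls well below the latest estimate of Bnoise, double it. The function name, the /2 threshold, and the rising estimates below are all hypothetical choices for illustration, not a schedule from the article.

```python
def next_batch_size(current_B, estimated_B_noise, growth=2, max_B=65536):
    """Double the batch size whenever it falls well below B_noise."""
    if current_B < estimated_B_noise / 2 and current_B * growth <= max_B:
        return current_B * growth
    return current_B

B = 256
for est in [400, 900, 2000, 5000, 5200]:  # Bnoise rising as the loss falls
    B = next_batch_size(B, est)
    print(est, B)  # batch size ratchets up: 256, 512, 1024, 2048, 4096
```

The threshold keeps B inside the linear-scaling regime (B « Bnoise), where the extra parallelism is nearly free.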
The gradient noise scale marks the transition from "free parallelism" to "wasted compute." At B = Bnoise, training speed is 50% of maximum.
Estimate Bnoise ≈ tr(Σ)/|G|² by computing gradient variance across mini-batches. No Hessian computation needed.
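Given a matrix of per-example gradients (rows = examples, columns = parameters), the Bsimple estimator is a few lines of NumPy. The synthetic gradients below are illustrative, not real model data: each row is a known true gradient plus Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)
true_G = np.array([2.0, -1.0, 0.5])
# 10,000 per-example gradients: true gradient plus noise with std 3
grads = true_G + rng.normal(0.0, 3.0, size=(10_000, 3))

G_hat = grads.mean(axis=0)                  # mean gradient, estimates G
tr_Sigma = grads.var(axis=0, ddof=1).sum()  # tr(Σ): summed per-coordinate variance
B_simple = tr_Sigma / np.dot(G_hat, G_hat)  # Bsimple = tr(Σ) / |G|²
print(B_simple)  # analytic value here is 3·9 / 5.25 ≈ 5.14
```

In practice you rarely have per-example gradients; the same ratio can be estimated from the gradient variance across mini-batches of two different sizes, with no Hessian needed, as noted above.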
As the model converges, gradient signal shrinks while noise stays high. Re-estimate periodically and increase batch size accordingly.
The noise scale predicts batch size efficiency for image classification, reinforcement learning, generative models, and language models.
Despite being derived for vanilla SGD, the noise scale accurately predicts behavior for momentum, Adam, RMSProp, and other optimizers.
Below Bnoise: near-perfect parallelism. Above Bnoise: pay in compute for diminishing time savings. Choose based on your budget.