Batch Size Selection

The Gradient Noise Scale

How large should your batch size be? Too small and you waste sequential steps. Too large and you waste compute. The gradient noise scale tells you exactly where the sweet spot is.

At a glance:

  • Bnoise — the critical batch size
  • 50% — training speed at B = Bnoise
  • 1/B — how gradient variance scales with batch size
  • Domain-general — the same analysis works across tasks

Small Batch vs Large Batch

Understanding the two extremes helps us find the sweet spot in between

Small Batch

  • Gradient estimate has very high variance — mostly noise
  • Each update is nearly random; you must do many steps to make progress
  • These noisy steps could be aggregated in parallel for the same effect
  • Doubling B gives ~2× speedup with almost no extra compute

Large Batch

  • Gradient estimate is nearly exact — almost matches true gradient
  • Two random batches give nearly the same gradient
  • Doubling B barely improves the update — 2× compute for little gain
  • You're in the regime of diminishing returns
The Key Insight: The transition between "free speedup" and "wasted compute" occurs where the noise and signal of the gradient are balanced — where the variance of the gradient is at the same scale as the gradient itself. This is the gradient noise scale Bnoise.

  • ⚡ Perfect Scaling (B « Bnoise): ~linear speedup, free parallelism
  • ⚖ Sweet Spot (B ≈ Bnoise): balanced noise and signal
  • ⚠ Diminishing Returns (B » Bnoise): wasted computation

Watching SGD Navigate a Loss Landscape

Small batches take noisy, zigzag paths. Large batches move smoothly toward the minimum.

What you see: The contour lines show the loss surface (a 2D quadratic bowl). Small batches (red) take erratic, noisy steps that often overshoot. Large batches (blue) move in a nearly straight line toward the minimum. The optimal batch size is large enough to smooth the noise but not so large that you waste compute per step.
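The zigzag-vs-smooth behavior is easy to reproduce. Below is a toy sketch (not part of the original demo) that runs SGD on a 2D quadratic bowl with Gaussian per-example gradient noise; averaging a batch shrinks the noise by ~1/√B, so larger batches settle much closer to the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0])  # curvature of the 2D quadratic bowl L(θ) = ½·θᵀHθ

def run_sgd(batch_size, lr=0.05, steps=200, noise_std=3.0):
    """Run SGD with noisy gradients; return final distance from the minimum."""
    theta = np.array([5.0, 5.0])
    for _ in range(steps):
        true_grad = H @ theta
        # averaging batch_size noisy per-example gradients shrinks noise ~1/√B
        noise = rng.normal(0.0, noise_std, size=(batch_size, 2)).mean(axis=0)
        theta = theta - lr * (true_grad + noise)
    return float(np.linalg.norm(theta))  # minimum is at θ = 0

# average over several runs so the comparison isn't dominated by one seed
small = np.mean([run_sgd(batch_size=2) for _ in range(20)])
large = np.mean([run_sgd(batch_size=128) for _ in range(20)])
print(f"mean final distance, B=2:   {small:.3f}")
print(f"mean final distance, B=128: {large:.3f}")
```

The small-batch runs hover in a noise ball around the minimum; the large-batch runs converge almost deterministically.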

Formalizing the Noise Scale

From gradient estimation to the critical batch size formula

Step 1: The Gradient Estimate

We approximate the true gradient G(θ) by averaging over a batch of B samples:

Mini-batch Gradient: G_est = (1/B) · Σ_{i=1..B} ∇L_{x_i}(θ)

The expected value equals the true gradient, and the covariance scales as 1/B:

Covariance of Estimate: Cov(G_est) = (1/B) · Σ(θ), where Σ(θ) is the covariance matrix of per-example gradients
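A quick numerical check of the 1/B scaling. This is a hypothetical 1-D setup where each per-example gradient is the true gradient G plus zero-mean noise of variance σ²:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D setup: per-example gradient = true gradient G + noise.
G, sigma, n_trials = 2.0, 4.0, 20000

def batch_grad_var(B):
    """Empirical variance of the batch-mean gradient estimate at batch size B."""
    grads = G + rng.normal(0.0, sigma, size=(n_trials, B))
    return grads.mean(axis=1).var()

for B in (1, 4, 16, 64):
    print(f"B={B:3d}  Var(G_est) ≈ {batch_grad_var(B):.3f}   (σ²/B = {sigma**2 / B:.3f})")
```

Each fourfold increase in B cuts the measured variance by a factor of four, matching Cov(G_est) = Σ/B.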

Step 2: Optimal Step Size

With a noisy gradient, the optimal learning rate depends on the batch size:

Optimal Learning Rate: εopt(B) = εmax / (1 + Bnoise/B)

And the best possible loss improvement from one step:

Optimal Loss Improvement: ΔLopt(B) = ΔLmax / (1 + Bnoise/B)
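Both formulas share the same 1/(1 + Bnoise/B) shape. A minimal sketch, with an assumed Bnoise of 1000 purely for illustration:

```python
def speed_fraction(B, B_noise):
    """ε_opt(B)/ε_max = ΔL_opt(B)/ΔL_max = 1 / (1 + B_noise/B)."""
    return 1.0 / (1.0 + B_noise / B)

B_noise = 1000  # assumed noise scale, for illustration only
for B in (10, 100, 1000, 10000):
    print(f"B={B:5d}  speed = {speed_fraction(B, B_noise):.1%} of max")
```

Note that at B = B_noise the fraction is exactly 1/2, which is where the "50% of maximum speed" rule of thumb comes from.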

Step 3: The Gradient Noise Scale

The critical batch size where noise and signal balance:

Gradient Noise Scale: Bnoise = tr(HΣ) / (GᵀHG)
Numerator: tr(HΣ) measures the total gradient noise, weighted by the curvature (Hessian H) of the loss landscape
Denominator: GᵀHG measures the signal — how strongly the true gradient G aligns with directions of high curvature
Simplified Approximation: In practice, the noise scale can be approximated without computing the Hessian:
Bsimple = tr(Σ) / |G|²
This is just the ratio of gradient variance to gradient magnitude squared — measure the variance of per-example gradients and divide by the squared norm of the mean gradient.
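A sketch of this measurement, assuming you can collect per-example gradients into an (n_examples × n_params) array; the synthetic check uses a known mean gradient and noise level so the answer is known in advance:

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """B_simple = tr(Σ) / |G|² from a matrix of per-example gradients.

    per_example_grads: array of shape (n_examples, n_params).
    """
    G = per_example_grads.mean(axis=0)               # mean (estimated true) gradient
    tr_sigma = per_example_grads.var(axis=0).sum()   # tr(Σ): summed per-parameter variance
    return tr_sigma / (G @ G)

# Synthetic check: d parameters, noise std σ around a known mean gradient.
rng = np.random.default_rng(2)
d, sigma = 50, 2.0
G_true = np.ones(d) * 0.5
grads = G_true + rng.normal(0.0, sigma, size=(100_000, d))
est = simple_noise_scale(grads)
print(est)  # ≈ σ²·d / |G|² = 200 / 12.5 = 16 for this synthetic data
```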

The S-Curve of Batch Size Efficiency

Training speed as a function of batch size, with the turning point at Bnoise

Training Speed vs Batch Size

εopt(B) / εmax as a function of B / Bnoise (log-log scale)

Reading the curve: When B/Bnoise « 1, training speed scales linearly with batch size (perfect scaling). At B = Bnoise, you reach 50% of maximum training speed. Beyond Bnoise, additional compute yields diminishing returns.

Compute vs Wall-Clock Time

Every batch size choice places you on a Pareto frontier between training time and total compute

Compute-Time Tradeoff Curve

How total compute and training steps change as batch size increases

🟢 Small B Region

Increasing batch size cuts training time with virtually no extra compute. Each step does more useful work because variance is the bottleneck.

🟡 The Turning Point

Around B ≈ Bnoise, you enter the tradeoff zone. More compute buys less time savings. This is the optimal operating point for most practitioners.

🔴 Large B Region

Diminishing returns dominate. You're spending 2× compute for <10% speedup. The gradient estimate is already nearly perfect.
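These three regions fall out of the optimal-improvement formula above: reaching a fixed loss takes roughly (1 + Bnoise/B) times the minimum number of steps and (1 + B/Bnoise) times the minimum number of examples. A small sketch of that Pareto tradeoff (relative costs only; the absolute constants are normalized away):

```python
def tradeoff(B_over_Bnoise):
    """Relative cost of training to a fixed loss at batch size B.

    Follows from ΔL_opt(B): steps needed scale as (1 + B_noise/B),
    total examples processed scale as (1 + B/B_noise).
    """
    r = B_over_Bnoise
    steps = 1.0 + 1.0 / r   # wall-clock proxy, relative to the minimum
    examples = 1.0 + r      # compute proxy, relative to the minimum
    return steps, examples

for r in (0.1, 1.0, 10.0):
    s, e = tradeoff(r)
    print(f"B/Bnoise = {r:4.1f}  steps = {s:5.1f}× min   compute = {e:5.1f}× min")
```

At B = Bnoise you pay exactly 2× the minimum in both steps and compute, which is why it sits at the knee of the Pareto frontier.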

Find Your Optimal Batch Size

Enter your estimated noise scale and explore the efficiency of different batch sizes

Typical range: 10² – 10⁶ depending on model & task
Worked example: with an estimated Bnoise = 4096 and B = 256, B/Bnoise ≈ 0.06×. Training speed εopt/εmax = 1/(1 + Bnoise/B) ≈ 5.9% of maximum, while compute efficiency 1/(1 + B/Bnoise) ≈ 94%.
How to estimate Bnoise in practice:
  1. Run training with a small batch size for a few hundred steps
  2. Compute the gradient on many small batches — record per-example gradients
  3. Calculate: Bsimple = tr(Σ) / |G|² = (variance of gradients) / (squared mean gradient)
  4. The noise scale grows during training as the model approaches convergence, so re-estimate periodically
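Per-example gradients can be expensive to collect in some frameworks. An alternative sketch uses the identity E[|G_B|²] = |G|² + tr(Σ)/B to recover the noise scale from squared gradient norms measured at two batch sizes (the function name and the synthetic check are illustrative, not a fixed API):

```python
import numpy as np

def noise_scale_from_two_sizes(g_small_sq, g_big_sq, B_small, B_big):
    """Estimate B_simple = tr(Σ)/|G|² from mean squared gradient norms
    measured at two batch sizes, via E[|G_B|²] = |G|² + tr(Σ)/B."""
    G_sq = (B_big * g_big_sq - B_small * g_small_sq) / (B_big - B_small)
    tr_sigma = (g_small_sq - g_big_sq) / (1.0 / B_small - 1.0 / B_big)
    return tr_sigma / G_sq

# Synthetic check: true answer is tr(Σ)/|G|² = (4·50)/(0.25·50) = 16.
rng = np.random.default_rng(3)
d, sigma = 50, 2.0
G_true = np.ones(d) * 0.5

def mean_sq_norm(B, n=4000):
    """Average |G_B|² over n independent batches of size B."""
    g = G_true + rng.normal(0.0, sigma, size=(n, B, d)).mean(axis=1)
    return (g**2).sum(axis=1).mean()

est = noise_scale_from_two_sizes(mean_sq_norm(2), mean_sq_norm(32), 2, 32)
print(f"estimated noise scale ≈ {est:.2f}")  # ≈ 16 for this synthetic data
```

The subtraction removes the bias that noise adds to squared gradient norms, which a single batch size cannot disentangle on its own.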

How Bnoise Changes During Training

The noise scale is not static — it evolves as training progresses

Noise Scale Evolution During Training

Bnoise typically grows as the loss decreases, suggesting batch size should increase over time

📈 Why Bnoise Grows

Early in training, the gradient signal is strong (large |G|) because the model is far from optimal. As training progresses and the loss decreases, the gradient magnitude shrinks while per-example variance stays relatively high. The ratio tr(Σ)/|G|² therefore increases.

💡 The Practical Recipe

Start training with a smaller batch size (within the linear scaling regime). As training progresses and Bnoise grows, increase the batch size to maintain efficiency. This is exactly what adaptive batch size schedules do, and the noise scale tells you when.
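One possible shape for such a schedule; the doubling policy and thresholds here are purely illustrative, not a prescribed method:

```python
def next_batch_size(current_B, estimated_B_noise, growth=2, max_B=65536):
    """Grow the batch size whenever the measured noise scale has clearly
    outgrown it, keeping B inside the near-linear-scaling regime.
    (Hypothetical policy; the growth factor and cap are illustrative.)"""
    if estimated_B_noise > growth * current_B and current_B * growth <= max_B:
        return current_B * growth
    return current_B

# Example: as the estimated noise scale grows over training, B follows it up.
B = 64
for B_noise in (100, 200, 400, 800, 1600):
    B = next_batch_size(B, B_noise)
    print(f"B_noise = {B_noise:5d} -> batch size {B}")
```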

Key Takeaways

01

Bnoise Is the Critical Batch Size

The gradient noise scale marks the transition from "free parallelism" to "wasted compute." At B = Bnoise, training speed is 50% of maximum.

02

It's Easy to Measure

Estimate Bnoise ≈ tr(Σ)/|G|² by computing gradient variance across mini-batches. No Hessian computation needed.

03

Bnoise Grows During Training

As the model converges, gradient signal shrinks while noise stays high. Re-estimate periodically and increase batch size accordingly.

04

Works Across Domains

The noise scale predicts batch size efficiency for image classification, reinforcement learning, generative models, and language models.

05

Optimizer-Agnostic

Despite being derived for vanilla SGD, the noise scale accurately predicts behavior for momentum, Adam, RMSProp, and other optimizers.

06

Compute vs Time Tradeoff

Below Bnoise: near-perfect parallelism. Above Bnoise: pay in compute for diminishing time savings. Choose based on your budget.