How large should your batch size be? Too small and you waste sequential steps. Too large and you waste compute. The gradient noise scale tells you exactly where the sweet spot is.
Understanding the two extremes helps us find the sweet spot in between
B « Bnoise: near-linear speedup, parallelism is essentially free
B ≈ Bnoise: noise and signal are balanced
B » Bnoise: extra computation is largely wasted
Small batches take noisy, zigzag paths. Large batches move smoothly toward the minimum.
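This zigzag-versus-smooth behavior is easy to see in a toy simulation (not from the article): gradient descent on a 2-D quadratic loss L(θ) = ½|θ|², where the batch gradient is the true gradient plus noise whose standard deviation shrinks as 1/√B.

```python
import numpy as np

def final_distance(batch_size, steps=500, lr=0.1, seed=0):
    """Average distance from the minimum over the last 100 steps."""
    rng = np.random.default_rng(seed)
    theta = np.array([5.0, 5.0])
    dists = []
    for t in range(steps):
        # True gradient of L(θ) = ½|θ|² is θ; averaging over a batch
        # of size B shrinks the noise standard deviation by 1/sqrt(B).
        grad = theta + rng.normal(0.0, 1.0 / np.sqrt(batch_size), size=2)
        theta = theta - lr * grad
        if t >= steps - 100:
            dists.append(np.linalg.norm(theta))
    return float(np.mean(dists))

print(final_distance(1))    # noisy path: hovers farther from the minimum
print(final_distance(256))  # near-deterministic path: settles much closer
```

The small-batch run never settles: noise keeps kicking it away from the minimum, while the large-batch run converges almost deterministically.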
From gradient estimation to the critical batch size formula
We approximate the true gradient G(θ) by averaging over a batch of B samples:

Gest(θ) = (1/B) Σᵢ ∇θ Lᵢ(θ)

The expected value equals the true gradient, and the covariance scales as 1/B:

E[Gest] = G,  Cov[Gest] = Σ / B

With a noisy gradient, the optimal learning rate depends on the batch size:

εopt(B) = εmax / (1 + Bnoise/B)

And the best possible loss improvement from one step scales the same way:

ΔLopt(B) = ΔLmax / (1 + Bnoise/B)

The critical batch size, where noise and signal balance, is:

Bnoise = tr(HΣ) / (Gᵀ H G)

where H is the Hessian of the loss.
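The εopt formula is short enough to transcribe directly. A minimal sketch, with hypothetical values for εmax and Bnoise chosen only for illustration:

```python
def eps_opt(B, B_noise, eps_max=1.0):
    """Optimal learning rate at batch size B: eps_max / (1 + B_noise / B)."""
    return eps_max / (1.0 + B_noise / B)

B_noise = 1024
for B in [64, 1024, 16384]:
    print(B, eps_opt(B, B_noise))
# At B = B_noise, the optimal step (and hence per-step progress)
# is exactly half its large-batch maximum of eps_max.
```

Note the asymptotics: for B « Bnoise the optimal step grows linearly in B, and for B » Bnoise it saturates at εmax.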
Figure: training speed, εopt(B) / εmax, as a function of B / Bnoise (log-log scale), with the turning point at B = Bnoise.
Every batch size choice places you on a Pareto frontier between training time and total compute
How total compute and training steps change as batch size increases
B « Bnoise: increasing batch size cuts training time with virtually no extra compute. Each step does more useful work because gradient variance is the bottleneck.
B ≈ Bnoise: you enter the tradeoff zone, where additional compute buys less time savings. This is the efficient operating point for most practitioners.
B » Bnoise: diminishing returns dominate. You spend 2× the compute for under 10% speedup, because the gradient estimate is already nearly exact.
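The tradeoff curve can be sketched from the per-step improvement formula: if each step makes 1/(1 + Bnoise/B) of the maximum progress, then training takes S(B) = Smin·(1 + Bnoise/B) steps and consumes E(B) = B·S(B) = Emin·(1 + B/Bnoise) total examples. The constants below (Smin, Emin, Bnoise) are illustrative values, not measurements.

```python
def steps(B, B_noise, S_min=1000):
    """Steps to reach a fixed loss: S_min * (1 + B_noise / B)."""
    return S_min * (1 + B_noise / B)

def examples(B, B_noise, E_min=1_000_000):
    """Total examples (compute) consumed: E_min * (1 + B / B_noise)."""
    return E_min * (1 + B / B_noise)

B_noise = 1024
for B in [128, 1024, 8192]:
    print(f"B={B:5d}  steps={steps(B, B_noise):8.0f}  "
          f"examples={examples(B, B_noise):10.0f}")
```

Reading off the numbers: going from B = 128 to B = 1024 cuts steps by 4.5× for only 1.8× the compute, while going from 1024 to 8192 cuts steps by just 1.8× at 4.5× the compute, which is the Pareto frontier described above.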
Bsimple = tr(Σ) / |G|² = (variance of gradients) / (squared mean gradient)

The noise scale is not static: it evolves as training progresses
Bnoise typically grows as the loss decreases, suggesting batch size should increase over time
Early in training, the gradient signal is strong (large |G|) because the model is far from optimal.
As training progresses and the loss decreases, the gradient magnitude shrinks while per-example variance stays relatively high.
The ratio tr(Σ)/|G|² therefore increases.
Start training with a smaller batch size (within the linear scaling regime). As training progresses and Bnoise grows, increase the batch size to maintain efficiency. This is exactly what adaptive batch size schedules do, and the noise scale tells you when.
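One way to act on this is a simple growth rule: whenever the current batch size falls well below the latest estimate of Bnoise, double it. The function name, the /2 threshold, and the rising estimates below are all hypothetical choices for illustration, not a schedule from the article.

```python
def next_batch_size(current_B, estimated_B_noise, growth=2, max_B=65536):
    """Double the batch size whenever it falls well below B_noise."""
    if current_B < estimated_B_noise / 2 and current_B * growth <= max_B:
        return current_B * growth
    return current_B

B = 256
for est in [400, 900, 2000, 5000, 5200]:  # Bnoise rising as the loss falls
    B = next_batch_size(B, est)
    print(est, B)  # batch size ratchets up: 256, 512, 1024, 2048, 4096
```

The threshold keeps B inside the linear-scaling regime (B « Bnoise), where the extra parallelism is nearly free.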
The gradient noise scale marks the transition from "free parallelism" to "wasted compute." At B = Bnoise, training speed is 50% of maximum.
Estimate Bnoise ≈ tr(Σ)/|G|² by computing gradient variance across mini-batches. No Hessian computation needed.
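Given a matrix of per-example gradients (rows = examples, columns = parameters), the Bsimple estimator is a few lines of NumPy. The synthetic gradients below are illustrative, not real model data: each row is a known true gradient plus Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)
true_G = np.array([2.0, -1.0, 0.5])
# 10,000 per-example gradients: true gradient plus noise with std 3
grads = true_G + rng.normal(0.0, 3.0, size=(10_000, 3))

G_hat = grads.mean(axis=0)                  # mean gradient, estimates G
tr_Sigma = grads.var(axis=0, ddof=1).sum()  # tr(Σ): summed per-coordinate variance
B_simple = tr_Sigma / np.dot(G_hat, G_hat)  # Bsimple = tr(Σ) / |G|²
print(B_simple)  # analytic value here is 3·9 / 5.25 ≈ 5.14
```

In practice you rarely have per-example gradients; the same ratio can be estimated from the gradient variance across mini-batches of two different sizes, with no Hessian needed, as noted above.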
As the model converges, gradient signal shrinks while noise stays high. Re-estimate periodically and increase batch size accordingly.
The noise scale predicts batch size efficiency for image classification, reinforcement learning, generative models, and language models.
Despite being derived for vanilla SGD, the noise scale accurately predicts behavior for momentum, Adam, RMSProp, and other optimizers.
Below Bnoise: near-perfect parallelism. Above Bnoise: pay in compute for diminishing time savings. Choose based on your budget.