GPU Workshop: Training Memory Deep Dive

Finding the Maximum Local Batch Size

How many sequences can one GPU process at once before it runs out of memory? The answer lives in understanding which memory buckets grow with batch size, and which don't.

At a glance:
- 4 memory buckets, only 1 of which scales with batch size
- Fixed cost: 16N bytes (params + grads + optimizer)
- Variable cost: activation memory, proportional to b × S × H

The Four Memory Buckets

During training, GPU memory is split into four distinct categories. Understanding what goes into each bucket is the key to finding your max batch size.

🧱 Parameters: 4N bytes (fixed)
📐 Gradients: 4N bytes (fixed)
⚙️ Optimizer: 8N bytes (fixed)
⚡ Activations: ~L × 34 × S × b × H × 2 bytes (scales with b!)
Total GPU memory: M_total = 16N + f(b, S, H, L)
fixed (doesn't change with batch size) + activations (grows linearly with b)
The central insight: Parameters, gradients, and optimizer states depend only on the number of model parameters N. They are the same whether you process 1 sequence or 100. Only activations grow when you increase batch size b. So finding the max batch size is really asking: "How much memory is left after the fixed costs, and how many sequences' worth of activations can I fit?"
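The four buckets can be sketched as a small helper using the article's symbols. This is a sketch, not a profiler: the activation expression (including the attention term used later in the derivation) is the same approximation the article uses.

```python
def memory_buckets(N, b, S, H, L, n_heads):
    """Training-memory estimate in bytes: fp32 model states, bf16 activations."""
    fixed = {
        "params": 4 * N,     # fp32 weights
        "grads": 4 * N,      # fp32 gradients
        "optimizer": 8 * N,  # Adam m + v, both fp32
    }
    # activations across all L layers, stored in bf16 (2 bytes each)
    activations = L * (34 * S * b * H + 5 * n_heads * S**2 * b) * 2
    return fixed, activations
```

Doubling b doubles only `activations`; `fixed` sums to 16N bytes no matter what.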

Why Do Only Activations Scale with Batch Size?

Let's look at each bucket and understand exactly what's inside and why it does or doesn't depend on how many sequences you feed the model.

Fixed Buckets: Parameters, Gradients, Optimizer

These all depend on the model's weight matrices, which have a fixed shape determined by H (hidden size), L (layers), V (vocab).

Example: one linear layer W of shape [4096, 4096] (fp32)
  Parameters (W itself):      4096 × 4096 × 4 bytes = 64 MB
  Gradients (dW, same shape): 4096 × 4096 × 4 bytes = 64 MB
  Optimizer (Adam m + v):     64 MB + 64 MB = 128 MB
None of these change whether b = 1 or b = 100; the shape is always [4096, 4096].
Why? The weight matrix W has shape [H, H], determined by the architecture, not the data. Whether you multiply [1, H] @ W or [100, H] @ W, W itself doesn't change.
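A quick check of the arithmetic above (fp32 is 4 bytes per element):

```python
# Verify the single-layer example: W is [4096, 4096] in fp32.
H = 4096
params_bytes = H * H * 4       # W itself
grads_bytes = params_bytes     # dW has the same shape as W
adam_bytes = 2 * params_bytes  # Adam keeps m and v, same shape again
print(params_bytes / 2**20, adam_bytes / 2**20)  # 64.0 128.0 (MiB)
```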

Variable Bucket: Activations

Activations are the intermediate outputs at every layer during the forward pass, saved for use in the backward pass.

Example: Y = X @ W, where X has shape [b, S, H] (bf16)
  b = 1: input stored = 1 × S × H × 2 bytes
  b = 4: input stored = 4 × S × H × 2 bytes
  b = 8: input stored = 8 × S × H × 2 bytes
Every activation tensor has b in its shape → memory grows linearly with b.
Why? For Y = XW, the backward pass needs X to compute dW = X^T dY. We must store X, which has shape [b, S, H]. More sequences = bigger X = more memory.
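The linear growth is just the shape of X. A tiny check, assuming S = H = 4096 for concreteness:

```python
# X is [b, S, H] in bf16 (2 bytes/elem); stored memory is linear in b.
S, H = 4096, 4096
x_bytes = {b: b * S * H * 2 for b in (1, 4, 8)}
print(x_bytes[1] / 2**20)  # 32.0 MiB per sequence at these shapes
```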

Interactive: Watch Memory Grow with Batch Size

Drag the slider to increase batch size. Watch how the fixed buckets stay the same while activations eat more and more GPU memory.

[Interactive slider: local batch size from 1 upward; stacked bars for Params, Gradients, Optimizer, Activations, and Free memory against an 80.0 GiB total]

Deriving Maximum Batch Size

We can compute the theoretical max batch size analytically; no trial-and-error needed.

Step-by-Step Derivation

Step 1: Total memory constraint: M_GPU ≥ M_fixed + M_activations(b)
Step 2: Fixed memory (depends only on N): M_fixed = 4N + 4N + 8N = 16N bytes
Step 3: Activation memory across all L layers (bf16): M_act = L × (34 × S × b × H + 5 × n_heads × S² × b) × 2 bytes
Step 4: Solve for b: b_max = ⌊(M_GPU − 16N) / (L × (34 × S × H + 5 × n_heads × S²) × 2)⌋
In plain English: Take your GPU's total memory, subtract the fixed cost (16N bytes for params + grads + optimizer), and divide by how much memory one sequence needs in activations across all layers. The result is how many sequences fit.
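Step 4 translates directly into code. A sketch: all sizes in bytes, bf16 activations, and the same per-layer approximation as Step 3; the model shapes in the usage note are hypothetical.

```python
import math

def b_max(M_gpu, N, L, S, H, n_heads):
    """Theoretical max local batch size from the derivation above."""
    fixed = 16 * N  # params + grads + optimizer, all fp32
    act_per_sample = L * (34 * S * H + 5 * n_heads * S**2) * 2  # bf16
    if M_gpu <= fixed:
        return 0  # model states alone don't fit on this GPU
    return math.floor((M_gpu - fixed) / act_per_sample)
```

For a hypothetical 1B-parameter model (L=16, S=2048, H=2048, 16 heads) on an 80 GiB GPU this gives b_max = 4.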

Worked Example: Llama 3.1 8B on A100 80GB

Component               Value             Memory
GPU memory              A100              80.0 GiB
Parameters (fp32)       8B × 4 bytes      29.8 GiB
Gradients (fp32)        8B × 4 bytes      29.8 GiB
Optimizer (Adam m+v)    8B × 8 bytes      59.6 GiB
Total fixed             16 × 8B bytes     119.2 GiB
The fixed cost alone already exceeds 80 GiB: in fp32 the model states don't fit before a single activation is stored.
This is why ZeRO/FSDP is essential! Even with mixed precision, you still need fp32 master weights for the optimizer update, so fixed memory remains 16N bytes. Mixed precision helps with activation memory (stored in bf16) and compute speed (tensor cores), but does NOT reduce fixed memory. To actually fit the model, you need ZeRO/FSDP to shard the model states across multiple GPUs, reducing the per-GPU fixed cost from 16N to 16N/D (where D = number of GPUs).
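The table's headline numbers check out with two lines of arithmetic:

```python
# 8B fp32 model states vs. an 80 GiB A100, then sharded with ZeRO-3 over 8 GPUs.
N = 8_000_000_000
GiB = 2**30
fixed_gib = 16 * N / GiB     # over budget on a single GPU
sharded_gib = fixed_gib / 8  # per-GPU fixed cost with D = 8
print(round(fixed_gib, 1), round(sharded_gib, 1))  # 119.2 14.9
```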

Batch Size Calculator

Enter your GPU specs and model config. Get the theoretical maximum local batch size computed in real time.

How this calculator works: It computes b_max = floor((M_GPU × 0.9 − M_fixed / D) / M_act_per_sample) where M_fixed = 16N bytes (params + gradients + optimizer), D = number of GPUs (with ZeRO-3, fixed memory is sharded), and M_act_per_sample depends on the model architecture. The 0.9 factor reserves 10% for CUDA overhead. Try changing the values below and watch the result update in real time!
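A minimal sketch of that formula, using the names from the text; M_act_per_sample must come from the architecture (e.g. the Step 3 expression), and the value in the example below is an arbitrary placeholder.

```python
import math

def max_local_batch(M_gpu, N, D, M_act_per_sample):
    """b_max with a 10% CUDA buffer and fixed memory sharded over D GPUs (bytes)."""
    usable = M_gpu * 0.9 - 16 * N / D
    return 0 if usable <= 0 else math.floor(usable / M_act_per_sample)
```

With an 8B model on one 80 GiB GPU this returns 0 (fixed memory doesn't fit); sharding over D = 8 makes it positive.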

[Interactive calculator. Inputs: GPU memory (GiB), number of GPUs (ZeRO-3 sharding), training precision, activation checkpointing, parameters N (billions), hidden size H, layers L, attention heads, sequence length S, plus quick presets. Outputs: maximum local batch size per GPU and a memory-usage breakdown of fixed model states, activations for b_max samples, and the 10% CUDA buffer.]

How to Fit a Larger Batch Size

If your max batch size is too small, here are your options; each reduces either the fixed cost or the per-sample activation cost.

🗜️

Mixed Precision (bf16)

Forward/backward pass uses bf16 (faster tensor core ops). Fixed memory stays at 16N (fp32 master weights are still needed), but activation memory is halved since intermediates are stored in bf16 (2 bytes) instead of fp32 (4 bytes).

Activation memory saved Activations: fp32 → bf16 = 50% less per sample
🔄

Activation Checkpointing

Don't store intermediate activations; recompute them during the backward pass. This directly reduces the per-sample activation cost, letting you fit more sequences.

Activation memory reduction Up to ~90% less activations
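Rough arithmetic behind that figure, reusing the per-layer constant 34 from earlier (attention term omitted for brevity): full checkpointing keeps only each layer's bf16 input and recomputes everything else.

```python
# Full checkpointing stores ~1×S×b×H per layer instead of ~34×S×b×H.
S, b, H, L = 4096, 4, 4096, 32
full_bytes = L * 34 * S * b * H * 2  # everything kept for backward
ckpt_bytes = L * 1 * S * b * H * 2   # only each layer's input kept
savings = 1 - ckpt_bytes / full_bytes
print(round(savings, 2))  # 0.97
```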
🔀

ZeRO / FSDP

Shard parameters, gradients, and optimizer states across multiple GPUs. Each GPU only stores 1/Ngpu of the fixed cost.

Fixed memory per GPU 16N → 16N / Ngpus

But I need a global batch size larger than b_max!

No problem. This is exactly what gradient accumulation is for. You process bmax sequences at a time, accumulate the gradients over multiple steps, and only update weights after reaching your desired global batch size.

Global batch size: B_global = b_local × N_gpus × grad_accum_steps
Example: If b_max = 4, you have 8 GPUs, and you want B_global = 256: set grad_accum_steps = 256 / (4 × 8) = 8.
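The same bookkeeping in code:

```python
# b_max = 4 per GPU, 8 GPUs, target global batch of 256.
b_local, n_gpus, B_global = 4, 8, 256
grad_accum_steps = B_global // (b_local * n_gpus)
print(grad_accum_steps)  # 8
```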

What Affects b_max Most?

See how the maximum batch size changes as you vary sequence length, model size, and GPU memory.

Maximum Batch Size vs. Sequence Length

For different model sizes (1B, 7B, 13B, 70B) on an A100-80GB with mixed precision + activation checkpointing.
[Interactive chart: b_max vs. sequence length for each model size]

Key Takeaways

1

Only activations scale with b

Parameters, gradients, and optimizer states are fixed by the model architecture. Activations are the only memory bucket that grows when you increase batch size.

2

b_max = (M_GPU − M_fixed) / M_act_per_sample

The max batch size is simply: leftover memory after fixed costs, divided by the per-sample activation cost. You can compute this analytically or find it empirically via OOM search.

3

Sequence length is the biggest lever

Activation memory has an S² term from attention. Doubling sequence length more than halves your max batch size. This is why long-context training is so memory-hungry.
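To see the S² effect, plug two sequence lengths into the per-sample activation cost from the derivation (hypothetical 7B-ish shapes: H = 4096, L = 32, 32 heads):

```python
def act_per_sample(S, H=4096, L=32, n_heads=32):
    # bf16 per-sample bytes: linear 34·S·H term + quadratic attention term
    return L * (34 * S * H + 5 * n_heads * S**2) * 2

ratio = act_per_sample(8192) / act_per_sample(4096)
print(round(ratio, 2))  # 3.65: doubling S nearly quadruples per-sample cost
```

Since per-sample cost grows by ~3.65× while free memory is unchanged, b_max drops to well under half.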

4

Three ways to increase b_max

Mixed precision (halve activation memory per sample), activation checkpointing (reduce per-sample cost by ~90%), and ZeRO/FSDP (shard the 16N fixed cost across GPUs).

5

Gradient accumulation bridges the gap

If b_max is smaller than your desired global batch size, gradient accumulation lets you simulate larger batches without needing more GPU memory.

6

Always leave a ~10% memory buffer

CUDA memory fragmentation + temporary buffers mean the practical b_max is slightly less than the theoretical maximum. Always leave headroom.