Autograd doesn't pre-compute gradients during the forward pass. It saves the ingredients each operation will need to compute gradients later. Here's exactly why.
"During the forward pass, autograd computes all the gradients and stores them with each Tensor. Then .backward() just topologically sorts these pre-computed gradients and chains them."
During forward, autograd builds a DAG of operations. For each op, it records a grad_fn — a recipe for computing the gradient — and saves the input activations the recipe will need. No gradient value is computed until .backward() is called.
"Autograd" suggests automatic gradient computation. It's natural to assume gradients are computed automatically during forward. In reality, it means the system automatically records what it needs during forward so it can defer gradient computation to backward. This is called reverse-mode automatic differentiation.
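You can see the deferral directly in PyTorch: after the forward pass, a recipe (`grad_fn`) exists but no gradient value does.

```python
import torch

x = torch.tensor([1.0, -2.0], requires_grad=True)
y = torch.relu(x * 3.0)

# Forward is done: a backward recipe was recorded, not a gradient value.
print(y.grad_fn)     # <ReluBackward0 object ...>, a recipe, not a number
print(x.grad)        # None: nothing has been computed yet

y.sum().backward()   # only now does the engine run the recorded recipes
print(x.grad)        # tensor([3., 0.])
```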
y = relu(x @ W + b) through autograd. At each operation, watch what gets created and what gets saved.
Creates a MmBackward node in the graph. Saves x and W because the gradient formulas need them.
Creates AddBackward. Addition's gradient is just 1 — nothing needs to be saved.
Creates ReluBackward. Saves z₂ (or the boolean mask z₂ > 0) because ReLU's gradient is 1 where positive, 0 elsewhere.
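The three steps above can be inspected after the forward pass: the graph nodes exist, chained together, while every `.grad` is still `None`.

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

y = torch.relu(x @ W + b)

# Forward built the chain ReluBackward -> AddBackward -> MmBackward.
print(type(y.grad_fn).__name__)          # ReluBackward0
add_fn = y.grad_fn.next_functions[0][0]
print(type(add_fn).__name__)             # AddBackward0

# No gradient values exist yet.
print(x.grad, W.grad, b.grad)            # None None None
```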
The engine toposorts the DAG and walks it in reverse. At each node, it runs the grad_fn using saved tensors + incoming upstream gradient to produce the gradient for that op.
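The engine's loop can be sketched in a few lines of pure Python. This is a toy model (a hypothetical `Node` class, not PyTorch internals): each node holds a backward recipe plus whatever forward saved for it, and the engine toposorts and walks the graph in reverse, accumulating gradients.

```python
class Node:
    def __init__(self, backward, parents, saved=None):
        self.backward = backward  # fn(upstream, saved) -> grads for parents
        self.parents = parents
        self.saved = saved        # what forward stashed for this recipe

def toposort(node, seen=None, order=None):
    seen = seen if seen is not None else set()
    order = order if order is not None else []
    if node in seen:
        return order
    seen.add(node)
    for p in node.parents:
        toposort(p, seen, order)
    order.append(node)
    return order

def run_backward(root):
    grads = {root: 1.0}
    for node in reversed(toposort(root)):        # walk the DAG in reverse
        upstream = grads[node]
        for parent, g in zip(node.parents, node.backward(upstream, node.saved)):
            grads[parent] = grads.get(parent, 0.0) + g
    return grads

# Graph for y = relu(x * w) with x = 3.0, w = 2.0 (so x * w = 6.0).
x_leaf = Node(lambda u, s: (), [])
w_leaf = Node(lambda u, s: (), [])
mul = Node(lambda u, s: (u * s[1], u * s[0]), [x_leaf, w_leaf], saved=(3.0, 2.0))
relu = Node(lambda u, s: (u * (1.0 if s > 0 else 0.0),), [mul], saved=6.0)

grads = run_backward(relu)
print(grads[x_leaf], grads[w_leaf])   # 2.0 3.0  (dy/dx = w, dy/dw = x)
```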
The orange terms must come from saved forward state. This is why activations are stored.
Need saved X and W.
Need saved X (or its boolean mask X > 0).
Need saved output Y (the probabilities).
Each input is the other's gradient coefficient.
Gradient is 1. No saved values needed — the rare exception.
Need saved input, mean, and std deviation.
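Two of the formulas above, written out in NumPy (a sketch; `dY` is the upstream gradient flowing in from the next layer):

```python
import numpy as np

def matmul_backward(dY, X, W):
    # Needs the saved inputs X and W.
    return dY @ W.T, X.T @ dY            # dX, dW

def relu_backward(dY, X):
    # Needs the saved input X; only its sign pattern matters.
    return dY * (X > 0)

X = np.array([[1.0, -2.0]])
W = np.array([[3.0], [4.0]])
dY = np.array([[1.0]])                   # upstream gradient, shape (1, 1)

dX, dW = matmul_backward(dY, X, W)
print(dX)   # [[3. 4.]]   = dY @ W.T
print(dW)   # [[ 1.] [-2.]]   = X.T @ dY
```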
For almost every operation, the backward formula needs two things: (1) the upstream gradient flowing from the next layer, and (2) some value saved during forward — an input, output, or intermediate. Autograd saves only what each operation specifically needs.
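You can watch exactly what gets saved with `torch.autograd.graph.saved_tensors_hooks`, which intercepts every tensor autograd stashes for backward (the exact set saved per op can vary across PyTorch versions, so treat the shapes below as illustrative):

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

packed = []
def pack(t):
    packed.append(tuple(t.shape))   # record each tensor autograd saves
    return t

with saved_tensors_hooks(pack, lambda t: t):
    x = torch.randn(2, 3, requires_grad=True)
    W = torch.randn(3, 4, requires_grad=True)
    y = torch.relu(x @ W)

# matmul saved its inputs; relu saved its (2, 4)-shaped result.
print(packed)
```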
Click each step. Notice how .grad stays None until backward.
Click a step to walk through the forward pass →
Anatomy of a tensor with requires_grad=True:
.data: the actual numeric values stored in memory.
.grad_fn: pointer to the backward function that created this tensor. The graph node.
saved tensors (held by the grad_fn): the activations saved during forward. This is where they live.
.grad: initially None. Only filled after .backward() runs.
.requires_grad: boolean flag that tells autograd to track operations on this tensor.
grad_fn.next_functions: DAG edges linking to parent grad_fns. This is what gets topologically sorted.
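Following `next_functions` from the output's `grad_fn` walks the DAG back to the leaves, which terminate in `AccumulateGrad` nodes (the point where `.grad` is eventually written):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 3).sum()

# Follow the first edge at each node until we run out of parents.
chain = []
node = y.grad_fn
while node is not None:
    chain.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(chain)   # ['SumBackward0', 'MulBackward0', 'AccumulateGrad']
```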
A common misconception is that gradients for all ops are calculated during forward. In reality: .grad_fn is registered during forward (a function pointer, not a computed value). .grad stays None until backward. The actual gradient values only exist after .backward() runs, because each gradient depends on the upstream gradient from the next layer — which doesn't exist yet during forward.
If gradients were pre-computed in forward, there'd be no memory problem. The fact that saved activations dominate GPU memory proves they're needed.
Activations dominate. For a 1B model at batch 32, they can be 10× the model size.
Only save at checkpoints. Re-run forward during backward. ~30% more compute, far less memory.
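Gradient checkpointing in PyTorch, sketched with `torch.utils.checkpoint` (`use_reentrant=False` is the recommended mode in recent versions):

```python
import torch
from torch.utils.checkpoint import checkpoint

W1 = torch.randn(16, 16, requires_grad=True)
W2 = torch.randn(16, 16, requires_grad=True)

def block(x):
    # Activations inside this block are NOT saved during forward.
    return torch.relu(x @ W1) @ W2

x = torch.randn(4, 16)
y = checkpoint(block, x, use_reentrant=False)

# During backward, `block`'s forward is re-run to regenerate the
# activations, trading extra compute for memory.
y.sum().backward()
print(W1.grad is not None)   # True
```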
If autograd pre-computed gradients during forward, there would be no memory problem — just store gradients (same size as parameters) and discard everything. The fact that GPU memory during training is overwhelmingly consumed by saved activations — and that gradient checkpointing exists to address this — proves gradients are not computed during forward.