Autograd doesn't pre-compute gradients during the forward pass. It saves the ingredients each operation will need to compute gradients later. Here's exactly why.
"During the forward pass, autograd computes all the gradients and stores them with each Tensor. Then .backward() just topologically sorts these pre-computed gradients and chains them."
During forward, autograd builds a DAG of operations. For each op, it records a grad_fn — a recipe for computing the gradient — and saves the input activations the recipe will need. No gradient value is computed until .backward() is called.
"Autograd" suggests automatic gradient computation. It's natural to assume gradients are computed automatically during forward. In reality, it means the system automatically records what it needs during forward so it can defer gradient computation to backward. This is called reverse-mode automatic differentiation.
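You can see the deferral directly in PyTorch: after the forward pass, a recipe (`grad_fn`) exists but no gradient value does.

```python
import torch

x = torch.tensor([1.0, -2.0], requires_grad=True)
y = torch.relu(x * 3.0)

# Forward is done: a backward recipe was recorded, not a gradient value.
print(y.grad_fn)     # <ReluBackward0 object ...>, a recipe, not a number
print(x.grad)        # None: nothing has been computed yet

y.sum().backward()   # only now does the engine run the recorded recipes
print(x.grad)        # tensor([3., 0.])
```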
y = relu(x @ W + b) through autograd. At each operation, watch what gets created and what gets saved.
Creates a MmBackward node in the graph. Saves x and W because the gradient formulas need them.
Creates AddBackward. Addition's gradient is just 1 — nothing needs to be saved.
Creates ReluBackward. Saves z₂ (or the boolean mask z₂ > 0) because ReLU's gradient is 1 where positive, 0 elsewhere.
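The three steps above can be inspected after the forward pass: the graph nodes exist, chained together, while every `.grad` is still `None`.

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

y = torch.relu(x @ W + b)

# Forward built the chain ReluBackward -> AddBackward -> MmBackward.
print(type(y.grad_fn).__name__)          # ReluBackward0
add_fn = y.grad_fn.next_functions[0][0]
print(type(add_fn).__name__)             # AddBackward0

# No gradient values exist yet.
print(x.grad, W.grad, b.grad)            # None None None
```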
The engine toposorts the DAG and walks it in reverse. At each node, it runs the grad_fn using saved tensors + incoming upstream gradient to produce the gradient for that op.
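The engine's loop can be sketched in a few lines of pure Python. This is a toy model (a hypothetical `Node` class, not PyTorch internals): each node holds a backward recipe plus whatever forward saved for it, and the engine toposorts and walks the graph in reverse, accumulating gradients.

```python
class Node:
    def __init__(self, backward, parents, saved=None):
        self.backward = backward  # fn(upstream, saved) -> grads for parents
        self.parents = parents
        self.saved = saved        # what forward stashed for this recipe

def toposort(node, seen=None, order=None):
    seen = seen if seen is not None else set()
    order = order if order is not None else []
    if node in seen:
        return order
    seen.add(node)
    for p in node.parents:
        toposort(p, seen, order)
    order.append(node)
    return order

def run_backward(root):
    grads = {root: 1.0}
    for node in reversed(toposort(root)):        # walk the DAG in reverse
        upstream = grads[node]
        for parent, g in zip(node.parents, node.backward(upstream, node.saved)):
            grads[parent] = grads.get(parent, 0.0) + g
    return grads

# Graph for y = relu(x * w) with x = 3.0, w = 2.0 (so x * w = 6.0).
x_leaf = Node(lambda u, s: (), [])
w_leaf = Node(lambda u, s: (), [])
mul = Node(lambda u, s: (u * s[1], u * s[0]), [x_leaf, w_leaf], saved=(3.0, 2.0))
relu = Node(lambda u, s: (u * (1.0 if s > 0 else 0.0),), [mul], saved=6.0)

grads = run_backward(relu)
print(grads[x_leaf], grads[w_leaf])   # 2.0 3.0  (dy/dx = w, dy/dw = x)
```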
The orange terms must come from saved forward state. This is why activations are stored.
Need saved X and W.
Need saved X (or its boolean mask X > 0).
Need saved output Y (the probabilities).
Each input is the other's gradient coefficient.
Gradient is 1. No saved values needed — the rare exception.
Need saved input, mean, and std deviation.
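Two of the formulas above, written out in NumPy (a sketch; `dY` is the upstream gradient flowing in from the next layer):

```python
import numpy as np

def matmul_backward(dY, X, W):
    # Needs the saved inputs X and W.
    return dY @ W.T, X.T @ dY            # dX, dW

def relu_backward(dY, X):
    # Needs the saved input X; only its sign pattern matters.
    return dY * (X > 0)

X = np.array([[1.0, -2.0]])
W = np.array([[3.0], [4.0]])
dY = np.array([[1.0]])                   # upstream gradient, shape (1, 1)

dX, dW = matmul_backward(dY, X, W)
print(dX)   # [[3. 4.]]   = dY @ W.T
print(dW)   # [[ 1.] [-2.]]   = X.T @ dY
```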
For almost every operation, the backward formula needs two things: (1) the upstream gradient flowing from the next layer, and (2) some value saved during forward — an input, output, or intermediate. Autograd saves only what each operation specifically needs.
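You can watch exactly what gets saved with `torch.autograd.graph.saved_tensors_hooks`, which intercepts every tensor autograd stashes for backward (the exact set saved per op can vary across PyTorch versions, so treat the shapes below as illustrative):

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

packed = []
def pack(t):
    packed.append(tuple(t.shape))   # record each tensor autograd saves
    return t

with saved_tensors_hooks(pack, lambda t: t):
    x = torch.randn(2, 3, requires_grad=True)
    W = torch.randn(3, 4, requires_grad=True)
    y = torch.relu(x @ W)

# matmul saved its inputs; relu saved its (2, 4)-shaped result.
print(packed)
```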
Click each step. Notice how .grad stays None until backward.
Click a step to walk through the forward pass →
Anatomy of a tensor with requires_grad=True:
.data: the actual numeric values stored in memory.
.grad_fn: pointer to the backward function that created this tensor. The graph node.
saved tensors (held by the grad_fn): the activations saved during forward. This is where they live.
.grad: initially None. Only filled after .backward() runs.
.requires_grad: boolean flag that tells autograd to track operations on this tensor.
grad_fn.next_functions: DAG edges linking to parent grad_fns. This is what gets topologically sorted.
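Following `next_functions` from the output's `grad_fn` walks the DAG back to the leaves, which terminate in `AccumulateGrad` nodes (the point where `.grad` is eventually written):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 3).sum()

# Follow the first edge at each node until we run out of parents.
chain = []
node = y.grad_fn
while node is not None:
    chain.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(chain)   # ['SumBackward0', 'MulBackward0', 'AccumulateGrad']
```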
A common misconception is that gradients for all ops are calculated during forward. In reality: .grad_fn is registered during forward (a function pointer, not a computed value). .grad stays None until backward. The actual gradient values only exist after .backward() runs, because each gradient depends on the upstream gradient from the next layer — which doesn't exist yet during forward.
If gradients were pre-computed in forward, there'd be no memory problem. The fact that saved activations dominate GPU memory proves they're needed.
Activations dominate. For a 1B model at batch 32, they can be 10× the model size.
Only save at checkpoints. Re-run forward during backward. ~30% more compute, far less memory.
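Gradient checkpointing in PyTorch, sketched with `torch.utils.checkpoint` (`use_reentrant=False` is the recommended mode in recent versions):

```python
import torch
from torch.utils.checkpoint import checkpoint

W1 = torch.randn(16, 16, requires_grad=True)
W2 = torch.randn(16, 16, requires_grad=True)

def block(x):
    # Activations inside this block are NOT saved during forward.
    return torch.relu(x @ W1) @ W2

x = torch.randn(4, 16)
y = checkpoint(block, x, use_reentrant=False)

# During backward, `block`'s forward is re-run to regenerate the
# activations, trading extra compute for memory.
y.sum().backward()
print(W1.grad is not None)   # True
```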
If autograd pre-computed gradients during forward, there would be no memory problem — just store gradients (same size as parameters) and discard everything. The fact that GPU memory during training is overwhelmingly consumed by saved activations — and that gradient checkpointing exists to address this — proves gradients are not computed during forward.