Why does PyTorch DDP train faster than naive data parallelism? The secret is overlapping communication with computation. Let's see exactly how it works, step by step.
In data parallelism, each GPU processes a different shard of the batch but must end up with the same updated weights. This requires an AllReduce over every gradient.
Each GPU holds a full copy of the model. The training batch is split across GPUs — each computes forward + backward on its local shard. But before the optimizer can step, all GPUs must agree on the same gradients.
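To make "agree on the same gradients" concrete, here is a tiny pure-Python sketch of what an averaging AllReduce does (the helper name is made up for illustration; real training calls dist.all_reduce on CUDA tensors):

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients elementwise across ranks. After the call,
    every rank holds the same averaged gradient (AllReduce-SUM / world_size)."""
    world_size = len(per_rank_grads)
    avg = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(avg) for _ in range(world_size)]  # every rank gets a copy

# Two GPUs computed different local gradients for the same parameter:
ranks = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])
# Both ranks now hold the identical average [2.0, 4.0].
```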
For a model with P parameters, a ring AllReduce moves roughly 2× the gradient buffer over the network (one reduce-scatter pass plus one all-gather pass). With fp16 gradients (2 bytes each), that's ~4P bytes — for a 7B model, ~28 GB of traffic per step!
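The arithmetic, as a quick back-of-the-envelope helper (assumes fp16 gradients and the ~2× ring-AllReduce transfer factor; exact traffic is 2(N−1)/N × buffer size for N GPUs):

```python
def ring_allreduce_bytes(num_params, bytes_per_grad=2):
    """Approximate per-GPU traffic of one ring AllReduce.
    reduce-scatter and all-gather each move ~1x the gradient buffer,
    so the total is ~2x. Default assumes fp16 gradients (2 bytes)."""
    buffer_bytes = num_params * bytes_per_grad
    return 2 * buffer_bytes

gb = ring_allreduce_bytes(7_000_000_000) / 1e9
print(f"~{gb:.0f} GB per step")  # ~28 GB for a 7B model
```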
The question is: when does this communication happen? That's what makes all the difference.
Naive DP: compute ALL gradients first, THEN communicate ALL of them. The GPU sits idle during communication.
DDP overlap: start communicating gradients as soon as they're ready, while backward is still running. Communication is hidden behind computation.
The simplest approach: finish ALL computation, then synchronize ALL gradients. Simple, but wasteful.
The key insight: no DDP wrapper, no hooks. We run loss.backward() completely, then manually loop over every parameter and call dist.all_reduce(). This ensures zero overlap.
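A minimal sketch of this naive strategy (assumes torch.distributed is already initialized, e.g. by torchrun; shown here with a single-process gloo group so it runs on CPU):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def naive_dp_backward(model, loss):
    # 1. Run backward COMPLETELY -- compute every gradient first.
    loss.backward()
    # 2. Only THEN synchronize: one all_reduce per parameter.
    #    Compute sits idle for this entire loop -- zero overlap.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(dist.get_world_size())

# Single-process group so the sketch is runnable without GPUs;
# real training launches one process per GPU via torchrun.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
model = nn.Linear(8, 4)
naive_dp_backward(model, model(torch.randn(2, 8)).sum())
dist.destroy_process_group()
```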
A common pitfall: using register_hook() on parameters to implement "naive" DP. Hooks fire during backward as each gradient is computed — that's already overlapped! For truly sequential behavior, you must manually all-reduce after backward completes.
PyTorch's DistributedDataParallel groups parameters into buckets and fires AllReduce on each bucket as soon as its gradients are ready — while backward is still running.
DDP doesn't send each gradient individually (too many small messages = slow). Instead, it groups parameters into buckets of ~25MB each. When ALL gradients in a bucket are computed, the entire bucket is AllReduced at once.
Smaller buckets = more overlap opportunities (more frequent AllReduce calls), but each call has some fixed overhead.
In a neural network, backward runs last layer to first. DDP buckets parameters in reverse order — so the last layers' parameters are in Bucket 0 (the first to be ready).
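A pure-Python sketch of this bucket-assignment logic (sizes in MB; the real DDP implementation is more involved — for example, its first bucket defaults to a smaller cap):

```python
def assign_buckets(param_sizes_mb, cap_mb=25):
    """Greedily pack parameters into buckets of ~cap_mb, walking the
    parameter list in REVERSE registration order -- so the last layers,
    whose grads are ready first during backward, land in bucket 0."""
    buckets, current, current_size = [], [], 0.0
    for idx in reversed(range(len(param_sizes_mb))):
        current.append(idx)
        current_size += param_sizes_mb[idx]
        if current_size >= cap_mb:
            buckets.append(current)
            current, current_size = [], 0.0
    if current:
        buckets.append(current)
    return buckets

# Six 10 MB parameters, registered first-layer-to-last (indices 0..5):
buckets = assign_buckets([10] * 6, cap_mb=25)
# Bucket 0 holds the LAST parameters (5, 4, 3) -- the first ready in backward.
```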
DDP uses autograd hooks internally. When a parameter's gradient is computed, the hook checks if its bucket is full. If so, it kicks off AllReduce for that bucket on the NCCL stream — all while backward continues on the compute stream.
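You can observe the "gradients become ready last-layer-first" order yourself with a plain register_hook on each parameter — this mimics what DDP's internal hooks see, and runs on CPU:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
ready_order = []
for name, p in model.named_parameters():
    # The hook fires during backward, the moment this parameter's
    # gradient is computed (returning None leaves the grad unchanged).
    p.register_hook(lambda grad, name=name: ready_order.append(name))

model(torch.randn(2, 4)).sum().backward()
print(ready_order)  # layer 2's parameters first, layer 0's last
```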
See the difference between naive and overlapped gradient synchronization, layer by layer, in real time.
We profiled both strategies on 2 GPUs with a 40-layer MLP (4096 hidden). Here's how to read the traces.
The CPU thread executing your Python training loop. Shows record_function spans like "forward", "backward", "step_N".
The CUDA compute stream. Shows matmul, relu, and gradient kernels. Look for gaps — those are idle time.
The NCCL communication stream. Shows all_reduce kernels. In DDP, these overlap with stream 7.
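A minimal, CPU-only sketch of how such a trace is produced (the stream rows above come from profiling real CUDA runs; here we only label the spans):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Linear(256, 256)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for step in range(3):
        with record_function(f"step_{step}"):       # named span on the CPU row
            with record_function("forward"):
                loss = model(torch.randn(32, 256)).sum()
            with record_function("backward"):
                loss.backward()
            opt.step()
            opt.zero_grad()

# prof.export_chrome_trace("trace.json")  # open in Perfetto / chrome://tracing
names = {e.key for e in prof.key_averages()}
```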
How does bucket_cap_mb affect the overlap? Drag the slider to see.
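bucket_cap_mb is set where you wrap the model. A sketch using a single-process gloo group so it runs on CPU (real runs use NCCL with one process per GPU, launched by torchrun):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])
# Smaller buckets -> AllReduce fires earlier and more often during backward;
# larger buckets -> fewer, bigger calls. The default is 25 MB.
ddp_model = DDP(model, bucket_cap_mb=10)
loss = ddp_model(torch.randn(4, 1024)).sum()
loss.backward()  # hooks fire per bucket; with world_size=1 the reduce is a no-op
dist.destroy_process_group()
```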
A complete breakdown of both strategies across every dimension that matters.
| Dimension | Naive DP | DDP Overlap |
|---|---|---|
| Communication timing | After ALL backward completes | During backward (per bucket) |
| GPU utilization | ~65% (idle during AllReduce) | ~95% (always computing) |
| Communication overhead | ~35% of step time | ~0% (hidden behind compute) |
| Number of AllReduce calls | 1 per parameter (~80 calls) | 1 per bucket (~4-8 calls) |
| Implementation | Manual loop after backward | DDP() wrapper — automatic |
| Code complexity | More code, but explicit | Simpler — 1 line wrapper |
| Gradient correctness | Mathematically identical | Mathematically identical |
| Tuning knob | None | bucket_cap_mb (default 25MB) |
| Profiler signature | Sequential blocks on streams | Interleaved blocks on streams |
How much time does overlap save? It depends on the compute-to-communication ratio.
The total data transferred is the same in both strategies. The difference is when communication happens — sequential vs overlapped with computation.
DDP groups parameters into ~25MB buckets in reverse layer order. As each bucket's gradients finish computing, AllReduce starts immediately on the NCCL stream.
Modern GPUs run compute and communication on separate streams simultaneously. The profiler shows this as parallel activity on stream 7 (compute) and stream 16 (NCCL).
A common mistake: using register_hook() for "naive" DP. Hooks fire during backward — that's already overlapped! Truly naive DP requires manual AllReduce after backward.
Overlap is most effective when backward computation takes longer than communication. For tiny models with fast backward, the savings are smaller.
Use PyTorch Profiler + TensorBoard to visually confirm overlap. Look for interleaved blocks on the compute and NCCL streams — if they're sequential, something is wrong.