Data Parallelism — Deep Dive

Gradient Synchronization

Why does PyTorch DDP train faster than naive data parallelism? The secret is overlapping communication with computation. Let's see exactly how it works, step by step.

- 2 GPUs: data parallel workers
- ~35%: communication overhead (naive)
- ~0%: visible overhead (DDP overlap)
- Buckets: the key DDP mechanism

Why Synchronize Gradients?

In data parallelism, each GPU processes a different mini-batch but must end up with the same updated weights. This requires an AllReduce on every gradient.

Data Parallelism Basics

Each GPU holds a full copy of the model. The training batch is split across GPUs — each computes forward + backward on its local shard. But before the optimizer can step, all GPUs must agree on the same gradients.

Gradient averaging (AllReduce): g_avg = (1/N) · Σ_{i=1}^{N} g_i
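A quick sanity check of this formula in plain Python (no frameworks, toy scalar regression of my own construction): averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient exactly.

```python
# Model: scalar linear regression, loss = mean squared error over a shard.
def grad(w, shard):
    # d/dw [ (1/n) * sum (w*x - y)^2 ] = (2/n) * sum (w*x - y) * x
    n = len(shard)
    return (2.0 / n) * sum((w * x - y) * x for x, y in shard)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# "AllReduce": each of N workers computes grad on its shard, then average.
shards = [data[:2], data[2:]]                      # 2 workers, equal shards
g_avg = sum(grad(w, s) for s in shards) / len(shards)

g_full = grad(w, data)                             # single-GPU full batch
assert abs(g_avg - g_full) < 1e-12                 # identical gradients
```

This is why the averaged-gradient update on every GPU is mathematically identical to a single-GPU step on the full batch.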

The Cost of Communication

For a model with P parameters, each AllReduce must move roughly twice the size of the gradient buffer across the network (reduce-scatter + all-gather). For a 7B model with fp16 gradients (a 14 GB buffer), that's ~28 GB of traffic per step!
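A back-of-envelope sketch of that arithmetic (the fp16 and ring-AllReduce assumptions here are mine):

```python
# Rough AllReduce traffic per step, assuming fp16 gradients (2 bytes/param)
# and a ring AllReduce that moves ~2x the gradient buffer per GPU
# (reduce-scatter pass + all-gather pass).
params = 7e9                          # 7B-parameter model
grad_bytes = params * 2               # fp16 gradient buffer: 14 GB
traffic_gb = 2 * grad_bytes / 1e9     # reduce-scatter + all-gather
print(traffic_gb)                     # -> 28.0
```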

The question is: when does this communication happen? That's what makes all the difference.

Naive DP: Sequential

Compute ALL gradients first, THEN communicate ALL of them. The GPU sits idle during communication.

forward → backward → AllReduce → optim

DDP: Overlapped

Start communicating gradients as soon as they're ready, while backward is still running. Communication is hidden behind computation.

forward → backward + AllReduce → optim

Naive Data Parallelism

The simplest approach: finish ALL computation, then synchronize ALL gradients. Simple, but wasteful.

Naive DP Timeline — One Training Step
Each row is a GPU stream. Notice the gap where compute is idle during AllReduce.
CPU:          forward → backward → sync → allreduce_ALL_GRADS → sync → optim → sync
GPU Compute:  matmul, relu → grad kernels → IDLE (waiting for NCCL) → SGD
NCCL Stream:  idle (no communication) → all_reduce × 80 params
The problem: The GPU compute stream is completely idle during the AllReduce phase. For a 40-layer model with 4096 hidden size, this communication takes ~35% of the total step time. That's 35% of your expensive GPU wasted doing nothing!

How Naive DP is Implemented

The key insight: no DDP wrapper, no hooks. We run loss.backward() completely, then manually loop over every parameter and call dist.all_reduce(). This ensures zero overlap.

import torch
import torch.distributed as dist

# Step 1: Pure backward — NO communication happens here
loss.backward()
torch.cuda.synchronize()  # Wait for backward to fully finish

# Step 2: NOW communicate — GPU compute is idle during this
for p in model.parameters():
    if p.grad is not None:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
Common mistake: Using register_hook() on parameters for "naive" DP. Hooks fire during backward as each gradient is computed — that's already overlapped! For truly sequential behavior, you must manually all-reduce after backward completes.

DDP with Overlap

PyTorch's DistributedDataParallel groups parameters into buckets and fires AllReduce on each bucket as soon as its gradients are ready — while backward is still running.

What Are Buckets?

DDP doesn't send each gradient individually (too many small messages = slow). Instead, it groups parameters into buckets of ~25MB each. When ALL gradients in a bucket are computed, the entire bucket is AllReduced at once.

Bucket formation: Bucket = {param_k, param_{k+1}, …} where Σ sizes ≤ bucket_cap_mb

Smaller buckets = more overlap opportunities (more frequent AllReduce calls), but each call has some fixed overhead.

Why Backward Order Matters

In a neural network, backward runs last layer to first. DDP buckets parameters in reverse order — so the last layers' parameters are in Bucket 0 (the first to be ready).
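A sketch of that bucketing rule (a hypothetical helper, not PyTorch's actual implementation): walk the parameter list in reverse registration order and greedily pack indices into buckets under the cap.

```python
def build_buckets(param_sizes_mb, cap_mb):
    """Greedily pack parameter indices into buckets of at most cap_mb,
    walking in REVERSE order: last layers first, since their gradients
    are the first to be ready during backward."""
    buckets, current, current_mb = [], [], 0
    for idx in reversed(range(len(param_sizes_mb))):  # last layer first
        size = param_sizes_mb[idx]
        if current and current_mb + size > cap_mb:
            buckets.append(current)                   # close the full bucket
            current, current_mb = [], 0
        current.append(idx)
        current_mb += size
    if current:
        buckets.append(current)
    return buckets

# 10 parameters of 4 MB each, 10 MB cap -> pairs, last parameters first:
print(build_buckets([4] * 10, 10))
# -> [[9, 8], [7, 6], [5, 4], [3, 2], [1, 0]]
```

Note that Bucket 0 holds the highest-indexed (last) parameters, matching the reverse-order behavior described above.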

DDP Overlap Timeline — One Training Step
Notice how AllReduce calls overlap with backward computation. The GPU is never idle!
CPU:          forward → backward_WITH_OVERLAP → sync → optim
GPU Compute:  matmul → grads L40-31 → L30-21 → L20-11 → L10-1 → SGD
NCCL Stream:  B0 → B1 → B2 → B3
The magic: While the GPU computes gradients for layers 20-11, the NCCL stream is simultaneously AllReducing the already-finished Bucket 0 (layers 40-31). Communication is hidden behind computation — effectively free!

How DDP Overlap is Implemented

DDP uses autograd hooks internally. When a parameter's gradient is computed, the hook checks if its bucket is full. If so, it kicks off AllReduce for that bucket on the NCCL stream — all while backward continues on the compute stream.

from torch.nn.parallel import DistributedDataParallel as DDP

# DDP wrapper handles everything automatically
ddp = DDP(model, device_ids=[local_rank],
          bucket_cap_mb=5,               # Small buckets = more overlap
          gradient_as_bucket_view=True)  # Grads alias bucket storage, saving a copy

# Just call backward — DDP fires AllReduce automatically
loss.backward()  # Compute + communicate SIMULTANEOUSLY

# By the time backward finishes, most AllReduces are done too!
opt.step()
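To make the hook-and-bucket mechanism concrete, here is an event-level simulation in plain Python (no PyTorch; a toy cap of 3 parameters per bucket stands in for the real byte-based cap):

```python
# As each gradient becomes ready (last parameter first), a hook adds it to
# the current bucket; once the bucket is full, an async AllReduce is
# "launched" while backward keeps producing the remaining gradients.

launched = []          # order in which buckets get their AllReduce kicked off
bucket, cap = [], 3    # toy cap: 3 params per bucket

def grad_ready_hook(param_idx):
    bucket.append(param_idx)
    if len(bucket) == cap:
        # In real DDP this would be an async all_reduce on the NCCL stream.
        launched.append(tuple(bucket))
        bucket.clear()

for p in reversed(range(9)):   # backward yields grads from last param to first
    grad_ready_hook(p)

print(launched)   # -> [(8, 7, 6), (5, 4, 3), (2, 1, 0)]
```

The last parameters (highest indices) land in the first launched bucket, exactly the reverse-order behavior the timelines show.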

Watch It Happen

See the difference between naive and overlapped gradient synchronization, layer by layer, in real time.

(Interactive animation: each layer's backward and AllReduce, plus total step time, compute utilization, and communication hidden.)

Reading the PyTorch Profiler

We profiled both strategies on 2 GPUs with a 40-layer MLP (4096 hidden). Here's how to read the traces.

Anatomy of a Profiler Trace

pt_main_thread (CPU thread): the CPU thread executing your Python training loop. Shows record_function spans like "forward", "backward", "step_N".

GPU 0, stream 7 (compute): the CUDA compute stream. Shows matmul, relu, and gradient kernels. Look for gaps — those are idle time.

NCCL, stream 16 (comms): the NCCL communication stream. Shows all_reduce kernels. In DDP, these overlap with stream 7.

Naive DP Trace — What You See
From the PyTorch Profiler TensorBoard trace. Notice the clear sequential blocks.
Main Thread:   fwd → backward → sync → allreduce → sync (repeats each step)
GPU stream 7:  compute kernels → GPU idle → opt
GPU stream 16: nccl:AllReduce
DDP Overlap Trace — What You See
Notice how stream 7 (compute) and stream 16 (NCCL) are active at the same time.
Main Thread:   forward → backward_WITH_OVERLAP → sync (repeats each step)
GPU stream 7:  compute kernels (continuous) → opt
GPU stream 16: B0 → B1 → B2 → B3
Key observation from the traces: In the naive trace, there's a clear gap in GPU stream 7 where the compute SM is idle while NCCL runs on stream 16. In the DDP trace, both streams are active simultaneously — the AllReduce buckets (B0, B1, B2, B3) overlap with backward compute kernels. The overall step is shorter because communication time is hidden.

Bucket Size Explorer

How does bucket_cap_mb affect the overlap? Drag the slider to see.

Model: 40 layers × 4096 hidden = 2.7 GB of total gradients (fp32).
Trade-off: Smaller buckets = more overlap opportunities, but each AllReduce call has fixed latency overhead (~10-50 µs). Too small = overhead dominates. Default in PyTorch is 25 MB, which is a good balance. The notebook uses 5 MB to make the overlap more visible in traces.
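That trade-off can be sketched with a toy cost model (the latency and bandwidth numbers below are illustrative assumptions, not measurements):

```python
import math

def allreduce_ms(total_bytes, bucket_bytes, call_latency_ms=0.03, bw_bytes_s=25e9):
    # Each AllReduce call pays a fixed launch latency; the payload itself
    # moves at the (assumed) link bandwidth regardless of how it is chunked.
    n_calls = math.ceil(total_bytes / bucket_bytes)
    return n_calls * call_latency_ms + total_bytes / bw_bytes_s * 1e3

total = 2.7e9                               # 2.7 GB of gradients
big = allreduce_ms(total, 25 * 2**20)       # 25 MB buckets
tiny = allreduce_ms(total, 64 * 2**10)      # 64 KB buckets: latency dominates
print(round(big), round(tiny))              # tiny buckets are ~12x slower here
```

With these assumed numbers, 25 MB buckets spend ~3 ms on call latency versus ~108 ms of transfer, while 64 KB buckets rack up over a second of pure launch overhead.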

Naive vs DDP Comparison

A complete breakdown of both strategies across every dimension that matters.

Dimension                 | Naive DP                      | DDP Overlap
Communication timing      | After ALL backward completes  | During backward (per bucket)
GPU utilization           | ~65% (idle during AllReduce)  | ~95% (always computing)
Communication overhead    | ~35% of step time             | ~0% (hidden behind compute)
AllReduce calls per step  | 1 per parameter (~80 calls)   | 1 per bucket (~4-8 calls)
Implementation            | Manual loop after backward    | DDP() wrapper, automatic
Code complexity           | More code, but explicit       | Simpler: 1-line wrapper
Gradient correctness      | Mathematically identical      | Mathematically identical
Tuning knob               | None                          | bucket_cap_mb (default 25 MB)
Profiler signature        | Sequential blocks on streams  | Interleaved blocks on streams

Time Savings Calculator

How much time does overlap save? It depends on the compute-to-communication ratio.

Example: compute = 100 ms, communication = 35 ms.
Naive DP step: 100 + 35 = 135 ms. DDP overlap step: 100 ms (communication fully hidden). Saved: ~26%.
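The calculator's rule is simple enough to state as code (an idealized model of my own: perfect overlap, optimizer time ignored):

```python
def step_times_ms(compute_ms, comm_ms):
    # Naive: compute, then communicate, strictly in sequence.
    naive = compute_ms + comm_ms
    # DDP: comm hides under backward; only the excess beyond compute is exposed,
    # i.e. the step takes max(compute, comm).
    ddp = compute_ms + max(0, comm_ms - compute_ms)
    return naive, ddp

naive, ddp = step_times_ms(100, 35)
print(naive, ddp, round(100 * (1 - ddp / naive)))   # -> 135 100 26
```

This also shows the limit of overlap: once communication exceeds compute, the extra communication time is exposed again no matter how the buckets are scheduled.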

Key Takeaways

1. Overlap is the key insight. The total data transferred is the same in both strategies. The difference is when communication happens — sequential vs overlapped with computation.

2. Buckets enable overlap. DDP groups parameters into ~25 MB buckets in reverse layer order. As each bucket's gradients finish computing, AllReduce starts immediately on the NCCL stream.

3. Two streams, one GPU. Modern GPUs run compute and communication on separate streams simultaneously. The profiler shows this as parallel activity on stream 7 (compute) and stream 16 (NCCL).

4. Hooks ≠ naive DP. A common mistake: using register_hook() for "naive" DP. Hooks fire during backward — that's already overlapped! Truly naive DP requires manual AllReduce after backward.

5. Works best when compute > comm. Overlap is most effective when backward computation takes longer than communication. For tiny models with fast backward, the savings are smaller.

6. Profile to verify. Use PyTorch Profiler + TensorBoard to visually confirm overlap. Look for interleaved blocks on the compute and NCCL streams — if they're sequential, something is wrong.