Why does PyTorch DDP train faster than naive data parallelism? The secret is overlapping communication with computation. Let's see exactly how it works, step by step.
In data parallelism, each GPU processes a different shard of the batch but must end up with the same updated weights. This requires an AllReduce over every gradient.
Each GPU holds a full copy of the model. The training batch is split across GPUs — each computes forward + backward on its local shard. But before the optimizer can step, all GPUs must agree on the same gradients.
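To make "agree on the same gradients" concrete, here is a tiny pure-Python sketch of what an averaging AllReduce does (the helper name is made up for illustration; real training calls dist.all_reduce on CUDA tensors):

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients elementwise across ranks. After the call,
    every rank holds the same averaged gradient (AllReduce-SUM / world_size)."""
    world_size = len(per_rank_grads)
    avg = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(avg) for _ in range(world_size)]  # every rank gets a copy

# Two GPUs computed different local gradients for the same parameter:
ranks = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])
# Both ranks now hold the identical average [2.0, 4.0].
```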
For a model with P parameters, a ring AllReduce moves roughly 2× the gradient buffer over the network (one reduce-scatter pass plus one all-gather pass). With fp16 gradients (2 bytes each), that's ~4P bytes — for a 7B model, ~28 GB of traffic per step!
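The arithmetic, as a quick back-of-the-envelope helper (assumes fp16 gradients and the ~2× ring-AllReduce transfer factor; exact traffic is 2(N−1)/N × buffer size for N GPUs):

```python
def ring_allreduce_bytes(num_params, bytes_per_grad=2):
    """Approximate per-GPU traffic of one ring AllReduce.
    reduce-scatter and all-gather each move ~1x the gradient buffer,
    so the total is ~2x. Default assumes fp16 gradients (2 bytes)."""
    buffer_bytes = num_params * bytes_per_grad
    return 2 * buffer_bytes

gb = ring_allreduce_bytes(7_000_000_000) / 1e9
print(f"~{gb:.0f} GB per step")  # ~28 GB for a 7B model
```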
The question is: when does this communication happen? That's what makes all the difference.
Naive DP: compute ALL gradients first, THEN communicate ALL of them. The GPU sits idle during communication.
DDP overlap: start communicating gradients as soon as they're ready, while backward is still running. Communication is hidden behind computation.
The simplest approach: finish ALL computation, then synchronize ALL gradients. Simple, but wasteful.
The key insight: no DDP wrapper, no hooks. We run loss.backward() completely, then manually loop over every parameter and call dist.all_reduce(). This ensures zero overlap.
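A minimal sketch of this naive strategy (assumes torch.distributed is already initialized, e.g. by torchrun; shown here with a single-process gloo group so it runs on CPU):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def naive_dp_backward(model, loss):
    # 1. Run backward COMPLETELY -- compute every gradient first.
    loss.backward()
    # 2. Only THEN synchronize: one all_reduce per parameter.
    #    Compute sits idle for this entire loop -- zero overlap.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(dist.get_world_size())

# Single-process group so the sketch is runnable without GPUs;
# real training launches one process per GPU via torchrun.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
model = nn.Linear(8, 4)
naive_dp_backward(model, model(torch.randn(2, 8)).sum())
dist.destroy_process_group()
```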
A common pitfall: using register_hook() on parameters to implement "naive" DP. Hooks fire during backward as each gradient is computed — that's already overlapped! For truly sequential behavior, you must manually all-reduce after backward completes.
PyTorch's DistributedDataParallel groups parameters into buckets and fires AllReduce on each bucket as soon as its gradients are ready — while backward is still running.
DDP doesn't send each gradient individually (too many small messages = slow). Instead, it groups parameters into buckets of ~25MB each. When ALL gradients in a bucket are computed, the entire bucket is AllReduced at once.
Smaller buckets = more overlap opportunities (more frequent AllReduce calls), but each call has some fixed overhead.
In a neural network, backward runs last layer to first. DDP buckets parameters in reverse order — so the last layers' parameters are in Bucket 0 (the first to be ready).
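A pure-Python sketch of this bucket-assignment logic (sizes in MB; the real DDP implementation is more involved — for example, its first bucket defaults to a smaller cap):

```python
def assign_buckets(param_sizes_mb, cap_mb=25):
    """Greedily pack parameters into buckets of ~cap_mb, walking the
    parameter list in REVERSE registration order -- so the last layers,
    whose grads are ready first during backward, land in bucket 0."""
    buckets, current, current_size = [], [], 0.0
    for idx in reversed(range(len(param_sizes_mb))):
        current.append(idx)
        current_size += param_sizes_mb[idx]
        if current_size >= cap_mb:
            buckets.append(current)
            current, current_size = [], 0.0
    if current:
        buckets.append(current)
    return buckets

# Six 10 MB parameters, registered first-layer-to-last (indices 0..5):
buckets = assign_buckets([10] * 6, cap_mb=25)
# Bucket 0 holds the LAST parameters (5, 4, 3) -- the first ready in backward.
```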
DDP uses autograd hooks internally. When a parameter's gradient is computed, the hook checks if its bucket is full. If so, it kicks off AllReduce for that bucket on the NCCL stream — all while backward continues on the compute stream.
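You can observe the "gradients become ready last-layer-first" order yourself with a plain register_hook on each parameter — this mimics what DDP's internal hooks see, and runs on CPU:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
ready_order = []
for name, p in model.named_parameters():
    # The hook fires during backward, the moment this parameter's
    # gradient is computed (returning None leaves the grad unchanged).
    p.register_hook(lambda grad, name=name: ready_order.append(name))

model(torch.randn(2, 4)).sum().backward()
print(ready_order)  # layer 2's parameters first, layer 0's last
```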
See the difference between naive and overlapped gradient synchronization, layer by layer, in real time.
We profiled both strategies on 2 GPUs with a 40-layer MLP (4096 hidden). Here's how to read the traces.
The CPU thread executing your Python training loop. Shows record_function spans like "forward", "backward", "step_N".
The CUDA compute stream. Shows matmul, relu, and gradient kernels. Look for gaps — those are idle time.
The NCCL communication stream. Shows all_reduce kernels. In DDP, these overlap with stream 7.
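A minimal, CPU-only sketch of how such a trace is produced (the stream rows above come from profiling real CUDA runs; here we only label the spans):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Linear(256, 256)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for step in range(3):
        with record_function(f"step_{step}"):       # named span on the CPU row
            with record_function("forward"):
                loss = model(torch.randn(32, 256)).sum()
            with record_function("backward"):
                loss.backward()
            opt.step()
            opt.zero_grad()

# prof.export_chrome_trace("trace.json")  # open in Perfetto / chrome://tracing
names = {e.key for e in prof.key_averages()}
```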
How does bucket_cap_mb affect the overlap? Drag the slider to see.
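bucket_cap_mb is set where you wrap the model. A sketch using a single-process gloo group so it runs on CPU (real runs use NCCL with one process per GPU, launched by torchrun):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])
# Smaller buckets -> AllReduce fires earlier and more often during backward;
# larger buckets -> fewer, bigger calls. The default is 25 MB.
ddp_model = DDP(model, bucket_cap_mb=10)
loss = ddp_model(torch.randn(4, 1024)).sum()
loss.backward()  # hooks fire per bucket; with world_size=1 the reduce is a no-op
dist.destroy_process_group()
```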
A complete breakdown of both strategies across every dimension that matters.
| Dimension | Naive DP | DDP Overlap |
|---|---|---|
| Communication timing | After ALL backward completes | During backward (per bucket) |
| GPU utilization | ~65% (idle during AllReduce) | ~95% (always computing) |
| Communication overhead | ~35% of step time | ~0% (hidden behind compute) |
| Number of AllReduce calls | 1 per parameter (~80 calls) | 1 per bucket (~4-8 calls) |
| Implementation | Manual loop after backward | DDP() wrapper — automatic |
| Code complexity | More code, but explicit | Simpler — 1 line wrapper |
| Gradient correctness | Mathematically identical | Mathematically identical |
| Tuning knob | None | bucket_cap_mb (default 25MB) |
| Profiler signature | Sequential blocks on streams | Interleaved blocks on streams |
How much time does overlap save? It depends on the compute-to-communication ratio.
The total data transferred is the same in both strategies. The difference is when communication happens — sequential vs overlapped with computation.
DDP groups parameters into ~25MB buckets in reverse layer order. As each bucket's gradients finish computing, AllReduce starts immediately on the NCCL stream.
Modern GPUs run compute and communication on separate streams simultaneously. The profiler shows this as parallel activity on stream 7 (compute) and stream 16 (NCCL).
A common mistake: using register_hook() for "naive" DP. Hooks fire during backward — that's already overlapped! Truly naive DP requires manual AllReduce after backward.
Overlap is most effective when backward computation takes longer than communication. For tiny models with fast backward, the savings are smaller.
Use PyTorch Profiler + TensorBoard to visually confirm overlap. Look for interleaved blocks on the compute and NCCL streams — if they're sequential, something is wrong.