From interleaved stages that shrink the bubble by a factor of v, to zero-bubble schedules that decompose backward passes into B and W operations, to DualPipe's dual-stream design used in DeepSeek-V3: this is the frontier of pipeline parallelism.
1F1B improved memory by limiting in-flight micro-batches, but the pipeline bubble remained stubbornly large.
The bubble fraction with standard 1F1B is (p-1)/m where p = number of pipeline stages and m = number of micro-batches.
1F1B limits peak activations to just p micro-batches in flight (vs. m in GPipe), but the idle time stays the same.
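The bubble fraction formula above is easy to check numerically. A minimal sketch (the function name is illustrative):

```python
def bubble_fraction_1f1b(p: int, m: int) -> float:
    """Idle fraction of a standard 1F1B pipeline: the (p - 1)
    warmup/cooldown slots amortized over m micro-batches."""
    return (p - 1) / m

# With 8 stages and 32 micro-batches, ~21.9% of compute time is bubble.
print(bubble_fraction_1f1b(8, 32))  # 0.21875
```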
Can we do better? Yes: by slicing the model differently across GPUs and being clever about backward pass decomposition, we can push the bubble toward zero.
Increasing m reduces the bubble, but more micro-batches in flight means more activation memory.
Is there another knob we can turn? Enter interleaved stages.
Instead of assigning consecutive layers to each GPU, spread them out.
Each GPU holds v non-contiguous chunks, creating a "looping pipeline."
With v=1, a micro-batch goes GPU0 → GPU1 → GPU2 → GPU3.
With v=2, it goes GPU0 → GPU1 → GPU2 → GPU3 → GPU0 → GPU1 → GPU2 → GPU3.
Each pass through a GPU is faster (fewer layers), enabling better interleaving.
Each forward and backward pass on a GPU takes 1/v the time of the standard case. This means the "warmup" and "cooldown" phases of the pipeline are v times shorter, directly shrinking the bubble.
Since each micro-batch visits every GPU v times instead of once, the number of point-to-point communications increases by a factor of v.
Standard: 2(p-1) sends per micro-batch → Interleaved: 2v(p-1) sends per micro-batch
Explore how pipeline stages (p), micro-batches (m), and interleaved chunks (v) affect the bubble. Drag the sliders to see the formula in action.
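The two formulas the explorer visualizes, the bubble fraction (p-1)/(v·m) and the communication count 2v(p-1), can be sketched directly (function names are illustrative):

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Bubble fraction with v interleaved chunks per GPU: (p-1)/(v*m).
    v=1 recovers standard 1F1B."""
    return (p - 1) / (v * m)

def p2p_sends_per_microbatch(p: int, v: int = 1) -> int:
    """Point-to-point sends per micro-batch (forward + backward): 2v(p-1)."""
    return 2 * v * (p - 1)

# Doubling v halves the bubble but doubles the sends.
for v in (1, 2, 4):
    print(v, bubble_fraction(8, 32, v), p2p_sends_per_microbatch(8, v))
```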
With interleaved stages, a GPU must decide: process later layers of earlier micro-batches (depth-first), or earlier layers of later micro-batches (breadth-first)?
Depth-first priority: get each micro-batch through the full model as fast as possible.
Breadth-first priority: fill the pipeline as quickly as possible with many micro-batches.
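The two policies produce different orderings of the same work items. A toy sketch of the forward-work order on a single GPU (the scheduling policy names and function are illustrative, not from any real framework):

```python
def schedule(m: int, v: int, policy: str):
    """Order forward work items (micro_batch, chunk) on one GPU.
    depth-first: push each micro-batch through all v chunks before the next;
    breadth-first: run chunk 0 for every micro-batch before touching chunk 1."""
    items = [(mb, c) for mb in range(m) for c in range(v)]
    if policy == "depth":
        key = lambda item: (item[0], item[1])   # micro-batch first
    else:
        key = lambda item: (item[1], item[0])   # chunk (layer group) first
    return sorted(items, key=key)

print(schedule(3, 2, "depth"))    # [(0,0),(0,1),(1,0),(1,1),(2,0),(2,1)]
print(schedule(3, 2, "breadth"))  # [(0,0),(1,0),(2,0),(0,1),(1,1),(2,1)]
```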
The key insight: the backward pass through a matrix multiplication involves two separate operations that can be scheduled independently.
Computes gradients with respect to the inputs of a layer. This is needed by the preceding stage to continue its backward pass.
Computes gradients with respect to the weights of a layer. This is only needed before the optimizer step, not by other stages.
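For a linear layer Y = XW, the two backward operations are just two independent matrix multiplications; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # layer input (saved activation)
W = rng.standard_normal((8, 3))   # layer weights
dY = rng.standard_normal((4, 3))  # gradient arriving from the next stage

# B: input gradient. Must be produced promptly, because the
# preceding stage is blocked on it to continue its backward pass.
dX = dY @ W.T

# W: weight gradient. Only needed before the optimizer step, so this
# matmul can be deferred and scheduled into pipeline bubbles.
dW = X.T @ dY

assert dX.shape == X.shape and dW.shape == W.shape
```

Note that neither operation depends on the other's output, which is exactly what gives the scheduler freedom to separate them.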
In practice, fully optimizing these fine-grained schedules means solving an Integer Linear Programming problem; real systems fall back on heuristics.
DeepSeek-V3/R1's pipeline schedule: two micro-batch streams propagating from both ends of the pipeline, interleaved to minimize idle time.
One stream of micro-batches enters from GPU 0 (left), another from GPU p-1 (right). They meet in the middle, maximizing utilization.
Like Zero Bubble, DualPipe splits backward into B (input grad) and W (weight grad), gaining scheduling flexibility.
DeepSeek-V3 achieved "near-zero all-to-all communication overhead" using this schedule with 671B parameters.
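To build intuition for the dual-stream idea, here is a toy, forward-only occupancy sketch (entirely illustrative; real DualPipe also interleaves backward B and W work and overlaps communication):

```python
def dualpipe_forward_occupancy(p: int, m: int):
    """Toy sketch: stream A micro-batches enter at stage 0, stream B
    at stage p-1, each advancing one stage per tick. Returns, per tick,
    the set of (stream, stage) slots that are busy."""
    ticks = m + p - 1
    grid = []
    for t in range(ticks):
        busy = set()
        for i in range(m):                 # stream A, left-to-right
            s = t - i
            if 0 <= s < p:
                busy.add(("A", s))
        for j in range(m):                 # stream B, right-to-left
            s = p - 1 - (t - j)
            if t - j >= 0 and 0 <= s < p:
                busy.add(("B", s))
        grid.append(busy)
    return grid

# At tick 0 both ends are already working; by tick p-1 the streams meet.
print(dualpipe_forward_occupancy(4, 4)[0])
```

The point of the sketch is that the pipeline fills from both ends at once, so no GPU sits idle waiting for work to arrive from a single entry point.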
How do all the advanced pipeline schedules stack up against each other?
| Property | Interleaved 1F1B | Zero Bubble (ZB-H2) | DualPipe |
|---|---|---|---|
| Bubble Fraction | (p-1)/(v*m) | ~0% (theoretical) | ~0% (practical) |
| Key Technique | v model chunks per GPU | B/W decomposition | Dual streams + B/W decomposition |
| Communication | v× more p2p sends | Same as 1F1B | Near-zero overhead |
| Scheduling Complexity | Moderate | High (ILP solver) | Very High |
| Memory Overhead | Same as 1F1B | Slightly higher (deferred W) | Higher (dual streams) |
| Backward Granularity | Coarse (full backward) | Fine (B + W split) | Fine (B + W split) |
| Used In | Llama 3.1 (Meta) | Sea AI Lab research | DeepSeek-V3/R1 |
| Year Introduced | 2021 (Megatron-LM) | 2023 (Sea AI Lab) | 2024 (DeepSeek) |
By assigning v non-contiguous model chunks to each GPU, the bubble shrinks by factor v. Trade-off: v× more communication.
Splitting backward into input-gradient (B) and weight-gradient (W) lets us schedule W to fill bubbles. This is the key to zero-bubble schedules.
DeepSeek-V3/R1's DualPipe combines dual streams from both pipeline ends with B/W decomposition, achieving near-zero overhead at 671B parameter scale.
Interleaved schedules must choose between processing micro-batches deep (lower memory) or wide (better utilization). Llama 3.1 makes this tunable.
Finding the mathematically optimal schedule requires solving an Integer Linear Programming problem. In practice, heuristics are used.
Each advancement brings more scheduling complexity. The implementation effort grows from simple 1F1B to the highly engineered DualPipe.